key: cord-314498-zwq67aph authors: van heck, eric; vervest, peter title: smart business networks: concepts and empirical evidence date: 2009-05-15 journal: decis support syst doi: 10.1016/j.dss.2009.05.002 sha: doc_id: 314498 cord_uid: zwq67aph nan organizations are moving, or must move, from today's relatively stable and slow-moving business networks to an open digital platform where business is conducted across a rapidly-formed network with anyone, anywhere, anytime despite different business processes and computer systems. table 1 provides an overview of the characteristics of the traditional and new business network approaches [2] . the disadvantages and associated costs of the more traditional approaches are caused by the inability to provide relative complex, bundled, and fast delivered products and services. the potential of the new business network approach is to create these types of products and services with the help of combining business network insights with telecommunication capabilities. the "business" is no longer a self-contained organization working together with closely coupled partners. it is a participant in a number of networks where it may lead or act together with others. the "network" takes additional layers of meaningfrom the ict infrastructures to the interactions between businesses and individuals. rather than viewing the business as a sequential chain of events (a value chain), actors in a smart business network seek linkages that are novel and different creating remarkable, "better than usual" results. "smart" has a connotation with fashionable and distinguished and also with short-lived: what is smart today will be considered common tomorrow. "smart" is therefore a relative rather than an absolute term. smartness means that the network of co-operating businesses can create "better" results than other, less smart, business networks or other forms of business arrangement. to be "smart in business" is to be smarter than the competitors just as an athlete who is considered fast means is faster than the others. the pivotal question of smart business networks concerns the relationship between the strategy and structure of the business network on one hand and the underlying infrastructure on the other. as new technologies, such as rfid, allow networks of organizations almost complete insight into where its people, materials, suppliers and customers are at any point in time, it is able to organize differently. but if all other players in the network space have that same insight, the result of the interactions may not be competitive. therefore it is necessary to develop a profound understanding about the functioning of these types of business networks and its impact on networked decision making and decision support systems. the key characteristics of a smart business network are that it has the ability to "rapidly pick, plug, and play" to configure rapidly to meet a specific objective, for example, to react to a customer order or an unexpected situation (for example dealing with emergencies) [4] . one might regard a smart business network as an expectant web of participants ready to jump into action (pick) and combine rapidly (plug) to meet the requirements of a specific situation (play). on completion they are dispersed to "rest" while, perhaps, being active in other business networks or more traditional supply chains. 
this combination of "pick, plug, play and disperse" means that the fundamental organizing capabilities for a smart business network are: (1) the ability for quick connect and disconnect with an actor; (2) the selection and execution of business processes across the network; and (3) establishing the decision rules and the embedded logic within the business network. we have organized in june 2006 the second sbni discovery session that attracted both academics and executives to analyze and discover the smartness of business networks [1] . we received 32 submissions and four papers were chosen as the best papers that are suitable for this special issue. the four papers put forward new insights about the concept of smart business networks and also provide empirical evidence about the functioning and outcome of these business networks and its potential impact on networked decision making and decision support systems. the first paper deals with the fundamental organizing ability to "rapidly pick, plug, and play" to configure rapidly to meet a specific objective, in this case to find a solution to stop the outbreak of the severe acute respiratory syndrome (sars) virus. peter van baalen and paul van fenema show how the instantiation of a global crisis network of laboratories around the world cooperated and competed to find out how this deadly virus is working. the second paper deals with the business network as orchestrated by the spanish grupo multiasistencia. javier busquets, juan rodón, and jonathan wareham show how the smart business network approach with embedded business processes lead to substantial business advantages. the paper also shows the importance of information sharing in the business network and the design and set up of the decision support and infrastructure. the third paper focus on how buyer-seller relationships in online markets develop over time e.g. how even in market relationships buyers and sellers connect (to form a contract and legal relationship) and disconnect (by finishing the transaction) and later come back to each other (and form a relationship again). ulad radkevitch, eric van heck, and otto koppius identify four types of clusters in an online market of it services. empirical evidence reveals that these four portfolio clusters rely on either arms-length relationships supported by reverse auctions, or recurrent buying with negotiations or a mixed mode, using both exchange mechanisms almost equally (two clusters). the fourth paper puts forward the role and impact of intelligent agents and machine learning in networks and markets. the capability of agents to quickly execute tasks with other agents and systems will be a potential, sustainable and profitable strategy to act faster and better for business networks. wolf ketter, john collins, maria gini, alok gupta, and paul schrater identify how agents are able to learn from historical data and can detect different economic regimes, such as under-supply and over-supply in markets. therefore, agents are able to characterize the economic regimes of markets and forecast the next, future regime in the market to facilitate tactical and strategic decision making. they provide empirical evidence from the analysis of the trading agent competition for supply chain management (tac scm). we identify three important potential directions for future research. the first research stream deals with advanced network orchestration with distributed control and decision making. 
the first two papers indicate that network orchestration is a key critical component of successful business networks. research of intelligent agents is showing that distributed and decentralized decision making might provide smart solutions because it combines local knowledge of actors and agents in the network with coordination and control of the network as a whole. agents can help to reveal business rules in business networks, or gather pro-actively new knowledge about the business network and will empower the next generation of decision support systems. the second research stream deals with information sharing over and with network partners. for example, diederik van liere explores in his phd dissertation the concept of the "network horizon": the number of nodes that an actor can "see" from a specific position in the network [3] . most companies have a network horizon of "1". they know and exchange information with their suppliers and customers. however, what about the supplier of the suppliers, or the customer of their customers? one develops then a network horizon of "2". diederik van liere provides empirical evidence that with a larger network horizon a company can take a more advantageous network position depending on the distribution of the network horizons across all actors and up to a certain saturation point. the results indicate that the expansion of the network horizon will be in the near future a crucial success factor for companies. future research will shed more light on this type of network analysis and its impact on network performance. the third research stream will focus on the network platform with a networked business operating system (bos). most of the network scientists analyze the structure and dynamics of the business networks independent of the technologies that enable it to perform. it concentrates on what makes the network effective, the linked relationships between the actors, and how their intelligence is combined to reach the network's goals. digital technologies play a fundamental role in today's networks. they have facilitated improvements and fundamental changes in the ways in which organizations and individuals interact and combine as well as revealing unexpected capabilities that create new markets and opportunities. the introduction of new networked business operating systems will be feasible and these operating systems will go beyond the networked linking of traditional enterprise resource planning (erp) systems with customer relationship management (crm) software packages. implementation of a bos enables the portability of business processes and facilitates the end-to-end management of processes running across many different organizations in many different forms. it coordinates the processes among the networked businesses and its logic is embedded in the systems used by these businesses. 
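as an illustration of the "network horizon" described above, the sketch below computes, for a toy supply network, the set of partners an actor can "see" within a given number of hops; the graph, the function name and the example firms are ours, not taken from the cited dissertation.

```python
# illustrative sketch of the "network horizon": the partners visible within
# `horizon` hops of an actor's position in the network (toy data, not from [3]).
import networkx as nx

def network_horizon(graph, actor, horizon):
    """Return the set of nodes the actor can 'see' within `horizon` hops."""
    reachable = nx.single_source_shortest_path_length(graph, actor, cutoff=horizon)
    reachable.pop(actor, None)          # the actor itself is not part of its view
    return set(reachable)

# toy supply network: focal firm f, suppliers s1/s2, supplier-of-supplier s3,
# customer c1 and customer-of-customer c2
g = nx.Graph([("f", "s1"), ("f", "s2"), ("s1", "s3"), ("f", "c1"), ("c1", "c2")])
print(network_horizon(g, "f", 1))   # horizon 1: direct suppliers and customers
print(network_horizon(g, "f", 2))   # horizon 2: also s3 and c2
```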
smart business network initiative smart business networks: how the network wins network horizon and dynamics of network positions eric van heck holds the chair of information management and markets at rotterdam school of management, erasmus university, where he is conducting research and is teaching on the strategic and operational use of information technologies for companies and markets vervest is professor of business networks at the rotterdam school of management, erasmus university, and partner of d-age, corporate counsellors and investment managers for digital age companies firstly, we would like to thank the participants of the 2006 sbni discovery session that was held at the vanenburg castle in putten, the netherlands. inspiring sessions among academics and executives shed light on the characteristics and the functioning of smart business networks.secondly, we thank the reviewers of the papers for all their excellent reviews. we had an intensive review process and would like to thank the authors for their perseverance and hard work to create an excellent contribution to this special issue. we thank kevin desouza, max egenhofer, ali farhoomand, erwin fielt, shirley gregor, lorike hagdorn, chris holland, benn konsynski, kenny preiss, amrit tiwana, jacques trienekens, and dj wu for their excellent help in reviewing the papers.thirdly, we thank andy whinston for creating the opportunity to prepare this special issue of decision support systems on smart business networks. key: cord-307735-6pf7fkvq authors: walkey, allan j.; kumar, vishakha k.; harhay, michael o.; bolesta, scott; bansal, vikas; gajic, ognjen; kashyap, rahul title: the viral infection and respiratory illness universal study (virus): an international registry of coronavirus 2019-related critical illness date: 2020-04-29 journal: crit care explor doi: 10.1097/cce.0000000000000113 sha: doc_id: 307735 cord_uid: 6pf7fkvq the coronavirus disease 2019 pandemic has disproportionally strained intensive care services worldwide. large areas of uncertainly regarding epidemiology, physiology, practice patterns, and resource demands for patients with coronavirus disease 2019 require rapid collection and dissemination of data. we describe the conception and implementation of an intensive care database rapidly developed and designed to meet data analytic needs in response to the coronavirus disease 2019 pandemic—the multicenter, international society of critical care medicine discovery network viral infection and respiratory illness universal study. design: prospective cohort study and disease registry. setting: multinational cohort of icus. patients: critically ill patients with a diagnosis of coronavirus disease 2019. interventions: none. measurements and main results: within 2 weeks of conception of the society of critical care medicine discovery network viral infection and respiratory illness universal study, study leadership was convened, registry case report forms were designed, electronic data entry set up, and more than 250 centers had submitted the protocol for institutional review board approval, with more than 100 cases entered. conclusions: the society of critical care medicine discovery network viral infection and respiratory illness universal study provides an example of a rapidly deployed, international, pandemic registry that seeks to provide near real-time analytics and information regarding intensive care treatments and outcomes for patients with coronavirus disease 2019. 
key words: coronavirus disease 2019; registry. the coronavirus disease 2019 (covid-19) pandemic has introduced unprecedented challenges to healthcare systems worldwide. due to the effects of covid-19 on the respiratory system, geographic areas affected by the pandemic have experienced large surges in critically ill patients who require intensive care and multiple organ system support (1, 2) . in addition, case reports of medications hypothesized to reduce viral replication or systemic inflammation have spurred widespread off-label use without the usual level of evidence that has long been accepted in modern medicine, resulting in critical shortages of medications and frequently missing the opportunity to evaluate potential benefits as well as risks of such drugs (3) . large-scale data that enables rapid communication of patient characteristics, treatment strategies, and outcomes during a pandemic response would support nimble organizational planning and evaluation of effective critical care practices. we describe a novel icu database rapidly designed in response to the covid-19 pandemic to allow for near real-time data collection, analysis, and display-the multicenter, international society of critical care medicine (sccm) discovery network viral infection and respiratory illness universal study (virus). the sccm formed the discovery, the critical care research network, in 2016 to provide a central resource to link critical care clinical investigators to scale up research in critical illness and injury, providing both centralized networking and tangible resources including data storage and management, statistical support, grant writing assistance, and other project management needs. remarkably, the study consortium was assembled and publicized through the social media platform twitter (4) (5), daily twitter posts advertising the importance, availability, and purpose of the registry, and word of mouth. 
within 14 days of the initial twitter post bringing together the virus leadership team, the consortium ~250 sites across north america, south america, east, south, and western asia, africa, and europe had submitted institutional review board (irb) applications for participation in the registry (fig. 1) ; table 1 shows the number of sites reaching different landmarks to study participation during the first 2 weeks after a call for sites was announced. by week 2 after announcing the formation of the registry, data from more than 100 cases had been uploaded. the rapid enrollment of sites in the absence of external funding or support for the project is indicative of nearly universal enthusiasm across the international critical care community to collaborate across borders and silos in order to quickly learn from accumulating experience with covid-19. the overarching purpose of the sccm virus discovery database is to accelerate learning with regards to the epidemiology, physiology, and best practices in response to the covid-19 pandemic. the de-identified, hipaa compliant database was developed to capture both core data collection fields containing clinical information collected for all patients, and an enhanced data set of daily physiologic, laboratory, and treatment information collected by sites with available research support and/or infrastructure to allow for more intensive data collection (see data overview in fig. 2 , and detailed case report forms in appendix 1, supplemental digital content 1, http://links.lww.com/ccx/a163). the case report forms were adapted from the world health organization templates data collection forms (6), with edits to focus on an icuspecific context. case report forms went through rapid iterative editing to balance feasibility, efficiency, and comprehensiveness, with input from multiple clinical specialties. many challenges exist when initiating an international collaboration in multicenter clinical data sharing. the sccm virus discovery network will have a four pillar open science approach to data reporting and sharing. first, all centers have open access to their data for internal quality assurance and pilot studies. second, summary count data will be displayed on the sccm virus website (https://www.sccm.org/research/research/discovery-research-network/virus-covid-19-registry) as an interactive dashboard that will provide public reporting of real-time updates with regards to case counts, icu resource use, and outcomes. third, with appropriate data use agreements, investigators will be able apply to use the pooled multicenter data for independent research questions. fourth, the sccm covid-19 research team will identify urgent questions of clinical effectiveness, submit study protocols for independent methodological peer review, and design rigorous observational causal inference approaches (e.g., appropriate missing data methods, target trial emulation, use of directed acyclic graphs for covariate selection, quantitative sensitivity analyses [7] ) paired with data visualizations that produce real-time results displayed on the dashboard for immediate dissemination. we seek to facilitate a timely, democratized, and crowd-sourced discovery process, similar to icu databases such as medical information mart for intensive care (mimic-iii) (8) . 
the sccm virus discovery network will encourage that all research projects using consortium data post pre-prints in noncommercial archives, further facilitating rapid-reporting of research findings necessary for nimble response to a pandemic. we learned many lessons in a short period while setting up an international icu registry during a pandemic. strategies that worked to facilitate rapid progress included a strong social media presence, open communications and data harmonization with other research networks (e.g., national heart, lung, and blood institute prevention and early treatment of acute lung injury network), responsive irbs that identified the critical need for rapid approval of de-identified data collection in the setting of a pandemic, use of the established database infrastructure research electronic data capture (9) for construction of harmonized case report forms, as well as an academic-professional society partnership that facilitated rapid processing of data use agreements, and early set up of a central website for communication of study materials, frequently asked questions, and standard operating procedures. early review of literature and comparative research (10) of previous outbreaks was helpful in preparing the research resources; early reports from sentinel countries guided targeting of relevant data fields. in addition, assembly of a multidisciplinary leadership team enabled multiple stakeholder engagement and shared responsibility and mentorship across training levels. finally, a daily reminder to focus on the goals of the database-icu practices, physiology, and outcomes for patients with covid-19-helped to mitigate scope creep and allow for timely completion of the data infrastructure. strategies that may have improved the process included the use of a central irb, funding to support local data entry, and a preexisting team able to "flip the switch" on an existing infrastructure to immediately respond to a crisis. much of the world was relatively unprepared for the rapidly spreading covid-19 pandemic. four days after the pandemic was recognized and declared by the world health organization, we assembled an ad hoc team to initiate the registry of critically ill patients with covid-19 described herein. it is our profound hope that a similar registry will not be required in the future. however, it is likely that we will be applying lessons learned from covid-19 to future pandemics. our experience of quickly initiating the sccm discovery virus registry and moving from conception to data accrual within less than a month has taught us several valuable lessons-most important being that clinicians across the world want to donate their time for the greater good. as we continue to accrue data into the sccm discovery virus covid-19 registry, we anticipate that newly established infrastructure and networks will enable more nimble responses to data collection and discovery that allow us to learn from the past, and be better prepared for future pandemics. supplemental digital content is available for this article. direct url citations appear in the html and pdf versions of this article on the journal's website (http://journals.lww.com/ccejournal). dr. harhay is partially supported by national institutes of health/national heart, lung, and blood institute grant r00 hl141678. the remaining authors have disclosed that they do not have any potential conflicts of interest. 
for information regarding this article, e-mail: alwalkey@bu.edu clinical characteristics of coronavirus disease 2019 in china clinical course and risk factors for mortality of adult inpatients with covid-19 in wuhan, china: a retrospective cohort study fda authorizes widespread use of unproven drugs to treat coronavirus, saying possible benefit outweighs risk sccm virus covid-19 registry. 2020. available at isaric covid-19 clinical research resources. 2020. available at control of confounding and reporting of results in causal inference studies. guidance for authors from editors of respiratory, sleep, and critical care journals mimic-iii, a freely accessible critical care database research electronic data capture (redcap)-a metadata-driven methodology and workflow process for providing translational research informatics support guide to understanding the 2019 novel coronavirus key: cord-283793-ab1msb2m authors: chanchan, li; guoping, jiang title: modeling and analysis of epidemic spreading on community network with node's birth and death date: 2016-10-31 journal: the journal of china universities of posts and telecommunications doi: 10.1016/s1005-8885(16)60061-4 sha: doc_id: 283793 cord_uid: ab1msb2m abstract in this paper, a modified susceptible infected susceptible (sis) epidemic model is proposed on community structure networks considering birth and death of node. for the existence of node's death would change the topology of global network, the characteristic of network with death rate is discussed. then we study the epidemiology behavior based on the mean-field theory and derive the relationships between epidemic threshold and other parameters, such as modularity coefficient, birth rate and death rates (caused by disease or other reasons). in addition, the stability of endemic equilibrium is analyzed. theoretical analysis and simulations show that the epidemic threshold increases with the increase of two kinds of death rates, while it decreases with the increase of the modularity coefficient and network size. with the development of complex network theory, many social, biological and technological systems, such as the transportation networks, internet and social network, can be properly analyzed from the perspective of complex network. and many common characteristics of most real-life networks have been found out, e.g., small-world effect and scale-free property. for some kind of networks, the degree distributions have small fluctuations, and they are called as homogeneous networks [1] , e.g., random networks, small world networks and regular networks. in contrary to the homogeneous networks, heterogeneous networks [2] show power law distribution. based on the mean-field theory, many epidemic models, such as susceptible-infected (si), sis and susceptibleinfected-recovered/ removed (sir), have been proposed to describe the epidemic spreading process and investigate the epidemiology. it has been demonstrated that a threshold value exists in the homogeneous networks, while it is absent in the heterogeneous networks with sufficiently large size [3] . compared to the lifetime of individuals, the infectious period of the majority of infectious diseases is short. therefore, in most of the epidemic models, researchers generally choose to ignore the impact of individuals' birth and death on epidemic spreading. 
however, in real life, some infectious diseases have high death rate and may result in people's death in just a few days or even a few hours, such as severe acute respiratory syndrome (sars), hemagglutinin 7 neuraminidase 9 (h7n9) and the recent ebola. and some infectious diseases may have longer spreading time, like hbv, tuberculosis. besides, on the internet, nodes' adding and removing every time can also be treated as nodes' birth and death. in ref. [4] , liu et al. analyzed the spread of diseases with individuals' birth and death on regular and scale-free networks. they find that on a regular network the epidemic threshold increases with the increase of the treatment rate and death rate, while for a power law degree distribution network the epidemic threshold is absent in the thermodynamic limit. sanz et al. have investigated a tuberculosis-like infection epidemiological model with constant birth and death rates [5] . it is found that the constant change of the network topology which caused by the individuals' birth and death enhances the epidemic incidence and reduces the epidemic threshold. zhang et al. considered the epidemic thresholds for a staged progression model with birth and death on homogeneous and heterogeneous networks respectively [6] . in ref. [7] , an sis model with nonlinear infection rate, as well as birth and death of nodes, is investigated on heterogeneous networks. in ref. [8] , zhu et al. proposed a modified sis model with a birth-death process and nonlinear infection rate on an adaptive and weighted contact network. it is indicated that the fixed weights setting can raise the disease risk, and that the variation of the weight cannot change the epidemic threshold but it can affect the epidemic size. recently, it has been revealed that many real networks have the so-called community structure [9] , such as social networks, internet and citation networks. a lot of researchers focus on the study of epidemic spreading on community structure networks. liu et al. investigated the epidemic propagation in the sis model on homogeneous network with community structure. they found that community structure suppress the global spread but increase the threshold [10] . many researchers studied the epidemic spreading in scale-free networks with community structure based on different epidemic model, such as si model [11] , sis model [12] , sir model [13] [14] and susceptible exposed asymptomatically infected recovered (seair) model [15] . chu et al. investigated the epidemic spreading in weighted scale-free networks with community structure [16] . in ref. [17] , shao et al. proposed an traffic-driven sis epidemic model in which the epidemic pathway is decided by the traffic of nodes in community structure networks. it is found that the community structure can accelerate the epidemic propagation in the traffic-driven model, which is different from the traditional model. the social network has the property of community structure and some infectious diseases have high mortality rates or long infection period, while the previous studies only consider the impact of one of the aforementioned factors. so in this paper, we study the epidemic spreading in a modified sis epidemic model with birth and death of individuals on a community structure network. the rest of this paper is organized as follows. in sect. 2, we introduce in detail the network model and epidemic spreading process, and discuss the network characteristics either. in sect. 
3, mean-field theory is utilized to analyze the spreading properties of the modified sis epidemic model. sect. 4 gives some numerical simulations which support the theoretical analysis. at last, sect. 5 concludes the paper. as the phenomena of individuals' birth and death exist in real networks, the topology of the network changes over time. we consider undirected and unweighted graphs in this paper. the generating algorithm of the network with community structure can be summarized as follows: 1) we assume that each site of this network is empty or occupied by only one individual. 2) the probability to have a link between the individuals (non-empty sites) in the same community is p_i. 3) we create a link between two nodes (non-empty sites) belonging to different communities with probability p_e. 4) every site has its own state and may change with the evolution of the epidemic; in each time step, susceptible individuals and infected individuals may respectively die with probability α and β, meanwhile the corresponding site becomes empty and the links of these sites are broken. 5) for each empty site, a susceptible individual may be born with probability b, and then it creates links with other individuals with probability p_i in the same community or with probability p_e with individuals belonging to different communities. suppose the initial number of edges is k; for m communities of sizes n_i, its expected value is then k = p_i Σ_i n_i(n_i - 1)/2 + p_e Σ_{i<j} n_i n_j (eq. (1)). the state transition rules of the transmission process are schematically shown in fig. 1. all the sites of the network are described by the parameters e, s or i, which respectively represent empty sites, sites occupied by susceptible individuals and sites occupied by infected individuals. the specific process is as follows: an empty site can give birth to a healthy individual at rate b; a healthy individual can be infected by contact with infected neighbors at rate λ or die at rate α (due to other reasons); an infected individual can be cured at rate γ or die at rate β (on account of the disease). when an individual dies, its site becomes empty. in general, β > α, and all parameters above are non-negative. (fig. 1 shows the schematic diagram of the state transition rules.) an important measurement for community structure networks is the modularity coefficient [18]. it is defined as q = Σ_i [e_ii - (Σ_j e_ij)²] (eq. (2)), where e_ij denotes the proportion of edges between community i and j in the total network edges. e_ii and Σ_j e_ij can in turn be expressed in terms of p_i, p_e, the community sizes n_i and the total edge number k (eqs. (3) and (4)), which gives the modularity coefficient of our model (eq. (5)). therefore, for the given parameters m, n_i and k, combining eqs. (1) and (5), we can adjust the values of p_i and p_e to get community structure networks with various modularity q. since the network has a time-varying topology, it is necessary to characterize its statistical properties. we plot the curves of the average degree 〈k〉, average path length l and average clustering coefficient c of the networks changing with time. in fig. 2, the lateral axis denotes the time step; a time step is equal to one second. according to the statistics of birth and death rates of our country in recent years, we can approximately assume the birth rate b = 0.01 and the natural death rate α = 0.01. different infectious diseases have different mortality rates, and the mortality rate is affected by many factors (such as the region and personal habits), so several values of the disease death rate β are considered. in addition, the network size is 1 000. 
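the two-probability construction and the modularity coefficient described above can be sketched as follows. this is our own illustrative implementation (using the standard newman form q = Σ_i (e_ii - a_i²), with a_i the fraction of edge ends attached to community i), not the authors' code, and the probabilities below are chosen only to land near the k ≈ 10 000, q ≈ 0.3 setting used later in the simulations.

```python
# sketch of the two-probability community network and newman's modularity q;
# parameter names (m, n_i, p_i, p_e) follow the text, the implementation is ours.
import itertools
import random
import networkx as nx

def build_community_network(sizes, p_intra, p_inter, seed=0):
    rng = random.Random(seed)
    g = nx.Graph()
    label = {}                                   # node -> community index
    offset = 0
    for c, n in enumerate(sizes):
        members = range(offset, offset + n)
        offset += n
        g.add_nodes_from(members)
        label.update({u: c for u in members})
        for u, v in itertools.combinations(members, 2):
            if rng.random() < p_intra:           # intra-community link, probability p_i
                g.add_edge(u, v)
    for u, v in itertools.combinations(g.nodes(), 2):
        if label[u] != label[v] and rng.random() < p_inter:
            g.add_edge(u, v)                     # inter-community link, probability p_e
    return g, label

def newman_modularity(g, label, n_comm):
    """q = sum_i (e_ii - a_i^2): e_ii = fraction of edges inside community i,
    a_i = fraction of edge endpoints attached to community i."""
    k = g.number_of_edges()
    e_ii = [0.0] * n_comm
    a_i = [0.0] * n_comm
    for u, v in g.edges():
        if label[u] == label[v]:
            e_ii[label[u]] += 1.0 / k
    for u, d in g.degree():
        a_i[label[u]] += d / (2.0 * k)
    return sum(e - a * a for e, a in zip(e_ii, a_i))

sizes = [100] * 10                               # m = 10 communities, n = 1 000 nodes
g, label = build_community_network(sizes, p_intra=0.08, p_inter=0.013)
print(g.number_of_edges(), newman_modularity(g, label, len(sizes)))   # roughly 10 000 edges, q ~ 0.3
```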
as shown in fig. 2, the larger the network's link number k is, the higher the clustering coefficient c is and the smaller the average path length l is, and the statistical property values remain essentially unchanged for small β. this is because isolated nodes are not easily generated when the disease death rate is sufficiently small. the simulation results are averaged over 100 simulations. let the parameters s and i represent the density of healthy individuals and infected individuals in the entire network, and let s_i and i_i respectively be the density of susceptible and infected nodes within community i. based on the classical sis model [19], we establish a modified sis epidemic model considering the characteristic of community structure; in addition, the circumstances of nodes' birth and death are also taken into consideration in this model. the epidemic model can therefore be written as a set of mean-field rate equations for these densities (eqs. (6) and (7)), which can then be rewritten in a simplified form. setting ds/dt = 0 and di/dt = 0, we get two steady-state solutions. for the first solution (the disease-free equilibrium), we compute the jacobian matrix j together with its determinant det j and trace tr j (eq. (14)); if det j > 0, then tr j < 0 and the solution is stable, from which we obtain the critical value λ_c (eq. (15)). for the second solution (the endemic equilibrium), the jacobian matrix is given in eq. (16), where a is the same as above. clearly, when λ > λ_c the second solution is stable and the disease will diffuse in the network; otherwise the disease will die out. from eq. (17), we find that the threshold value is governed by α, β and b in a given network. in this section, we make a set of monte carlo simulations on n-node networks to find the relationships between the epidemic size and different parameters, such as the modularity coefficient, death rates, birth rate and total edge number. the following simulation results are averaged over 100 configurations with different sets of random community sizes n_i (i = 1, 2, …, m), and for each configuration, 200 simulations are taken with one randomly chosen seed node initially. fig. 3 shows the time evolution curves of the epidemic size, where β equals 0, 0.001 and 0.005 respectively. some related parameters are n = 1 000, m = 10, k = 10 000, q = 0.3, λ = 0.1, b = 0.01, α = 0.01. it is shown that when β ≠ 0, the epidemic size increases to a peak value and then decays towards a stable value; otherwise the epidemic size keeps increasing and finally reaches a steady state. the existence of a disease death rate hinders the spread of the disease by directly decreasing the infected fraction, so the maximum prevalence of epidemic spreading is largest when nodes' disease deaths are not considered. in addition, a larger β corresponds to a smaller stable epidemic size, which agrees well with reality. fig. 4 shows that the critical epidemic value decreases with the increase of the birth rate, while the epidemic prevalence increases with the increase of the birth rate. the arrows in fig. 4 indicate the theoretic epidemic threshold calculated through eq. (17). eq. (17) clearly shows that the birth rate is inversely proportional to the critical value, which is consistent with the simulation results in fig. 4. in real life, with the increase of the birth rate, the density of the whole population and the healthy proportion increase, which makes it easier for an infectious disease to diffuse. next, we plot the curves indicating the influence of the two kinds of death rates (natural death rate α and disease death rate β) on the epidemic threshold and the average disease prevalence. the arrows in figs. 5 and 6 indicate the theoretic epidemic threshold; in fig. 5, β is constantly equal to 0.05. 
for some infectious diseases, such as acquired immune deficiency syndrome (aids), it is necessary to consider the situation of individuals' natural deaths. from fig. 5, we find that the existence of the natural death rate α is conducive to preventing the spread of the disease: an increase of the threshold and a decrease of the epidemic size are expected with the increase of α. individuals' natural deaths decrease the density of the total population and thus restrain the propagation of the epidemic. the arrows in fig. 5 indicate the theoretic epidemic threshold. fig. 6 shows the effect of individuals' deaths caused by the disease on the epidemic threshold. the related parameters are b = 0.005, q = 0.3, k = 5 000 and α = 0.005. by comparison, it is found that the epidemic threshold increases as β grows, while the epidemic size decreases as β grows. disease deaths rapidly reduce the number of infected individuals in the population, so the existence of a disease death rate inhibits the epidemic spreading. in fig. 7, we study the effects of both the modularity coefficient q and the edge number k of the network on the epidemic threshold; fig. 7 plots the relationship between i_∞ and λ for different modularity coefficients q and edge numbers k. a larger k means that the individuals in the network are linked more closely. it is found that the epidemic threshold decreases with the increase of the modularity coefficient of the network, and that the epidemic size of a network with a higher modularity coefficient is larger around the epidemic threshold, while the inverse situation occurs when the infection rate is far greater than the threshold. this is because the infectious disease is mainly transmitted within the community, and when the propagation rate is sufficiently large, the infectious disease spreads throughout the network through the edges between communities. the inter-community edge density of a network with a higher modularity coefficient is small, which is not conducive to the spread between communities, thereby reducing the spreading size of the entire network. in addition, the epidemic threshold is inversely correlated with the total edge number k, which is consistent with real network circumstances. considering the circumstances of nodes' birth and death that may exist in real networks, a modified epidemic model based on the classical sis model has been proposed on a community structure network. an approximate formula for the epidemic threshold is obtained by mathematical analysis to find the relative relationships between the different parameters, and the stability of the endemic equilibrium is analyzed. the simulations in this study illustrate that the epidemic threshold λ_c increases with the increase of the death rates (natural death or disease death), while it decreases with the increase of the birth rate, modularity coefficient and edge number. this study is helpful for predicting the spreading trend of infectious diseases that may cause the deaths of individuals (such as ebola and h7n9) more accurately than before. 
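since the bodies of eqs. (6)-(17) are not reproduced in this extract, the sketch below integrates an illustrative homogeneous mean-field version of the e/s/i dynamics of sect. 2 (birth b into empty sites, natural death α, disease death β, recovery γ, infection λ over an average of 〈k〉 contacts). the recovery rate and the exact functional form are our assumptions, so the sketch shows the threshold behaviour qualitatively rather than reproducing the authors' eq. (17).

```python
# illustrative homogeneous mean-field sketch of the e/s/i dynamics described above
# (not the paper's exact equations): empty sites give birth to susceptibles at rate b,
# susceptibles die at rate alpha, infecteds recover at rate gamma or die at rate beta,
# and infection spreads at rate lam over an average of k_mean contacts.
def simulate(lam, b=0.01, alpha=0.01, beta=0.005, gamma=0.1, k_mean=20,
             s0=0.989, i0=0.001, dt=0.01, steps=200_000):
    s, i = s0, i0
    for _ in range(steps):
        empty = 1.0 - s - i
        ds = b * empty - lam * k_mean * s * i - alpha * s + gamma * i
        di = lam * k_mean * s * i - (gamma + beta) * i
        s, i = s + dt * ds, i + dt * di
    return i                       # approximate steady-state infected density

# sweeping lam shows the threshold behaviour discussed above: below a critical value
# the infection dies out, above it a finite endemic level is reached.
for lam in (0.002, 0.005, 0.01, 0.02, 0.05):
    print(f"lambda = {lam:.3f}  ->  i_inf ~ {simulate(lam):.4f}")
```

in this toy version the same qualitative trends appear: raising α or β pushes the critical λ up, while a higher birth rate or mean degree pushes it down.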
collective dynamics of 'small-world' networks emergence of scaling in random networks epidemic dynamics and endemic states in complex networks the spread of disease with birth and death on networks spreading of persistent infections in heterogeneous populations staged progression model for epidemic spread on homogeneous and heterogeneous networks global attractivity of a network-based epidemic sis model with nonlinear infectivity epidemic spreading on contact networks with adaptive weights proceedings of the 3rd international conference on image and signal processing (icisp'08) photoocr: reading text in uncontrolled conditions characterize energy impact of concurrent network-intensive applications on mobile platforms iodetector: a generic service for indoor outdoor detection the case for vm-based cloudlets in mobile computing community structure in social and biological networks epidemic spreading in community networks epidemic spreading in scale-free networks with community structure how community structure influences epidemic spread in social networks community structure in social networks: applications for epidemiological modeling a stochastic sir epidemic on scale-free network with community structure epidemic spreading on complex networks with community structure epidemic spreading in weighted scale-free networks with community structure traffic driven epidemic spreading in homogeneous networks with community structure finding and evaluating community structure in networks epidemic outbreaks in two-scale community networks this work was supported by the national natural science key: cord-200354-t20v00tk authors: miya, taichi; ohshima, kohta; kitaguchi, yoshiaki; yamaoka, katsunori title: experimental analysis of communication relaying delay in low-energy ad-hoc networks date: 2020-10-29 journal: nan doi: nan sha: doc_id: 200354 cord_uid: t20v00tk in recent years, more and more applications use ad-hoc networks for local m2m communications, but in some cases such as when using wsns, the software processing delay induced by packets relaying may not be negligible. in this paper, we planned and carried out a delay measurement experiment using raspberry pi zero w. the results demonstrated that, in low-energy ad-hoc networks, processing delay of the application is always too large to ignore; it is at least ten times greater than the kernel routing and corresponds to 30% of the transmission delay. furthermore, if the task is cpu-intensive, such as packet encryption, the processing delay can be greater than the transmission delay and its behavior is represented by a simple linear model. our findings indicate that the key factor for achieving qos in ad-hoc networks is an appropriate node-to-node load balancing that takes into account the cpu performance and the amount of traffic passing through each node. an ad-hoc network is a self-organizing network that operates independently of pre-existing infrastructures such as wired backbone networks or wireless base stations by having each node inside the network behave as a repeater. it is a kind of temporary network that is not intended for longterm operation. every node of an ad-hoc network needs to be tolerant of dynamic topology changes and have the ability to organize the network autonomously and cooperatively. 
because of these specific characteristics, since the 1990s, ad-hoc networks have played an important role as a mean for instant communication in environments where the network infrastructure is weak or does not exist, such as developing countries, disaster areas, and battle fields. however, in recent years, the ad-hoc network is also a hot topic in urban areas where the broadband mobile communication systems are well developed and always available. more and more applications use ad-hoc networks for local m2m communications, especially in key technologies that are expected to play a vital role in future society, such as intelligent transportation systems (its) supporting autonomous car driving, cyber-physical systems (cps) like smart grids, wireless sensor networks (wsn), and applications like the iot platform. these days, communication entities are shifting from humans to things; the network infrastructures tend to require a more strict delay guarantee, and the ad-hoc network is no exception. there have been many prior studies about delayaware communication in the field of ad-hoc networks [1] [4] . most of these focus on the link delay and only a few consider both node and link delays [1] , [2] . however, in some situations where the power consumption is severely limited (e.g., with wsn), the communication relaying cost of small devices with low-power processors may not be negligible for the end-to-end delay of each communication. it is necessary to discuss, on the basis of actual data measured on wireless ad-hoc networks, how much the link and node delays account for the end-to-end delay. in the field of wired networks, there have been many studies reporting measurement experiments of packet processing delay as well as various proposals for performance improvement [5] [10] . in addition, the best practice of qos measurement has been discussed in the ietf [11] . in the past, measurement experiments on asic routers have been carried out for the purpose of benchmarking routers working on isp backbones [5][7] ; in contrast, since the software router has emerged as a hot topic in the last few years, recent studies mainly concentrate on the bottleneck analysis of the linux kernel's network stack [8] [10] . there has also been a study focusing on the processing delay caused by the low-power processor assuming interconnection among small robots [12] . however, as far as we know, no similar measurement exists in the field of wireless ad-hoc networks. therefore, many processing delay models have been considered so far, e.g., simple linear approximation [13] or queueing model-based nonlinear approximation [14] , but it is hard to determine which one is the most reasonable for wireless ad-hoc networks. in this work, we analyze the communication delay in an adhoc network through a practical experiment using raspberry pi zero w. we assume an energy-limited ad-hoc network composed of small devices with low-power processors. our goal is to support the design of qos algorithms on adhoc networks by clarifying the impact of software packet processing on the end-to-end delay and presenting a general delay model to which the measured delay can be adapted. this is an essential task for future ad-hoc networks and their related technologies. first, we briefly describe the structure of the linux kernel network stack in sect. ii. we explain the details of our measurement experiment in sects. iii and iv, and report the results in sect. v. we conclude in sect. vi with a brief summary and mention of future work. 
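the two families of processing-delay models mentioned above (the simple linear approximation of [13] and the queueing-based nonlinear approximation of [14]) can be contrasted with a small sketch; all coefficients below are made-up illustrations, not measured values.

```python
# illustrative comparison of the two processing-delay model families cited above:
# a linear model d = c0 + c1 * packet_size, and an m/m/1 queueing model whose mean
# sojourn time is w = 1 / (mu - lam); every coefficient here is an assumption.
def linear_delay(size_bytes, c0=50e-6, c1=0.4e-6):
    """Fixed per-packet cost plus a per-byte cost (seconds)."""
    return c0 + c1 * size_bytes

def mm1_delay(arrival_pps, service_pps):
    """Mean time in an M/M/1 system; diverges as the load approaches 1."""
    if arrival_pps >= service_pps:
        return float("inf")
    return 1.0 / (service_pps - arrival_pps)

for size in (100, 500, 1000, 1400):
    print(f"{size:5d} B   linear model: {linear_delay(size) * 1e6:7.1f} us")
for pps in (100, 500, 900, 990):
    print(f"{pps:4d} pps  m/m/1 model:  {mm1_delay(pps, 1000) * 1e3:7.2f} ms")
```

the linear model grows gently with packet size, whereas the queueing model stays almost flat until the node approaches saturation and then blows up; which of the two better matches a low-energy relay is exactly what the measurements below are meant to decide.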
in this section, we present a brief description of the linux kernel's standard network stack from the viewpoints of the packet receiving and sending sequences. figure 1 shows the flow of packets in the network stack from the perspective of packet queueing. first, as the preparation for receiving packets, the nic driver allocates memory resources in ram that can store a few packets, and has packet descriptors (rx descriptors) hold these addresses. the rx ring buffer is a descriptor ring located in ram, and the driver notifies the nic of the head and tail addresses of the ring. the nic then fetches some unused descriptors by direct memory access (dma) and waits for the packets to arrive. the workflow after the packet arrival is as follows. as a side note, the below sequence is a receiving mechanism called new api (napi) supported in linux kernel 2.6 or later. i) once a packet arrives, nic writes the packet out as an sk buff structure to ram with dma, referring to the rx descriptors cached beforehand, and issues a hardirq after the completion. ii) the irq handler receiving hardirq pushes it by napi_schedule() to the poll list of a specific cpu core and then issues softirq so as to get the cpu out of the interrupt context. iii) the soft irq scheduler receiving softirq calls the interrupt handler net_rx_action() at the best timing. iv) net_rx_action() calls poll(), which is implemented in not the kernel but the driver, for each poll list. v) poll() fetches sk buff referring to the ring indirectly and pushes it to the application on the upper layer. at this time, packet data is transferred from ram to ram; that is, the data is copied from the memory in the kernel space to the receiving socket buffer in the user space by memcpy(). repeat this memory copy until the poll list becomes empty. vi) the application takes the payload from the socket buffer by calling recv(). this operation is asynchronous with the above workflows in the kernel space. the packet receiving sequence is completed when all the payloads have been retrieved. in the packet sending sequence, all the packets basically follow the reverse path of the receiving sequence, but they are stored in a buffer called qdisc before being written to the tx ring buffer (fig. 1) . the ring buffer is a simple fifo queue that treats all arriving packets equally. this design simplifies the implementation of the nic driver and allows it to process packets fast. qdisc corresponds to the abstraction of the traffic queue in the linux kernel and makes it possible to achieve a more complicated queueing strategy than fifo without modifying the existing codes of the kernel network stack or drivers. qdisc supports many queueing strategies; by default, it runs in pfifo_fast mode. if the packet addition fails due to a lack of free space in qdisc, the packet is pushed back to the upper layer socket buffer. as discussed in sect. i, the goal of this study is to evaluate the impact of software packet processing, induced by packet relaying, to the end-to-end delay, on the basis of an actual measurement assuming an ad-hoc network consisting of small devices with low-power processors. figure 2 shows our experimental environment, whose details are described in sect. iv. we define the classification of communication delays as below. both processing delay and queueing delay correspond to the application delay in a broad sense. 
• end-to-end delay: total of node delays and link delays • node delay: sum of processing delay, queueing delays, and any processing delays occurring in the network stack the proxy node (fig. 2) relays packets with the three methods below, and we evaluate the effect of each in terms of the end-to-end delay. by comparing the results of olsr and at, we can clarify the delay caused by packets passing through the network stack. • kernel routing (olsr): proxy relays packets by kernel routing based on the olsr routing table. in this case, the relaying process is completed in kernel space because all packets are wrapped in l3 of the network stack. accordingly, both processing delay and queueing delay defined above become zero, and node delay is purely equal to the processing delay on the network stack in the kernel space. • address translation (at): proxy works as a tcp/udp proxy, and all packets are raised to the application running in the user space. the application simply relays packets by switching sockets, which is equivalent to a fixed-length header translation. • encryption (enc): proxy works as a tcp/udp proxy. besides at, the application also encrypts payloads using aes 128-bit in ctr mode so that the relaying load depends on the payload size. for each relaying method, we conduct measurements with variations of the following conditions. we express all the results as multiple percentile values in order to remove delay spikes. because the experiment takes several days, we record the rssi of the ad-hoc network including five surrounding channels. • payload size • packets per second (pps) • additional cpu load (stress) in this section, we explain the technical details of the experimental environment and measurement programs. we use three raspberry pi zero ws (see table i for the hardware specs). the linux distributions installed on the raspberry pis are raspbian and the kernel version is 4.19.97+. we use olsr (rfc3626), which is a proactive routing protocol, and adopt olsrd as its actual implementation. since all three of the nodes are location fixed, even if we used a reactive routing protocol like aodv instead of olsr, only the periodic hello in olsr will change the periodic rreq induced by the route cache expiring; that is, in this experiment, whether the protocol is proactive or reactive does not have a significant impact on the final results. the ad-hoc network uses channel 9 (2.452 ghz) of ieee 802.11n, transmission power is fixed to -31 dbm, and bandwidth is 20 mhz. as wpa (tkip) and wpa2 (ccmp) do not support ad-hoc mode, the network is not encrypted. although the three nodes can configure an olsr mesh, as they are located physically close to each other, we have the sender/receiver drop olsr hello from the receiver/sender as well as the arp response by netfilter so that the network topology becomes a logically inline single-hop network, as show in fig. 2 . we use iperf as a traffic generator and measure the udp performance as it transmits packets from sender to receiver via proxy. the iperf embeds two timestamps and a packet id in the first 12 bytes of the udp data section (fig. 3) , and the following measurement programs we implement use this id to identify each packet. random data are generated when iperf starts getting entropy from /dev/urandom, and the same series is embedded in all packets. we create a loadable kernel module using netfilter and measure the queueing delay in receiving and sending udp socket buffers. 
the workflow is summarized as follows: the module hooks up the received packets with nf_inet_pre_routing and the sent packets with nf_inet_post_routing ( fig. 1) , retrieves the packet ids iperf marked by indirectly referencing the sk buff structure, and then writes them out to the kernel ring buffer via printk() with a timestamp obtained by ktime_get(). the proxy program is the application running in the user space. it creates af_inet sockets between sender and proxy as well as between proxy and receiver and then translates ip addresses and port numbers by switching sockets. furthermore, it records the timestamps obtained by clock_gettime() immediately after calling recv() and sendto(), and encrypts every payload data protecting the first 12 bytes of metadata marked by iperf so as not to be rewritten. the above refers to the udp proxy; the tcp proxy we prepare simply using socat. we execute a dummy process whose cpu utilization rate is limited by cpulimit as a controlled noise of the user space in order to investigate and clarify its impact on the node delay. we performed the delay measurement experiments under the conditions shown in table ii using the methods described in the previous section. due to the space constraints, we omit the results of the preliminary experiment. note that all experiments were carried out at the author's home; due to the japanese government's declaration of the covid-19 state of emergency, we have had to stick to the "stay home" initiative unless absolutely necessary. the experiment was divided into nine measurements. figure 4a shows the time variation of rssi during a measurement. we were unable to obtain snrs owing to the specifications of the wi-fi driver, and thus the noise floors were unknown, but the essids observed in the five surrounding channels were all less than -80 dbm. the rssi variabilities were also within the range that did not affect the modulation and coding scheme (mcs) [15] ; therefore, it appears that the link quality was sufficiently high throughout all measurements. figures 4b, 4c , and 4d shows the average time variations of node delay, which were the results under the condition of 1000 bytes, 200 pps, and 0% stress. the blue highlighted bars indicate upper outliers (delay spikes) detected with a hampel filter (σ = 3). there were 53 outliers in olsr, 115 in at, and 9 in enc. in general, when the cpu receives periodic interrupts (e.g., routing updates, snmp requests, gcs of ram), packet forwarding is paused temporarily so that the periodic delay spikes can be observed in the end-to-end delay. this phenomenon is called the "coffee-break effect" [7] and has been mentioned in several references [5] , [8] , [9] . for this experiment, as seen in the results of at (fig. 4c ), in the low-energy ad-hoc networks, it is evident that the cpurobbing by other processes like coffee-break had a significant impact on the communication delay. incidentally, there were fewer spikes under both 1) olsr and 2) enc than under at. 1) since the packet forwarding was completed within the kernel space, node delay was less susceptible to applications running in the user space. 2) since the payload encryption was overwhelmingly cpu-intensive, the influence of other applications was hidden and difficult to observe from the node delay. figures 5a and 5b shows the jitter of one-way communication delay. lines represent the average values, and we filled in the areas between the minimum and the maximum. 
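the spike handling described above (outlier detection with a hampel filter at σ = 3, followed by percentile reporting) can be sketched as follows; the window length and the synthetic delay series are illustrative, not the experiment's data.

```python
# sketch of the outlier handling described above: a hampel filter (sigma = 3) flags
# delay spikes against a rolling median, and percentiles summarize the cleaned series.
import numpy as np

def hampel_outliers(x, window=25, n_sigma=3.0):
    x = np.asarray(x, dtype=float)
    half = window // 2
    flags = np.zeros(len(x), dtype=bool)
    for t in range(len(x)):
        lo, hi = max(0, t - half), min(len(x), t + half + 1)
        med = np.median(x[lo:hi])
        mad = 1.4826 * np.median(np.abs(x[lo:hi] - med))   # robust std estimate
        if mad > 0 and abs(x[t] - med) > n_sigma * mad:
            flags[t] = True
    return flags

rng = np.random.default_rng(0)
delays_us = rng.normal(300, 15, size=2000)          # synthetic baseline node delay (us)
delays_us[rng.integers(0, 2000, size=20)] += 5000   # inject "coffee-break" style spikes
spikes = hampel_outliers(delays_us)
clean = delays_us[~spikes]
print(f"{spikes.sum()} spikes flagged")
print("p50/p90/p99 [us]:", np.percentile(clean, [50, 90, 99]).round(1))
```

filtering of this kind is what keeps the percentile summaries of the three relaying methods comparable despite the occasional coffee-break spike.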
regarding jitter, there were no significant differences between olsr and at, which suggests that lifting packets to the application layer does not affect jitter. jitter increased in proportion to the payload size only in the case of enc. similarly, only in the case of enc with 200 pps or more, the packet loss rate tended to increase with payload size, drawing a logarithmic curve as seen in fig. 5c ; in all other cases, no packet loss occurred regardless of the conditions. figure 6 shows the tendency of the node delay variation against several conditions, and fig. 7 shows the likelihood of occurrence as an empirical cdf. according to these figures, in the cases of olsr and at, the delay was nearly constant irrespective of pps and stress. there was a correlation between the variation and pps in olsr, while in at there was not; this suggests that application-level packet forwarding is less stable than kernel routing from the perspective of node delay. in the case of enc, the processing delay increased to the millisecond order, grew approximately linearly with respect to the payload size, and the delay variance became large overall. in addition, the graph tended to be smoothed as the pps increased; this arises from the fact that packet encryption takes up more cpu time, which makes the influence of other processes less conspicuous. it appears that the higher the pps, the lower the average delay (fig. 6a and 6b), and that the delay variance decreases around 1200 bytes (fig. 6c), but the causes of these remain unknown, and further investigation is required. one thing is certain: on the raspberry pi, pulling packets up to the application through the network stack results in a delay of more than 100 microseconds. figure 8 shows the breakdown of the end-to-end delay and also reports the node delay to link delay ratio (nlr). as we saw in fig. 2, for this experimental environment, the end-to-end delay includes two link delays, and the link delay shown in fig. 8 is the sum of them. the link delay was calculated from the effective throughput reported by iperf. as iperf does not support pps as an option, we achieved the target pps by adjusting the amount of transmitted traffic. the results showed that, in the cases of olsr and at, the nlr was almost constant with respect to the payload size, while in enc it showed an approximately linear increase. the nlr was less than 5% in olsr, while in at it was around 30%, which cannot be considered negligible. furthermore, node delay was greater than link delay when the payload size was over 1200 bytes in enc. in this work, we have designed and conducted an experiment to measure the software processing delay caused by packet relaying. the experimental environment is based on an olsr ad-hoc network composed of raspberry pi zero w boards. the results were qualitatively explainable and suggested that, in low-energy ad-hoc networks, there are situations where the processing delay cannot be ignored.
• the relaying delay of kernel routing is usually negligible, but when relaying is handled by an application, the delay can be more than ten times greater, however simple the task is.
• if an application performs cpu-intensive tasks such as encryption or full translation of protocol stacks, the delay increases according to a linear model and can be greater than the link's transmission delay.
for this reason, node-to-node load balancing that considers cpu performance or the amount of passing traffic could be extremely useful for achieving delay-guaranteed routing in ad-hoc networks.
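as a rough illustration of how such figures can be derived from the raw logs, the link delay and nlr can be estimated as below. the exact formulas used by the authors are not given in the text, so this sketch assumes that the per-link delay is approximated by the serialization delay implied by the effective throughput reported by iperf.

import numpy as np

def link_delay_s(payload_bytes: int, throughput_bps: float) -> float:
    # transmission (serialization) delay of one wireless link, assumed from throughput
    return payload_bytes * 8.0 / throughput_bps

def nlr(node_delays_s, payload_bytes, throughput_bps, n_links=2):
    # node delay to link delay ratio; this testbed has two links (sender-proxy, proxy-receiver)
    total_link = n_links * link_delay_s(payload_bytes, throughput_bps)
    return np.median(node_delays_s) / total_link

def percentile_summary(delays_s, qs=(50, 90, 99)):
    # results in the text are reported as multiple percentiles to suppress delay spikes
    return {q: np.percentile(delays_s, q) for q in qs}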
particularly in heterogeneous ad-hoc networks (hanets), where each node's hardware specs are different from each other, the accuracy of passing node selection would have a significant impact on the end-to-end delay. as we did not take any noise countermeasures in this experiment, our future work will involve similar measurements in an anechoic chamber to reduce the noise from external waves and an investigation of the differences in results. the upper limit of flow accommodation under allowable delay constraint in hanets a unified solution for gateway and in-network traffic load balancing in multihop data collection scenarios qos-aware routing based on bandwidth estimation for mobile ad hoc networks qos based multipath routing in manet: a cross layer approach measurement and analysis of single-hop delay on an ip backbone network dx: latencybased congestion control for datacenters experimental assessment of end-to-end behavior on internet measurement of processing and queuing delays introduced by an open-source router in a single-hop network a study of networking software induced latency scheme to measure packet processing time of a remote host through estimation of end-link capacity a one-way delay metric for ip performance metrics (ippm) real-time linux communications: an evaluation of the linux communication stack for real-time robotic applications characterizing network processing delay processor-sharing queues: some progress in analysis ieee 802.11 n/ac data rates under power constraints key: cord-285872-rnayrws3 authors: elgendi, mohamed; nasir, muhammad umer; tang, qunfeng; fletcher, richard ribon; howard, newton; menon, carlo; ward, rabab; parker, william; nicolaou, savvas title: the performance of deep neural networks in differentiating chest x-rays of covid-19 patients from other bacterial and viral pneumonias date: 2020-08-18 journal: front med (lausanne) doi: 10.3389/fmed.2020.00550 sha: doc_id: 285872 cord_uid: rnayrws3 chest radiography is a critical tool in the early detection, management planning, and follow-up evaluation of covid-19 pneumonia; however, in smaller clinics around the world, there is a shortage of radiologists to analyze large number of examinations especially performed during a pandemic. limited availability of high-resolution computed tomography and real-time polymerase chain reaction in developing countries and regions of high patient turnover also emphasizes the importance of chest radiography as both a screening and diagnostic tool. in this paper, we compare the performance of 17 available deep learning algorithms to help identify imaging features of covid19 pneumonia. we utilize an existing diagnostic technology (chest radiography) and preexisting neural networks (darknet-19) to detect imaging features of covid-19 pneumonia. our approach eliminates the extra time and resources needed to develop new technology and associated algorithms, thus aiding the front-line healthcare workers in the race against the covid-19 pandemic. our results show that darknet-19 is the optimal pre-trained neural network for the detection of radiographic features of covid-19 pneumonia, scoring an overall accuracy of 94.28% over 5,854 x-ray images. we also present a custom visualization of the results that can be used to highlight important visual biomarkers of the disease and disease progression. on march 11, 2020, the world health organization declared the covid-19 virus as an international pandemic (1) . 
the virus spreads among people via physical contact and respiratory droplets produced by coughing or sneezing (2) . the current gold standard for diagnosis of covid-19 pneumonia is real-time reverse transcription-polymerase chain reaction (rt-pcr). the test itself takes about 4 h; however, the process before and after running the test, such as transporting the sample and sending the results, requires a significant amount of time. moreover, pcr testing is not a panacea, as the sensitivities range from 70 to 98% depending on when the test is performed during the course of the disease and the quality of the sample. in certain regions of the world it is simply not routinely available. more importantly, the rt-pcr average turnaround time is 3-6 days, and it is also relatively costly at an average of ca$4,000 per test (3) . the need for a faster and relatively inexpensive technology for detecting covid-19 is thus crucial to expedite universal testing. the clinical presentation of covid-19 pneumonia is very diverse, ranging from mild to critical disease manifestations. early detection becomes pivotal in managing the disease and limiting its spread. in 20% of the affected patient population, the infection may lead to severe hypoxia, organ failure, and death (4) . to meet this need, high-resolution computed tomography (hrct) and chest radiography (cr, also known as chest x-ray imaging) are commonly available worldwide. patterns of pulmonary parenchymal involvement in covid-19 infection and its progression in the lungs have been described in multiple studies (5) . however, despite the widespread availability of x-ray imaging, there is unfortunately a shortage of radiologists in most low-resource clinics and developing countries to analyze and interpret these images. for this reason, artificial intelligence and computerized deep learning that can automate the process of image analysis have begun to attract great interest (6) . note that an x-ray costs about ca$40 per test (3), making it a cost-effective and readily available option. moreover, the x-ray machine is portable, making it versatile enough to be utilized in all areas of the hospital, even in the intensive care unit. since the initial outbreak of covid-19, a few attempts have been made to apply deep learning to radiological manifestations of covid-19 pneumonia. narin et al. (7) reported an accuracy of 98% on a balanced dataset for detecting covid-19 after investigating three pretrained neural networks. sethy and behera (8) explored 10 different pre-trained neural networks, reporting an accuracy of 93% on a balanced dataset, for detecting covid-19 on x-ray images. zhang et al. (9) utilized only one pretrained neural network, scoring 93% on an unbalanced dataset. hemdan et al. (10) looked into seven pre-trained networks, reporting an accuracy of 90% on a balanced dataset. apostolopoulos and bessiana (11) evaluated five pre-trained neural networks, scoring 98% accuracy on an unbalanced dataset. however, these attempts did not make clear which existing deep learning method would be the most efficient and robust for covid-19 compared to many others. moreover, some of these studies were carried out on unbalanced datasets. note that a balanced dataset is a dataset where the number of subjects in each class is equal. our study aims to determine the optimal learning method, by investigating different types of pre-trained networks on a balanced dataset, for covid-19 testing.
additionally, we attempt to visualize the optimal network weights, which were used for decision making, on top of the original x-ray image to visually represent the output of the network. we investigated 17 pre-trained neural networks: alexnet, squeezenet (12), googlenet (13), resnet-50 (14), darknet-53 (15), darknet-19 (15), shufflenet (16), nasnet-mobile (17), xception (18), places365-googlenet (13), mobilenet-v2 (19), densenet-201 (20), resnet-18 (14), inception-resnet-v2 (21), inception-v3 (22), resnet-101 (14), and vgg-19 (23). all the experiments in our work were carried out in matlab 2020a on a workstation (gpu nvidia geforce rtx 2080ti 11 gb, ram 64 gb, and intel processor i9-9900k @3.6 ghz). the dataset was divided into 80% training and 20% validation. the last fully connected layer was replaced to match the new task of classifying two classes. the following parameters were fixed for the 17 pre-trained neural networks: the learning rate was set to 0.0001, the validation frequency was set to 5, the maximum number of epochs was set to 8, and the mini-batch size was set to 64. the class activation mapping was carried out by multiplying the image activations from the last relu layer by the weights of the last fully connected layer of the darknet-19 network, called "leaky18", as follows: c(i, j) = sum over k of w(k) f(i, j, k), where c is the class activation map, l is the layer number, f is the image activations from the relu layer (l = 60) with dimensions of 8 × 8 × 1,024, and w refers to the weights at l = 61 with dimensions of 1 × 1 × 1,024. thus, the dimensions of c are 8 × 8. we then resized c to match the size of the original image and visualized it using a jet colormap. two datasets are used. the first dataset is the publicly available coronahack-chest x-ray-dataset, which can be downloaded from this link: https://www.kaggle.com/praveengovi/coronahack-chest-xraydataset. this dataset contains the following number of images: 85 covid-19, 2,772 bacterial, and 1,493 viral pneumonias. the second dataset is a local dataset collected from an accredited level i trauma center: vancouver general hospital (vgh), british columbia, canada. the dataset contains only 85 covid x-ray images. the coronahack-chest x-ray-dataset contains only 85 x-ray images for covid, and to balance the dataset for neural network training, we had to downsize the sample size from 85 to 50 by random selection. to generate the "other" class, we downsized the samples by selecting 50 radiographic images that were diagnosed as healthy to match and balance the covid-19 class. radiographs labeled as bacterial or other viral pneumonias have also been included in the study to assess specificity. the number of images used in training and validation to retrain the deep neural network is shown in table 1 .
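the class activation map computation described above amounts to a weighted sum of the last relu feature maps, followed by resizing and colormapping. a numpy sketch with the stated array shapes is given below; the authors' implementation is in matlab, and scipy's zoom is used here only as a stand-in for the image resize.

import numpy as np
from scipy.ndimage import zoom

def class_activation_map(relu_activations, fc_weights, image_hw):
    # relu_activations: (8, 8, 1024) activations of the last relu layer
    # fc_weights:       (1024,) weights of the last fully connected layer for one class
    cam = np.tensordot(relu_activations, fc_weights, axes=([2], [0]))          # (8, 8) map
    cam = zoom(cam, (image_hw[0] / cam.shape[0], image_hw[1] / cam.shape[1]))  # resize to image size
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-9)                   # normalize before a jet colormap
    return cam

# overlaying the returned map on the x-ray with a jet colormap gives the kind of
# visualization shown in figure 4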
the images (true positive) submitted for the analysis by the vgh team were anonymized and mixed with an equal number of normal chest radiographs to create a balanced data set. the remaining from the coronahack-chest x-ray-dataset was used to test the specificity of the algorithm. dataset 2 was used an external dataset to test the robustness of the algorithm, with a total of 5,854 x-ray images (58 covid-19, 1,560 healthy, 2,761 bacterial, and 1,475 viral pneumonias), as shown in table 1 . note that there is no overlap between dataset 1 and dataset 2. to determine the optimal existing pre-trained neural network for the detection of covid-19, we used the coronahack-chest x-ray-dataset. the chest x-ray images dataset contains 85 images from patients diagnosed with covid-19 and 1,576 images from healthy subjects. five x-ray images collected from the lateral position were deleted for consistency. we then balanced the dataset to include 50 rt-pcr positive covid-19 patients and 50 healthy subjects. from the group of 85 rt-pcr positive cases patients were randomly selected with varying extent of pulmonary parenchymal involvement. after creating a balanced dataset, which is important for producing solid findings, 17 pretrained networks were analyzed following the framework shown in figure 1 . the 17 pre-trained neural networks were trained on a large data set by using more than a million images, as a result the algorithms developed can classify new images into 1,000 different object categories, such as keyboard, mouse, pencil, and various animals. through artificial intelligence and machine learning each network can detect images based on unique features representative of a particular category. by replacing the last fully connect layer, as shown in figure 1 , and retraining (fine-tune table 2 . interestingly, we found that the following two pre-trained neural networks achieved an accuracy of 100% during the training and validation phases using dataset 1: resnet-50 and darknet-19. inception-v3 and shufflenet achieved an overall validation accuracy below 90% suggesting that these neural networks are not robust enough for detecting covid-19 compared to, for example, resnet-50 and darknet-19. despite that the inception-renet-v2 was pre-trained on trained on more than a million images from the imagenet database (21), it was not ranked the highest in terms of the overall performance, suggesting it is not suitable to use for detecting covid-19. each pre-trained network has a structure that is different from others, e.g., number of layers and size of input. the most important characteristics of a pre-trained neural network are as follows: accuracy, speed, and size (24) . greater accuracy increases the specificity and sensitivity for covid-19 detection. increased speed allows for faster processing. smaller networks can be deployed on systems with less computational resources. therefore, the optimal network is the network that increases accuracy, utilizes less training time, and that is relatively small. typically, there is a tradeoff between the three characteristics, and not all can be satisfied at once. however, our results show that it is possible to satisfy all three requirements. darknet-19 outperformed all other networks, while having increased speed and increased accuracy in a relatively small-sized network, as shown in figure 2 . a visual comparison between all investigated pre-trained neural networks is presented, with respect to the three characteristics. 
in this comparison (figure 2), the x-axis is the training time (logarithmic scale) in seconds, the y-axis is the overall validation accuracy, and the bubble size represents the network size. note that darknet-19 and resnet-50 achieved an accuracy of 100%; however, darknet-19 is much faster and requires less memory. a comparison of optimal neural networks recommended in previous studies, along with the optimal neural network suggested by this work, is shown in table 3 . narin et al. (7) used a balanced sample size of 100 subjects (50 covid-19 and 50 healthy). they investigated three pre-trained neural networks: resnet50, inceptionv3, and inceptionresnetv2, with a cross-validation split of 80-20%. they found that resnet50 outperformed the other two networks, scoring a validation accuracy of 98%. sethy and behera (8) explored 10 different pre-trained neural networks and reported an accuracy of 93% on a balanced dataset. it is worth noting that the studies discussed in table 3 did not use other populations, such as bacterial pneumonia, to test specificity. moreover, they did not use an external dataset to test reliability. in other words, they had only training and validation datasets. note that we used two datasets: dataset 1 for training and validation and dataset 2 for testing. interestingly, the resnet-50 network achieved a high accuracy in three different studies. note that these studies only compared resnet-50 to a select few neural networks, whereas here we compared a total of 17. one possible reason that our resnet-50 achieved 100% is that the dataset (dataset 1) in our study differed from the datasets in other studies. another reason is the network's parameter settings (e.g., learning rate). however, darknet-19 also achieved a validation accuracy of 100%, and it is not clear which network more accurately detects radiographic abnormalities associated with covid-19 pneumonia. two approaches will be used to compare the performance between the darknet-19 and resnet-50 networks: (1) model fitting and (2) performance over dataset 2.
1. model fitting: achieving a good model fit is the target of any learning algorithm, by providing a model that does not suffer from either over-fitting or under-fitting (25) . typically, a "well-fitted" model is obtained when both training and validation loss curves decrease to a stability zone where the gap between the loss curves is minimal (25) . this gap is referred to as the "generalization gap", and it can be seen in figure 3 ; the gap between the loss curves in darknet-19 is smaller than the gap in resnet-50. this suggests that darknet-19 is more optimal when compared to resnet-50, even though both achieved 100% accuracy on the training and validation images using dataset 1.
2. performance over the testing dataset: in this step, the reliability and robustness of darknet-19 and resnet-50 over dataset 2 are examined. as can be seen in table 4 , both neural networks were able to differentiate the patterns. as we are interested in finding the model that achieves high sensitivity with a minimal generalization gap, the optimal neural network to be used is darknet-19.
availability of efficient algorithms to detect and categorize abnormalities on chest radiographs into subsets can be a useful adjunct in clinical practice. darknet-19's accuracy in detecting radiographic patterns associated with covid-19 in portable and routine chest radiographs at varied clinical stages makes it a robust and useful tool. use of such efficient algorithms in everyday clinical practice can help address the problem of shortage of skilled manpower, contributing to the provision of better clinical care.
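the comparison on dataset 2 rests on sensitivity and overall accuracy; for completeness, these can be computed from a confusion matrix as follows. the counts in the example are placeholders, not the values of table 4.

def screening_metrics(tp, fn, tn, fp):
    # sensitivity: recall on covid-19 radiographs; specificity: recall on non-covid radiographs
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy

# example with made-up counts only
print(screening_metrics(tp=55, fn=3, tn=5460, fp=336))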
more institution-based research is, however, required in this area. while the darknet-19 algorithm can distinguish covid-19 patients from other populations with 94.28% accuracy, we note the following limitations:
1. the covid sample size used in the training and validation phase was relatively small (50 images).
2. the images were not segregated based on the technique of acquisition (portable or standard supine ap chest radiograph) or positioning (posteroanterior vs. anteroposterior). thus, any possible errors that might arise because of the patient's positioning have not been addressed in the study. lateral chest radiographs were excluded from the data set.
3. our investigation compared radiographic features of covid-19 patients to healthy individuals. as a next step in our investigation, the radiographic data from covid-19 patients should also be compared with other respiratory infections in order to improve the specificity of the algorithm for detection of covid-19.
an important component of the automated analysis of the x-ray data is the visualization of the x-ray images, using colors to identify the critical visual biomarkers as well as indications of disease progression. this step can make disease identification more intuitive and easier to understand, especially for healthcare workers with minimal knowledge about covid-19. the visualization can also expedite the diagnosis process. as shown in figure 4 (true positive), covid-19 subjects were identified based on the activation images and weights. also, examples of a false positive (a non-covid subject identified as covid), a false negative (a covid subject identified as non-covid), and a true negative (a non-covid subject identified as non-covid) were shown. note that the main purpose of this paper is not to investigate the difference between pre-trained and trained neural networks; the purpose is rather to provide a solution that is based on already existing and proven technology to use for covid screening. if the accuracy achieved by the pre-trained neural network is not acceptable to radiologists, then exploring different untrained convolutional neural networks could be worth doing. also, including the patient's demographic information, d-dimer, oxygen saturation level, troponin level, neutrophil to lymphocyte ratio, glucose level, heart rate, degree of inspiration, and temperature may improve the overall detection accuracy. in conclusion, fast, versatile, accurate, and accessible tools are needed to help diagnose and manage covid-19 infection. the current gold standard laboratory tests are time-consuming and costly, adding delays to the testing process. chest radiography is a widely available and affordable tool for screening patients with lower respiratory symptoms or suspected covid-19 pneumonia. the addition of computer-aided radiography can be a useful adjunct in improving throughput and early diagnosis of the disease; this is especially true during a pandemic, particularly during the surge, and in areas with a shortage of radiologists. in this paper, we have reviewed and compared many deep learning techniques currently available for detecting radiographic features of covid-19 pneumonia. after investigating 17 different pre-trained neural networks, our results showed that darknet-19 is the optimal pre-trained deep learning network for detection of imaging patterns of covid-19 pneumonia on chest radiographs.
work to improve the specificity of these algorithms in the context of other respiratory infections is ongoing. the coronahack-chest x-ray-dataset used in this study is publicly available and can be downloaded from this link: https://www.kaggle.com/praveengovi/coronahack-chestxraydataset. requests to access the dataset collected at vancouver general hospital should be directed to savvas nicolaou, savvas.nicolaou@vch.ca. dataset 1 and all trained neural networks can be accessed via this link https://github.com/ elgendi/covid-19-detection-using-chest-x-rays. me designed the study, analyzed the data, and led the investigation. mn, wp, and sn provided an x-ray dataset, annotated the x-ray images, and checked the clinical perspective. me, mn, qt, rf, nh, cm, rw, wp, and sn conceived the study and drafted the manuscript. all authors approved the final manuscript. this research was supported by the nserc grant rgpin-2014-04462 and canada research chairs (crc) program. the funders had no role in the design of the study, in the collection, analyses, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results. who declares covid-19 a pandemic clinical features of patients infected with 2019 novel coronavirus in wuhan cost analysis of multiplex pcr testing for diagnosing respiratory virus infections characteristics of and important lessons from the coronavirus disease 2019 (covid-19) outbreak in china: summary of a report of 72,314 cases from the chinese center for disease control and prevention chest ct findings in patients with coronavirus disease 2019 and its relationship with clinical features deep learning in radiology automatic detection of coronavirus disease (covid-19) using x-ray images and deep convolutional neural networks detection of coronavirus disease (covid-19) based on deep features covid-19 screening on chest x-ray images using deep learning based anomaly detection covidx-net: a framework of deep learning classifiers to diagnose covid-19 in x-ray images covid-19: automatic detection from x-ray images utilizing transfer learning with convolutional neural networks alexnet-level accuracy with 50× fewer parameters and <0.5 mb model size. arxiv going deeper with convolutions deep residual learning for image recognition open source neural networks in c an extremely efficient convolutional neural network for mobile devices learning transferable architectures for scalable image recognition xception: deep learning with depthwise separable convolutions mobilenetv2: inverted residuals and linear bottlenecks densely connected convolutional networks inception-v4, inception-resnet and the impact of residual connections on learning inception-v3 for flower classification very deep convolutional networks for large-scale image recognition imagenet large scale visual recognition challenge learning curve models and applications: literature review and research directions key: cord-324256-5tzup41p authors: feng, shanshan; jin, zhen title: infectious diseases spreading on a metapopulation network coupled with its second-neighbor network date: 2019-11-15 journal: appl math comput doi: 10.1016/j.amc.2019.05.005 sha: doc_id: 324256 cord_uid: 5tzup41p traditional infectious diseases models on metapopulation networks focus on direct transportations (e.g., direct flights), ignoring the effect of indirect transportations. 
based on the global aviation network, we turn the problem of indirect flights into a question of second neighbors, and propose a susceptible-infectious-susceptible model to study disease transmission on a connected metapopulation network coupled with its second-neighbor network (snn). we calculate the basic reproduction number, which is independent of human mobility, and we prove the global stability of disease-free and endemic equilibria of the model. furthermore, the study shows that the behavior that all travelers travel along the snn may hinder the spread of disease if the snn is not connected. however, the behavior that individuals travel along the metapopulation network coupled with its snn contributes to the spread of disease. thus for an emerging infectious disease, if the real network and its snn keep the same connectivity, indirect transportations may be a potential threat and need to be controlled. our work can be generalized to high-speed train and rail networks, which may further promote other research on metapopulation networks. with the rapid development of technology, the rate of globalization has increased, which brings not only opportunities to countries, but also many challenges, such as the global transmission of infectious diseases. for example, severe acute respiratory syndrome (sars) [1] , originating from guangdong province, china, spread around the world along international air travel routes. influenza a (h1n1) flu [2] in 2009, which was first reported in mexico, became a global issue, and was followed by the emergence of avian influenza [3] , middle east respiratory syndrome coronavirus (mers-cov) [4] , ebola virus disease [5] , and zika [6] . the outbreak of any infectious disease has a great impact on humans, either physically, mentally, or economically. how to forecast and control the global spread of infectious diseases has always been a focus of research. one of the effective methods to address this problem is the introduction of metapopulation networks. a metapopulation network is a network whose nodes (subpopulations) represent well-defined social units, such as countries, cities, towns, and villages, with links standing for the mobility of individuals. using heterogeneous mean-field (hmf) theory and assuming that subpopulations with the same degree are statistically equivalent, colizza and vespignani proposed two models to describe transmission of diseases on heterogeneous metapopulation networks under two different mobility patterns, which sheds light on the calculation of the global invasion threshold [7] . next, different network structures, including bipartite metapopulation networks [8] , time-varying metapopulation networks [9] , local subpopulation structure [10] , and interconnected metapopulation networks [11] , have been found to play an essential role in the global spread of infectious diseases. furthermore, studies have shown that adaptive behavior of individuals contributes to the global spread of epidemics, contrary to willingness [12] [13] [14] [15] . these works mostly focus on large-scale, air-travel-like mobility patterns, without individuals going back to their origins. there are also some studies on recurrent mobility patterns. balcan and vespignani investigated the invasion threshold on metapopulation networks with recurrent mobility patterns [16, 17] . heterogeneous dwelling time in subpopulations was considered in ref. [18] .
nearly all studies above are under the assumption that mobility between two linked subpopulations is based on direct flights or other direct transportations. for aviation networks, sometimes there is no any direct flight when people travel, which makes them have to traverse to other places before reaching their destinations. even in the case of a direct flight, individuals may have to make two or more stops before reaching their destinations. actually, these two cases reflect the same problem of individual transfer in a metapopulation network. for an emerging infectious disease, taking transfer once as an example, movement of infectious individuals may result in more susceptible subpopulations being infected, since infectious individuals in an infected subpopulation can arrive not only at its neighbors but also the neighbors of its neighbors. to address this problem, we define second neighbor and second-neighbor network (snn) on an arbitrary undirected network. then we investigate the spread of an infectious disease on a connected metapopulation network coupled with its snn and study how indirect flights affect the global spread of the infectious disease. we show that the behavior that individuals travel along the metapopulation network coupled with its snn contributes to the spread of disease. the paper is organized as follows. in section 2 , we introduce second neighbor and give some definitions on snn of an arbitrary undirected network. next, an infectious disease model is derived to study how transfer rate affects the global transmission of a disease in section 3 . further, the basic reproduction number and the stability analysis of model are given in section 4 . section 5 presents some simulation results. conclusions are given in section 6 . in order to investigate the effect of indirect flights on disease transmission on a metapopulation network, we introduce the concepts of second neighbor and snn in the following. definition 2.1. a second neighbor of node i in a network (undirected) is a node whose distance from i is exactly two. according to the definition above, j is a second neighbor of i means that there exists(exist) self-avoiding path(s) of length two from i to j . as illustrated in fig. 1 , the number of self-avoiding paths of length two between two nodes may be larger than 1. since what we focus on is the existence of these paths but not the number, we say these paths are equivalent if the number of self-avoiding paths of length two is larger than 1. in a similar way, one can define third neighbor and k th neighbor. a third neighbor of node i in a network (undirected) is a node whose distance from i is exactly three. a k th ( k > 3) neighbor of node i in a network (undirected) is a node whose distance from i is exactly k . based on the definitions above, we give the definition of snn. an snn for a undirected network is composed by all second neighbors of nodes. in other words, an snn keep the same nodes with the given network, and there existing a link between two nodes means that one node is a second neighbor of the other. fig. 2 illustrates a undirected network and its snn. in panel a, for node 1, for example, its second neighbors are nodes 2 and 6 according to the definition of second neighbor. in the same way, we obtain all second neighbors for each node and construct an snn (see panel b). similarly, one can get a third-neighbor network, a fourth-neighbor network, and so on. 
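the definitions above translate directly into a matrix construction: a node j is a second neighbor of i exactly when there is a self-avoiding path of length two between them but no direct edge. a small numpy sketch follows; the 6-node ring used below is an assumed toy example, not the network of fig. 2.

import numpy as np

def second_neighbor_matrix(A):
    # A: symmetric 0/1 adjacency matrix with zero diagonal
    A = np.asarray(A)
    A2 = A @ A                          # A2[i, j] = number of length-2 paths between i and j
    B = ((A2 > 0) & (A == 0)).astype(int)
    np.fill_diagonal(B, 0)              # a node is not its own second neighbor
    return B

A = np.array([[0, 1, 0, 0, 0, 1],
              [1, 0, 1, 0, 0, 0],
              [0, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [0, 0, 0, 1, 0, 1],
              [1, 0, 0, 0, 1, 0]])      # a 6-node ring (toy example)
B = second_neighbor_matrix(A)           # adjacency matrix of the snn
k2 = B.sum(axis=1)                      # next-nearest degrees; each node here has two second neighbors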
consider a simple undirected network with n nodes and let us label the nodes with integer labels 1 , . . . , n . the adjacency matrix a = (a i j ) n ×n is a matrix with entries accordingly, the number of second neighbors of node i , named as next-nearest degree, is k (2) i = n j=1 b i j . similarly, we use p (2) k to denote the probability that the number of second neighbors is exactly k . notice that where a (2) i j (≥ 0) is the number of paths with length two between nodes i and j . with the definition of second neighbor, b can be uniquely expressed by a and a 2 with the matrix b is symmetric due to the fact that the matrix a is symmetric. for complete graphs and networks with all nodes' degrees being 0 or 1, b = 0 , that is, b is a zero matrix. for an aviation network, more common is indirect flights with one stop except for direct flights, which is due to the fact that people are rarely willing to have more than one stop during their journey. actually, the network constructed by all these indirect flights is exactly the snn of the aviation network. thus, to solve the problem of these indirect flights is equivalent to fixing the problem of the snn for a metapopulation network. consider a connected and undirected metapopulation network with n subpopulations coupled with its snn, and an acute infectious disease (such as, influenza) with a susceptible-infectious-susceptible (sis) transmission process intra subpopulation. as illustrated in fig. 3 , each node represents a population in which individuals with different disease states (blue circle for susceptible individuals, red pentagram for infectious individuals) are well-mixed, while links represent that there exists individuals' mobility between two nodes. here dashed links represent the second neighbor relationship. on the other hand, metapopulation networks are weighted, which measures traffic flows between two linked subpopulations. for node i , let the weight of a link be defined as the probability at which individuals in node i travel along the link. in consideration of a general form, weight matrix for a metapopulation network is of the form . . , n ) and equality holds when a i j = 0 . in addition, the matrix w (1) satisfies the condition that each row sum equals one, that is, n j=1 a i j w (1) i j = 1 . in the same way, weight matrix for its snn takes the form furthermore, for disease spreading process intra subpopulation, let β denote transmission rate, and γ denote recovery rate of an infectious individual. referring to mobility process inter subpopulations, mobility rate at which an individual leaves a given subpopulation to its neighbors or second neighbors is denoted by δ. to depict the case of individuals transfer, we denote transfer rate by q , the rate at which an individual leaves a given subpopulation to its second neighbors, so the rate of an individual leaving a given subpopulation to its neighbors is 1 − q . we note that q = 0 when k (2) i = 0 . we assume that these rates keep the same for all subpopulations and that these rates are all at unit time: per day. upon these bases, we consider the following model for eq. (3.1a) , the first and the second terms represent disease transmission and recovery processes in a given subpopulation i , respectively. meanwhile, the latter three terms express the mobility process inter subpopulations. in detail, the fourth term is on behalf of the case that individuals arrive at subpopulation i from its neighbors, and the fifth term shows the case of individuals traveling along snn. 
the process of individuals in subpopulation i traveling to other subpopulations, including neighbors and second neighbors, is described by the third term, which is equivalent to the expression this model is the traditional metapopulation network model [19] . remark 3.2. q = 1 portrays the case that all individuals travel along the snn of the metapopulation network, which corresponds to the situation that governments prohibit direct flights when an infectious disease occurs or the case in areas with underdeveloped economy and poor traffic, and it is expressed by for the metapopulation network, assuming that subpopulations with the same degree and next-nearest degree are statistically equivalent and that link weights depend on degree (for the metapopulation network) and next-nearest degree (for the snn) of nodes, according to ref. [20] , we obtain an equivalent mean-field model s k (1) ,k (2) = −β s k (1) ,k (2) i k (1) ,k (2) n k (1) ,k (2) (1) ,l (2) , (1) ,l (2) , (3.4b) here subscripts are degree k (1) and next-nearest degree k (2) , respectively. n k (1) ,k (2) is the average population of subpopulations with the same degree k (1) and the same next-nearest degree k (2) , and definitions of s k (1) ,k (2) and i k (1) ,k (2) are similar. (1) ) denotes the conditional probability that a subpopulation with degree k (1) is connected to a subpopulation of degree l (1) , and p (2) ( l (2) | k (2) ) is a similar definition on snn to p (1) ( l (1) | k (1) ). summing eqs. (3.1a) and (3.1b) gives a ji w (1) ji n j + δq n j=1 b ji w (2) ji n j , i = 1 , . . . , n. thus, (−m) is a singular m-matrix. from (3.6) , letting n = n i =1 n i , we obtain that the total population n is constant (because n = 0 ). subject to this constraint, by theorem 3.3 in [21] , we show that (3.6) has a unique positive equilibrium n i = n * i , which is globally asymptotically stable. since we are only interested in the asymptotic dynamics of the global transmission of disease on the metapopulation network coupled with its snn, we will study the limiting system of (3.1) (3.7) in this section, we calculate the basic reproduction number, and prove the existence and stability of disease-free equilibrium (dfe) and endemic equilibrium (ee). before studying the global stability of dfe, we calculate the basic reproduction number following the approach of van den driessche and watmough [22] . obviously, there exists a unique dfe e 0 = (0 , . . . , 0) for system (3.7) . according to eq. (3.7) , the rate of appearance of new infections f and the rate of transfer of individuals out of the compartments v in the e 0 are given by here f and v are n × n matrices. using the next-generation matrix theory [22] , the basic reproduction number is r 0 = ρ(f v −1 ) , where ρ is the spectral radius of the matrix f v −1 . in the following, we calculate the basic reproduction number r 0 . note that the sum of each column of matrix v is γ and the matrix v is column diagonally dominant. so v is an irreducible nonsingular m-matrix. thus v −1 is a positive matrix. matrix v has column sum γ , i.e., 1 t has column sum β/ γ . by theorem 1.1 in chapter 2 in ref. [23] , the basic reproduction number is the threshold value r 0 depends only on disease parameters β and γ but not on mobility rate δ or transfer rate q , thus, mobility of individuals has no impact on the basic reproduction number. however, movements of individuals between subpopulations accelerates the global spread of infectious diseases on metapopulation networks [24] . 
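the next-generation computation above can be checked numerically. the sketch below rebuilds f and v from the verbal description of the limiting system (3.7) (the explicit matrices are lost in extraction, so the entries are our reading of that description) and confirms that the spectral radius of f v^{-1} equals β/γ regardless of δ and q, provided every node has at least one second neighbor.

import numpy as np

def basic_reproduction_number(beta, gamma, delta, q, A, B, W1, W2):
    # f collects new infections (beta on the diagonal at the dfe); v collects recovery
    # plus mobility out of and into each subpopulation
    n = A.shape[0]
    F = beta * np.eye(n)
    V = (gamma + delta) * np.eye(n)
    V -= delta * (1 - q) * (A * W1).T   # arrivals from neighbors
    V -= delta * q * (B * W2).T         # arrivals from second neighbors
    return max(abs(np.linalg.eigvals(F @ np.linalg.inv(V))))

# 5-cycle example: its snn is also a 5-cycle, so every node has second neighbors
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)
A2 = A @ A
B = ((A2 > 0) & (A == 0)).astype(float)
np.fill_diagonal(B, 0)
W1 = A / A.sum(axis=1, keepdims=True)   # equal weights along existing links
W2 = B / B.sum(axis=1, keepdims=True)

print(basic_reproduction_number(0.4, 0.2, 0.1, 0.5, A, B, W1, W2))  # ~2.0 = beta/gamma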
notice that if r 0 < 1, then e 0 is locally asymptotically stable; while r 0 > 1, e 0 is unstable. in fact, we can further prove the global stability of e 0 . we give the following lemma first. proof. first, we will show that i i ( t ) > 0 for any t > 0 and i = 1 , . . . , n and initial value i (0) ∈ n . otherwise assume that there exist an i 0 ∈ { 1 , . . . , n } and t 0 > 0, such that then i i 0 (t * ) > 0 , but the definition of t * implies i i 0 (t * ) ≤ 0 , which is a contradiction. second, we show that for any t ≥ 0 , i i (t) ≤ n * i and i = 1 , . . . , n . for any initial value i (0) ∈ n , let x i (t) = n * i − i i (t) . according to (3.7) , we have the following system: we will show that for any t > 0, x i ( t ) > 0. if this is not true, there exist an i 0 (1 ≤ i 0 ≤ n ) and t 0 > 0, such that obviously, this is also a contradiction. thus i i 0 (t) ≤ n * define an auxiliary linear system, namely, (4.1) the right side of (4.1) has coefficient matrix f since (4.1) is a linear system, the dfe of this system is globally asymptotically stable. by the comparison principle, each non-negative solution of (3.7) satisfies lim t→ + ∞ notice that e 0 is locally asymptotically stable, thus e 0 is globally asymptotically stable. next, we study the existence and global stability of ee to system (3.7) . proof. to prove the existence and global stability of endemic equilibrium, we will use cooperate system theory in corollary 3.2 in [25] . in fact, let f : n → n be defined by the right-hand side of (3 .7) , note that for ∀ α ∈ (0, 1) and i i > 0, thus f is strong sublinear on n . by lemma 2 and corollary 3.2 in [25] , we conclude that system (3.7) admits a unique ee e * = (i * 1 , . . . , i * n ) which is globally asymptotically stable. prior to simulating an infectious disease spreading on a metapopulation network coupled with its snn, it is necessary to make it clear that what network topology is and which distribution the next-nearest degree follows. in [26] , newman derived an expression of probability p (2) k as follows: here q k means that the probability of excess degree is exactly k , and it is given by q k = (k + 1) p (1) k +1 / k . for a network with small size of nodes or simple structures (such as regular network), this probability is easily calculated. while for a general complex network, it is complex to calculate p (2) k directly. the introduction of generating function makes this problem easier, but extracting explicit probability distribution for next-nearest degrees is quite difficult. in fig. 4 , we illustrate p (2) k for two kinds of networks with average degree 7: homogeneous networks whose degrees follow a poisson distribution and heterogeneous networks with degrees power-law distributed, which makes the distribution of next-nearest degrees clearer. obviously, heterogeneity of network structures makes a big difference on p (2) k . referring to degrees of nodes poisson distributed, the probability distribution of next-nearest degrees is almost symmetric about k = 49 , the average next-nearest degree (the average number of second neighbors). in contrast, in the case that degrees of nodes follow a power-law distribution, next-nearest degrees present high heterogeneity with k range from 4 to 537. however, there is something in common that p (2) in section 4 , we calculate r 0 , which is independent of transfer rate, meaning that transfer rate has little impact on the stability of the system. 
hence, to know the significance of the transfer rate or the snn, we simulate an sis infectious disease on three kinds of metapopulation networks coupled with their respective snns. metapopulation networks with 2000 subpopulations are generated following the molloy and reed algorithm [27] . for the sake of simplicity, we assume that individuals in the same subpopulation travel along each link with the same probability. hence, the weights of links are w (1) i j = 1 /k (1) i for the metapopulation network, and w (2) i j = 1 /k (2) i for its snn. with regard to each subpopulation i , the initial size of its population depends on the degree of this subpopulation, i.e., n i = k (1) i n / 〈k〉 , where n denotes the average population of the whole network. focusing on the effect of the transfer rate, we keep the other parameters unchanged, and they are n = 1000, β = 0.4, γ = 0.2, δ = 0.1. first, we consider the simplest connected metapopulation network, whose nodes are arranged in a straight line (named the linear metapopulation network), and simulate the spread of disease on this network coupled with its snn (shown in fig. 5 ). this network can be regarded as a regular network with degree 2 when the number of nodes is large enough. in each panel, we make a comparison among three values of the transfer rate: q = 0 for no transfer, q = 0.5 for half of travelers choosing to transfer, and q = 1 for all individuals traveling along the snn. from fig. 5 , the fractions of infected subpopulations and infectious individuals both increase almost linearly, and the speed of disease transmission when q = 0.5 is nearly twice as fast as that of the other two cases. when the transmission process reaches a steady state, the fractions of infected subpopulations and infectious individuals when all individuals travel along the snn are nearly half of those of the other two cases. the reason is obvious: the snn is not connected, and it is composed of two linear subnets (shown in fig. 6 ). the change in network connectivity hinders the spread of disease to some degree. however, a moderate transfer rate does accelerate the transmission of disease. these two results hold for all cases where the metapopulation network is connected while its snn is not. in these cases, controlling direct flights may limit the spread of disease to a relatively small area. second, we investigate two kinds of typical networks with the same average degree 7, and let the networks and their snns keep the same network connectivity. as illustrated in figs. 7 and 8 , comparing the left and right panels of these two figures, we find that in the early phase of transmission the disease breaks out in a small number of subpopulations and the number of infectious individuals increases slowly. when there exist infectious individuals in the majority of subpopulations, the fractions of infectious individuals rise sharply and then reach a steady state in a short time. it is easily seen that the behavior of individual transfer accelerates the transmission of disease. increasing the transfer rate paves the way for infectious individuals transmitting disease to more susceptible subpopulations. however, the transfer rate has little effect on the final fraction of infectious individuals, which is consistent with the theoretical results in section 4 . although the transfer rate contributes to the global spread of infectious diseases, its effect differs with the heterogeneity of next-nearest degrees. in fig. 8 , for power-law networks, a moderate transfer rate is most conducive to the spread of infectious diseases.
however, when q = 1 , the speed of transmission is slightly slower than that of q = 0 . 5 . referring to poisson networks (shown in fig. 7 ) , along the increase of transfer rate, the speed of spread displays an increase trend, at odds with power-law networks. this is owing to next-nearest degrees. for power-law networks (see fig. 4 (a) ), with the number of second neighbors climbs from 4 to 537, weights of links of the snn gradually decrease, which lower the probability of individuals traveling to second neighbors. in contrast, the distribution of next-nearest degrees for poisson network is relatively concentrated. under the circumstances, controlling direct flights may accelerate the global spread of disease. the role played by indirected flights can not be ignored. maybe no traveling is the best measure. in this paper, we took neglected indirect flights into account, and put forward a definition of snn for a undirected network. similar to general networks, we defined adjacency matrix, next-nearest degree and its distribution on this network. upon these bases, we proposed an ordinary differential equation group to curve the effect of transfer rate on the global transmission of an infectious disease. next, we obtained the limiting system of the model and gave the expression of the basic reproduction number, which depends only on disease parameters. further, the global stability of dfe and ee has been proven. then, we presented some simulation results on three kinds of connected metapopulation networks with different average degrees and different degree distributions. one is a linear metapopulation network with average degree approximately equaling to 2, and the other two are with the same average degree 7. we find that if the snn is not connected, controlling direct flights may hinder the spread of disease. on the contrary, if the snn is also connected, controlling direct flights may accelerate the spread of disease. it is in common that moderate transfer rate contributes to the global spread of infectious diseases. in detail, for a linear network, the numbers of infected subpopulations and infectious individuals increase almost linearly. for a poisson network, the dominant role is second neighbor because of its relatively homogeneous distribution. however, for the other two networks, moderate transfer rate is most conducive to the spread of infectious diseases, which means that although the existence of second neighbors may promote the global transmission of infectious diseases, the roles played by neighbors are still significant. therefore, when an infectious disease occurs, the governments should adjust measures to local conditions. that is, if the network connectivity is reduced after controlling direct flights, this measure is effective; otherwise, if the network connectivity keep the same with the original network, this measure fails. it may be more effective to control all flights (direct or indirect) properly. our studies shed lights on disease control and prevention. but there may be some problems to be solved. when people travel, they may traverse more than one place before reaching their destinations. this case is rare for aviation network but common for high-speed train and rail networks. under similar hypotheses, our model can be popularized to the thirdneighbor network, the fourth-neighbor network, and so on, and then be applied to high-speed train and rail networks. 
it is worth noticing that more stops may lead to according change of time scale, such as a general train, whose speed is so slow that it is time-consuming traveling between two places with long distance. forecast and control of epidemics in a globalized world pandemic potential of a strain of influenza a (h1n1): early findings human infection with a novel avian-origin influenza a (h7n9) virus assessing the pandemic potential of mers-cov assessing the impact of travel restrictions on international spread of the 2014 west african ebola epidemic potential for zika virus introduction and transmission in resource-limited countries in africa and the asia-pacific region: a modelling study epidemic modeling in metapopulation systems with heterogeneous coupling pattern: theory and simulations rendezvous effects in the diffusion process on bipartite metapopulation networks contagion dynamics in time-varying metapopulation networks effects of local population structure in a reaction-diffusion model of a contact process on metapopulation networks epidemic spread on interconnected metapopulation networks epidemic spreading by objective traveling modeling human mobility responses to the large-scale spreading of infectious diseases safety-information-driven human mobility patterns with metapopulation epidemic dynamics interplay between epidemic spread and information propagation on metapopulation networks phase transitions in contagion processes mediated by recurrent mobility patterns invasion threshold in structured populations with recurrent mobility patterns heterogeneous length of stay of hosts' movements and spatial epidemic spread human mobility and spatial disease dynamics mean-field diffusive dynamics on weighted networks a multi-species epidemic model with spatial dynamics reproduction numbers and sub-threshold endemic equilibria for compartmental models of disease transmission nonnegative matrices moment closure of infectious diseases model on heterogeneous metapopulation network global asymptotic behavior in some cooperative systems of functional differential equations networks: an introduction a critical point for random graphs with a given degree sequence key: cord-308249-es948mux authors: dokuka, sofia; valeeva, diliara; yudkevich, maria title: how academic achievement spreads: the role of distinct social networks in academic performance diffusion date: 2020-07-27 journal: plos one doi: 10.1371/journal.pone.0236737 sha: doc_id: 308249 cord_uid: es948mux behavior diffusion through social networks is a key social process. it may be guided by various factors such as network topology, type of propagated behavior, and the strength of network connections. in this paper, we claim that the type of social interactions is also an important ingredient of behavioral diffusion. we examine the spread of academic achievements of first-year undergraduate students through friendship and study assistance networks, applying stochastic actor-oriented modeling. we show that informal social connections transmit performance while instrumental connections do not. the results highlight the importance of friendship in educational environments and contribute to debates on the behavior spread in social networks. social environment has a significant impact on individual decisions and behavior [1] [2] [3] . people tend to assimilate the behavior, social norms, and habits of their friends and peers. 
it is empirically shown that social interactions play a key role in the spread of innovations [4] , health-related behavior [5, 6] , alcohol consumption and smoking [7, 8] , delinquent behavior [9, 10] , happiness [11] , political views [12, 13] , cultural tastes [14] , academic performance [15] [16] [17] [18] [19] . although there is an extensive body of research showing that a large proportion of social practices disseminates across social networks [3] , the question of what types of social contacts cause the spread of specific behavior remains open [1, 20] . in this paper, we analyze the diffusion of academic performance across different types of student social networks. while these social networks are extensively studied in the literature [15] [16] [17] [18] [21, 22] , there is a lack of agreement on whether social networks are effective channels for the academic performance spread [16, 18, 23] . and if they are, what types of networks serve best for the propagation of academic-related behavior? we analyze the spread of academic achievements within two different social networks of first-year undergraduate students. we test two mechanisms of academic performance diffusion in the student social networks. first, we analyze the academic performance spread through the friendship network, which can be considered as a network of informal social interactions. second, we study the academic performance spread in the study assistance network, which is aimed at study-related information and knowledge transmission, and serves for problem-solving [24] . we apply the stochastic actor-oriented model (saom) for joint modeling of network and behavior dynamics [25] . we model the evolution of two social systems. in the first model, we analyze the coevolution of the friendship network and academic achievements. in the second model, we study the coevolution of the study assistance network and academic achievements. both models are controlled for a variety of structural and behavioral properties, such as a tendency to form mutual ties and to befriend similar others. results show that academic performance spreads through friendship connections, while the study assistance ties do not cause the performance transfer. social networks are the pathways for behavior transmission. this process may be guided by various factors, including the network topology [26] , the type of propagated behavior, the nature of social contacts, and other features of the social environment. the majority of recent studies on this topic are concentrated on the structural properties of the networks that drive the behavior diffusion processes. for example, it was experimentally shown that short average path length and high clustering cause a faster behavior spread [27] , which can be explained by the formation of dense network communities with fast information and behavior exchange within these cohesive groups. in [28] it was demonstrated that there are differences in spreading processes initiated by well-connected actors, or hubs, and by actors with a few social connections. hubs are effective in information propagation due to their high number of connections, while actors with a few ties are more efficient in spreading messages that are controversial or costly. the probability of behavior adoption by an individual is also highly correlated with the number of social contacts that directly influence this individual [29] .
the influence by many peers, or so-called "complex contagion", results in faster and easier behavior adoption than the influence by one person, or "simple contagion" [26] . the efficacy of social contagion is often associated with the type of propagated behavior. centola and macy outline the danger of treating social contagion studies in a 'whatever is to be diffused' way [30] . for example, the adoption of information is much less risky, costly, and time-consuming than the adoption of health-related behavior, sports habits, and academic achievements. the nature of social ties is also a significant factor for behavior transmission. social connections are traditionally divided into "weak" and "strong" ties, and they exhibit completely different spreading patterns [31] . strong ties are formed within dense network communities such as family or friends, while weak ties, according to granovetter's definition, emerge over the course of life and represent people who are marginally included in the network of contacts, such as old college friends or colleagues [31] . empirical literature shows that both types of relationships can serve as channels for the diffusion of behavior or information [1, 20] , but weak ties are important instruments for information propagation, while strong ties are more successful in costly behavior transmission. although the vast array of theoretical and empirical studies has improved our understanding of behavior transmission processes, there is still an open question regarding the differences in behavior spread across networks of different natures. social ties can vary both in the level of their strength and intensity, as we outlined above, and in their origins. networks can be based on friendship, romantic ties, advice seeking, social support, and many other relationships. despite the huge variance in social network types, the majority of the research on social diffusion is concentrated on networks of friendship ties. however, relationships of a distinct nature can result in completely different behavior transmission processes. in this paper, we consider the transmission of academic performance within student social networks. this process has attracted the attention of researchers since the publication of the famous "coleman report" [21] . this report showed that students tend to obtain similar grades as their peers, classmates, and friends, and this effect remains strong after controlling for a variety of socio-economic and cognitive variables. further empirical studies demonstrated the presence of this effect in various case studies. for example, it was shown that a student's grade point average (gpa) increases if her dormmate is in the highest 25th gpa percentile [32] . in [15] , mba students were found to assimilate the grades of their friends and advisers. it was also demonstrated that this social influence is associated with the personal characteristics of students and the nature of their social connections. for instance, lower-achieving students are more influenced by their peers [33, 34] ; the diffusion of academic performance is stronger among women than men [35] , can be related to the race of a peer [36, 37] , and is stronger from close peers such as friends [38] . at the same time, online communication networks do not serve as effective channels for performance transmission. students tend to segregate in online networks based on their performance, and this prevents the diffusion of achievements through online ties [18, 19] .
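a minimal sketch of the simple versus complex contagion distinction, again only an illustration under assumed settings (networkx, a small-world graph, adoption thresholds of one or two adopting neighbours):

```python
# Toy threshold model: "simple" contagion needs one adopting neighbour,
# "complex" contagion needs at least two.
import networkx as nx

def threshold_spread(graph, seeds, k, steps=20):
    """Deterministic threshold adoption; returns final number of adopters."""
    adopters = set(seeds)
    for _ in range(steps):
        new = {v for v in graph.nodes
               if v not in adopters
               and sum((u in adopters) for u in graph.neighbors(v)) >= k}
        if not new:
            break
        adopters |= new
    return len(adopters)

# A clustered small-world graph supplies the redundant ties that
# complex contagion is thought to rely on.
G = nx.watts_strogatz_graph(300, 6, 0.05, seed=2)
seeds = list(G.neighbors(0)) + [0]   # a small, tightly knit seed group

print("simple contagion (k=1):", threshold_spread(G, seeds, k=1))
print("complex contagion (k=2):", threshold_spread(G, seeds, k=2))
```

varying the threshold k and the rewiring probability shows how the reach of complex contagion depends on locally redundant, clustered ties.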
summarizing, the majority of studies demonstrate that social networks are effective channels for performance diffusion. it was shown that achievements spread well within friendship networks, while other types of ties (e.g. online relationships) do not serve as channels for the performance transmission. in this paper, we examine the diffusion of academic achievements in two distinct social networks: friendship and study assistance. we demonstrate that, despite the significant overlap in these networks, they exhibit different patterns of behavior transmission. we analyze the longitudinal data on friendship and study assistance networks and gpa of a first-year student cohort of the economics department in one of the leading russian universities in the 2013-2014 academic year. in this university, students are randomly assigned by the administration to different study groups of up to 30 students. lectures are usually delivered to the whole cohort simultaneously, while seminar classes are delivered to each study group separately. in the first year, most of the courses are obligatory. therefore, students have a limited possibility to form networks with students from other groups, programs, or year cohorts. the academic year consists of four modules of two or three months. at the end of each module, students take final tests and exams. the grading system is on a 10-point scale, where a higher score indicates a higher level of academic achievement. the course grade is the weighted average of midterm and final exams, homework, essays, and other academic activities during the course. the sample consists of 31% males and 69% females. the data for this study were gathered from two sources: the longitudinal student questionnaire survey (3 waves during the first academic year: october 2013, february 2014, and june 2014) and the university administrative database. our dataset consists of 117 students who took part in at least two surveys, with up to 700 connections between them in total. the detailed over-time aspect of the networks gives us a rich dataset of links of diverse nature. the sample can be considered representative of student cohorts in selective universities. in the questionnaire survey, we ask students about their connections within their cohort. the questions were formulated in the following way: 1. please indicate the classmates with whom you spend most of your time together; 2. please indicate the classmates whom you ask for help with your studies. there were no limitations on the number of nominations. additionally, students were asked to indicate those classmates whom they knew before admission to the university. we also gather information about students' study-group affiliation from the administrative database. in total, we have four different network types: friendship, study assistance, knowing each other before studies, and being in the same study group. from the administrative database of the university, we gather data about student performance (grade point average, or gpa, at the end of the first year), which is measured on a scale from 0 to 10.
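one way the four network types described above could be represented in code is sketched below; all student names, nominations and group assignments are invented placeholders, and the use of python/networkx is purely an assumption for illustration (the actual analysis relies on dedicated saom software).

```python
# Sketch of assembling the four network types from nomination-style survey
# answers and administrative records (all data here is invented).
import networkx as nx

students = ["anna", "boris", "vera", "dmitry"]
friend_nominations = {"anna": ["boris"], "boris": ["anna", "vera"],
                      "vera": ["boris"], "dmitry": ["vera"]}
help_nominations = {"anna": ["vera"], "boris": ["vera"],
                    "vera": [], "dmitry": ["vera"]}
study_group = {"anna": 1, "boris": 1, "vera": 2, "dmitry": 2}
knew_before = {("anna", "boris")}            # acquainted before enrollment

def nomination_graph(nominations):
    """Directed graph: an edge ego -> alter for every nomination."""
    g = nx.DiGraph()
    g.add_nodes_from(students)
    for ego, alters in nominations.items():
        g.add_edges_from((ego, alter) for alter in alters)
    return g

friendship = nomination_graph(friend_nominations)
assistance = nomination_graph(help_nominations)
same_group = {(i, j) for i in students for j in students
              if i != j and study_group[i] == study_group[j]}

print(list(friendship.edges()), list(assistance.edges()))
print(sorted(same_group), knew_before)
```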
we transform the data on performance from a continuous to a categorical scale and distinguish four performance groups based on the grading system of the university: high performing students (gpa equal to or higher than 8), medium high performing students (gpa from 6 to 8), medium low performing students (gpa from 4 to 6) and low performing students (gpa lower than 4). it is important to mention that the information about individual student grades is publicly available in this university. this is common in some russian universities but very different from educational systems in the european union and the us. in russian universities, grades are often publicly announced by teachers to the class. in the studied university, final grades are additionally published online on the university website. this creates a specific case in which students know each other's grades and can coordinate their social connections depending on this information. individuals who did not participate in the questionnaire survey were excluded from the analysis (14 individuals, 10.7% of the sample). these missing data were not treated in a special way. we followed the recommendations in [40] , suggesting that 'up to 10% missing data will usually not give many difficulties or distortions, provided missingness is indeed non-informative'. data collection procedures are described in the "data collection" section in s1 file. the descriptive statistics of the sample are presented in the si ("case description" section and tables 1-3 in s1 file). the network visualizations are presented in figs 1-6. standard statistical techniques such as regression models are not applicable for the analysis of social networks due to the interdependence of network observations [39] . therefore, we apply a stochastic actor-oriented model (saom) that allows us to reveal the coevolution of network properties and behavior of actors [25, 40] . this dynamic model is widely used for studying the joint evolution of social networks and actor attributes, and for separating the processes of social selection and social influence. in total, we estimate two models: the first model estimates the coevolution of the friendship network and academic performance, the second one estimates the coevolution of the study assistance network and academic performance. the saom's underlying principles are the following. firstly, network and behavior changes are modeled as markov processes, which means that the network state at time t depends only on the state at time t-1. secondly, saom is grounded on the methodological approach of structural individualism. it is assumed that all actors are fully informed about the network structure and attributes of all other network participants. thirdly, time moves continuously and all the macro-changes of the network structure are modeled as a result of a sequence of corresponding micro-changes. this means that an actor, at each point in time, can either change one of the outgoing ties or modify his or her behavior. the last principle is crucial for the separation of the social selection and social influence processes. there are four sub-components of the coevolution of network and behavior: network rate function, network objective function, behavior rate function, and behavior objective function [25, 40] . the rate functions represent the expected frequencies per unit of time with which actors get an opportunity to make network and/or behavioral micro-changes [40] .
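to make the micro-step logic concrete, here is a minimal numeric sketch: in an saom, the probability of each candidate micro-change is a multinomial logit of the objective-function values defined below (eq 1). the effect names, statistics and parameter values are hypothetical, not the estimates reported in the paper, and real estimation is done with dedicated software such as the rsiena package for r.

```python
# Minimal numeric sketch of an SAOM network micro-step: choice probabilities
# are a multinomial logit of objective-function values f_i = sum_k beta_k * s_ki.
# All effect names, statistics and parameter values below are hypothetical.
import math

beta = {"density": -1.5, "reciprocity": 2.0, "gpa_similarity": 0.8}

candidate_stats = {
    "add tie to friend-of-friend": {"density": 1, "reciprocity": 1, "gpa_similarity": 1},
    "add tie to stranger":         {"density": 1, "reciprocity": 0, "gpa_similarity": 0},
    "drop an existing tie":        {"density": -1, "reciprocity": -1, "gpa_similarity": 0},
    "keep network as it is":       {"density": 0, "reciprocity": 0, "gpa_similarity": 0},
}

f = {c: sum(beta[k] * s for k, s in stats.items())
     for c, stats in candidate_stats.items()}
z = sum(math.exp(v) for v in f.values())
for c, v in f.items():
    print(f"{c:30s} p = {math.exp(v) / z:.2f}")
```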
the objective functions are the primary determinants of the probabilities of changes. the probabilities of the network and/or behavior change are higher if the values of the objective functions for the network/behavior are higher [25, 40] . the objective functions for the network (eq 1) and behavior change (eq 2) are calculated as a linear combination of a set of components called effects: f_i(β, x) = Σ_k β_k s_ki(x) (eq 1) for the network and f_i^z(β^z, x, z) = Σ_k β^z_k s^z_ki(x, z) (eq 2) for the behavior, where s_ki(x) are the analytical functions (also called effects) that describe the network tendencies [40] ; s^z_ki(x, z) are functions that depend on the behavior of the focal actor i, but also on the behavior of his or her network partners and a network position [22] ; β_k and β^z_k are statistical parameters that show the significance of the effects. saom coefficients are interpreted as logistic regression coefficients. parameters are unstandardized; therefore, the estimates for different parameters are not directly comparable. during the modeling, saom allows the inclusion of endogenous and exogenous covariates. as endogenous variables, we include in our models network density, reciprocity, popularity, activity, transitivity, 3-cycles, transitive reciprocated triplets, and betweenness [40] . density and reciprocity show the tendency of students to form any ties and to form mutual ties. transitivity, 3-cycles, transitive reciprocated triplets, and betweenness measure the propensity of students to form triadic connections with their peers. popularity and activity are included to control for the tendency of actors to receive many ties from others and to nominate a large number of actors. to control for social selection, we include the selection effect based on academic achievement. it shows whether students with similar levels of academic achievements tend to form connections with each other. we also controlled for the tendency of students with high grades to increase their popularity and activity over time. to test the presence of social influence, we include the effect of performance assimilation. it shows whether students tend to assimilate the academic achievement levels of their peers. in addition, we controlled for the propensity of students with high levels of popularity and activity to change their academic performance. in the model construction we follow the general network modeling requirements necessary for saom [40] . all research protocols were approved by the hse (higher school of economics) committee on interuniversity surveys and ethical assessment of empirical research. all human subjects gave their informed verbal consent prior to their participation in this research, and adequate steps were taken to protect participants' confidentiality. table 1 presents the modeling results of two separate models. in the first, we model the coevolution of the friendship network and academic performance. in the second one, we model the coevolution of the study assistance network and academic performance. social influence [effect 24] is positive and significant in the friendship social network. this means that the academic performance of students tends to become similar to the performance of their friends. in other words, academic achievements diffuse through friendship ties. in the study assistance network, however, social influence is not present. this indicates that students
do not assimilate the performance of their study assistants; this network channel does not propagate the spread of academic achievements. the positive indegree effect [effect 25] suggests that students who are often asked for help increase their performance over time. the non-significant estimates of the linear and quadratic shape parameters [effects 22 and 23] for friendship indicate that the influence of peers sufficiently explains the performance dynamics [15] . the negative effect of the quadratic shape parameter [effect 23] for the study assistance network shows the convergence of the academic performance to a unimodal distribution [15] . the effect of performance selection [effect 19] is positive for the study assistance network. it suggests that students with similar levels of academic achievements tend to ask each other for help. the effect of social selection in the friendship network is not significant. this means that students do not have a preference to befriend students with similar academic achievements. positive estimates for the performance of alter [effect 17] in both social networks suggest that individuals with high performance are popular in friendship and study assistance networks. the positive effect of the performance of ego [effect 18] in the friendship network shows that high performing students tend to create friendship connections. we find the presence of gender homophily [effect 16] in both friendship and study assistance social networks. students tend to create friendship and study assistance connections with individuals of the same gender. the positive effect of ego for males [effect 15] in the friendship network suggests that males tend to nominate more friends. the network control effects [effects 3, 4, 7, 8, 9, and 10] that were included in the models show expected signs and significance scores, as in most student social networks [25] . the estimates also show that transitivity is less important for friendship ties when reciprocity is present (and vice versa) [41] . the combination of negative betweenness [effect 10] and positive transitivity [effect 7] in both networks demonstrates that individuals do not seek brokerage positions and do not want to connect peers from different network communities and study groups. the positive activity effect [effect 6] in the friendship network indicates that students with many ties tend to create new friendship relationships. the positive effect of popularity [effect 5] in the study assistance network suggests that individuals ask for help from those students who are often asked for help by others. in the friendship network this effect is negative, which means that students do not tend to befriend popular individuals, i.e. those who already have a lot of friends. in both networks, rate parameters are larger in the first period than in the second, indicating that tie formation stabilizes over time. the modeling results also show that students tend to create friendship and study assistance ties with individuals they knew before enrollment [effect 11] and individuals from the same study group [effect 12]. also, students tend to create friendship connections with their study assistants [effect 13.1] and they seek study assistance from their friends [effect 13.2]. we conducted the time heterogeneity test for both network models [40] .
this test is used to examine whether the parameter values β_k of the objective function are constant over the periods of observation. we find time heterogeneity in both models. in both networks, parameters such as betweenness, acquaintance before enrollment, and popularity and activity of the high performing individuals are heterogeneous. in the friendship network, there is also time heterogeneity for gender of alter and ego, and for performance social selection and influence. in the study assistance network, we find time heterogeneity for studying in the same group, gender of ego, and performance of ego. (table 1 note: significance codes *** p < 0.001, ** p < 0.01, * p < 0.05. the models converged according to the t-ratios for convergence and the overall maximum convergence ratio criteria suggested in [40] . goodness of fit is adequate for all models.) the cases of previous acquaintance or being in the same study group can be explained by the nature of these types of ties. for instance, the acquaintance before enrollment can play a significant role at the beginning of studies, while after several months students will tend to expand their networks and will not seek connections with individuals they knew before studies. the same explanation can be used for the case of being in the same study group. at the beginning of studies, students will form ties within their study groups, but later they will tend to expand their network and form ties with other group members. differences in time heterogeneity of academic achievements may be related to the decreased statistical power of these effects between different models. the effects of academic performance on network evolution processes may be understood in detail by considering all the performance-related effects simultaneously [40] . in table 2 , we present log-odds for the performance selection within different achievement groups. the higher the estimate, the higher the probability of a study assistance tie formation between students from different performance groups. table 2 shows that there is a significant tendency toward selection of high-performing individuals as study assistants, and this tendency is present among all groups of students. similarly, in table 3 we present precise estimates for the social influence process for all achievement groups. each row of the table corresponds to a given average behavior of the friends of an ego. values in the row show the relative 'attractiveness' of the different potential values of ego's behavior. the maximum diagonal value indicates that for each value of the average friends' behavior the actor 'prefers' to have the same behavior as all these friends [40] . this shows that individuals tend to assimilate their friends' performance. in this paper we explore the academic performance diffusion through two social networks of different natures: friendship and study assistance. we empirically confirm that educational outcomes of students are diffused in different ways within friendship and study assistance networks. ties in the friendship network transmit academic achievements, while ties in the study assistance network do not.
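a short numeric illustration of how log-odds rows like those in tables 2 and 3 can be read: exponentiating and normalising the entries of a row gives the relative probability of each option. the numbers used below are hypothetical, not the published estimates.

```python
# Convert a hypothetical row of log-odds into relative probabilities.
import math

def row_to_probs(log_odds):
    weights = [math.exp(v) for v in log_odds]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical 'attractiveness' of ego's four performance levels given that
# ego's friends are, on average, in the high-performing group:
# order = [low, medium-low, medium-high, high].
row_high_friends = [-1.2, -0.4, 0.3, 1.1]
print([round(p, 2) for p in row_to_probs(row_high_friends)])
# The largest probability falls on the option matching the friends' level,
# which is the "maximum diagonal value" pattern described in the text.
```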
the absence of the social influence process alongside the presence of social selection in the study assistance network may suggest the presence of social segregation based on performance [19] . this can be related to the high competitiveness of the university environment under study. we expect that some students are highly motivated to receive higher grades and prefer to invest time and effort in their own high academic results rather than help their less academically successful peers. our findings demonstrate that the efficacy of academic achievement diffusion is determined by the nature of the social network. it was established that social integration in the classroom is positively associated with higher academic performance of students [17, 19] . here we claim that it is extremely important to integrate individuals specifically in the network of informal friendship interactions and motivate them to create connections with higher-performing students. these findings support the idea that the nature of social relationships is crucial for the transmission of specific types of information and behavior in social networks. close friendship relationships serve as effective channels for the spread of various complex behaviors, including very costly behavior types such as health behavior [1] . academic performance is one example of these behavior types that are not easily transmittable. in contrast, the instrumental study assistance ties do not produce the propagation of academic achievements from successful students to their lower-performing peers. to sum up, we show that costly and complex behavior (such as academic achievement) diffuses more effectively in a network of strong, close connections such as friendship. these findings contribute to the current debates on behavior propagation in social networks and offer new insights into factors that impact the success of behavior transmission. this study has several limitations. first, we analyze social networks of first-year students. this time frame, when students start their educational path at the undergraduate level, receives a lot of attention in the literature [17, 19, 42] due to the fast speed of social tie formation. at the same time, it would be beneficial to investigate the diffusion of academic achievements through social networks over the full period of studies. second, we examine only two types of social relations; however, the spectrum of social ties that can serve as channels of performance diffusion is much wider. it is a potential avenue for future studies to estimate the efficacy of other types of social networks, such as cooperation, competition, romantic relationships, and negative ties, in the process of academic achievement diffusion. the data on some of these networks are difficult to collect (e.g., negative relationships) due to the high sensitivity of the studied relationships, but these types of ties can nevertheless be significant for behavior transmission. in the time of the covid-19 pandemic and after it, it is also extremely important to examine the effect of online networks on academic performance transmission because online interaction remains the only communication channel for students. our empirical findings have several policy implications. academic achievements are one of the key components of financial success and individual well-being [43, 44] , which makes performance increase one of the main goals of the educational system.
however, individual achievements are quite stable and largely driven by heritable factors [45] , which makes interventions aimed at academic performance growth highly complex and difficult to implement. one of the possible mechanisms of performance increase is social influence, as we show in this paper. teachers can pay additional attention to the development of informal friendship relationships between students with various performance levels during classes. this can be achieved by group work assignments in which group membership is defined by the teachers and is not based on the personal preferences of students. long-term group assignments, such as working on a research project together, can stimulate students from different achievement groups to develop friendship ties with each other. the creation of recreation and open spaces within the university building can also give additional options for students with distinct performance to meet, interact, and form friendship ties. the combination of these actions would help students to build and sustain their informal networks, which, in turn, serve as key channels of academic performance diffusion and lead to a positive behavior change. supporting information s1 file. (docx) conceptualization: sofia dokuka, diliara valeeva, maria yudkevich. formal analysis: sofia dokuka. references: how behavior spreads: the science of complex contagions; social cohesion; social contagion theory: examining dynamic social networks and human behavior; network interventions. science; the spread of obesity in a large social network over 32 years; friendship as a social mechanism influencing body mass index (bmi) among emerging adults; dynamics of adolescent friendship networks and smoking behavior; teen alcohol use and social networks: the contributions of friend influence and friendship selection; peer influences on moral disengagement in late childhood and early adolescence; ainhoa de f. why and how selection patterns in classroom networks differ between students. the potential influence of networks size preferences, level of information, and group membership; dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the framingham heart study; a 61-million-person experiment in social influence and political mobilization; peer networks and the development of illegal political behavior among adolescents; social selection and peer influence in an online social network; why are some more peer than others? evidence from a longitudinal study of social networks and individual academic performance; academic achievement and its impact on friend dynamics; integration in emerging social networks explains academic failure and success; formation of homophily in academic performance: students change their friends rather than performance; the rich club phenomenon in the classroom; complex contagions: a decade in review.
in: complex spreading phenomena in social systems; equality of educational opportunity. us department of health, education, and welfare, office of education; changing friend selection in middle school: a social network analysis of a randomized intervention study designed to prevent adolescent problem behavior; it's not your peers, and it's not your friends: some progress toward understanding the educational peer effect mechanism; norms, status and the dynamics of advice networks: a case study; introduction to stochastic actor-based models for network dynamics; the social origins of networks and diffusion; the spread of behavior in an online social network experiment. science; anomalous structure and dynamics in news diffusion among heterogeneous individuals; threshold models of collective behavior; complex contagions and the weakness of long ties; the strength of weak ties; peer effects with random assignment: results for dartmouth roommates; peer effects in academic outcomes: evidence from a natural experiment; does your cohort matter? measuring peer effects in college achievement; parental and peer influences on adolescents' educational plans: some further evidence; the effects of sex, race, and achievement on schoolchildren's friendships; students' characteristics and the peer-influence process; classroom peer effects and student achievement; a tutorial on methods for the modeling and analysis of social network data; reciprocity, transitivity, and the mysterious three-cycle. soc networks; short-term and long-term effects of a social network intervention on friendships among university students; intelligence predicts health and longevity, but why?; influence of sex and scholastic performance on reactions to job applicant resumes; the stability of educational achievement across school years is largely explained by genetic factors. key: cord-319658-u0wjgw50 authors: guven-maiorov, emine; tsai, chung-jung; nussinov, ruth title: structural host-microbiota interaction networks date: 2017-10-12 journal: plos comput biol doi: 10.1371/journal.pcbi.1005579 sha: doc_id: 319658 cord_uid: u0wjgw50 hundreds of different species colonize multicellular organisms making them "metaorganisms". a growing body of data supports the role of microbiota in health and in disease. grasping the principles of host-microbiota interactions (hmis) at the molecular level is important since it may provide insights into the mechanisms of infections. the crosstalk between the host and the microbiota may help resolve puzzling questions such as how a microorganism can contribute to both health and disease. integrated superorganism networks, which consider host and microbiota as a whole, may uncover their code, clarifying perhaps the most fundamental question: how they modulate immune surveillance. within this framework, structural hmi networks can uniquely identify potential microbial effectors that target distinct host nodes or interfere with endogenous host interactions, as well as how mutations on either host or microbial proteins affect the interaction. furthermore, structural hmis can help identify master host cell regulator nodes and modules whose tweaking by the microbes promotes aberrant activity. collectively, these data can delineate pathogenic mechanisms and thereby help maximize beneficial therapeutics. to date, challenges in experimental techniques limit large-scale characterization of hmis.
here we highlight an area in its infancy which we believe will increasingly engage the computational community: predicting interactions across kingdoms, and mapping these onto the host cellular networks to figure out how commensal and pathogenic microbiota modulate host signaling and, more broadly, the cross-species consequences. rather than existing as independent organisms, multi-cellular hosts together with their inhabiting microbial cells have been viewed as "metaorganisms" (also termed superorganisms or holobionts) [1] . millions of commensal, symbiotic, and pathogenic microorganisms colonize our body. together, they comprise the "microbiota". microbiota are indispensable for the host, as they contribute to the functioning of essential physiological processes including immunity and metabolism. hosts co-evolved with the microbiota. while some commensals are beneficial (symbionts), others may become harmful (pathobionts) [2, 3] . microbiota also contribute to immune system development. the immune system recognizes antigens of microorganisms, e.g. dna, rna, cell wall components, and many others, through pattern recognition receptors, such as toll-like receptors (tlrs), and downstream intracellular signaling circuitries are activated to generate immune responses [4] . however, like self-antigens, antigens from commensal microbiota are tolerated with no consequent inflammatory responses. this makes gut microbiota accepted as "extended-self" [5] . still, under some circumstances, commensals may act as pathogens. for example, staphylococcus aureus [6] or candida albicans [7] are commensals of humans, but in "susceptible" hosts, they can undergo a commensal-to-pathogen transition. thus, identifying microorganisms that reside in the host, and within these, those that are responsible for distinct host phenotypes, and the host pathways through which they act are significant goals in host-microbiota research. microbiota survival strategies within the host are likely to be limited. analysis of their repertoire may reveal core modules, thereby helping in classification, mechanistic elucidation and profile prediction. here we provide an overview of structural host-microbiota interaction networks from this standpoint. the host interacts with microbiota through proteins, metabolites, small molecules and nucleic acids [8, 9] . the microbiota employs a range of effectors to modulate host cellular functions and immune responses. they have sophisticated relationships with the host, and network representation enables an effective visualization of these relationships [10] . most proteins of bacterial and eukaryotic pathogens are not accessible to bind to host proteins; but some of their proteins either bind to host surface receptors [11] or enter the host cell and interact with host cytoplasmic proteins. various bacterial species have a secretion system, a syringe-like apparatus, through which they inject bacterial effectors directly into the host cell cytoplasm [12] . via hmis, they specifically hone in on key pathways, alter host physiological signaling, evade the host immune system, modify the cytoskeletal organization [13, 14] , alter membrane and vesicular trafficking [2, 11, 13] , promote pathogen entry into the host, shift the cell cycle [15, 16] , and modulate apoptosis [17] . all are aimed to ensure their survival and replication within the host. host signaling pathways that are targeted by microbiota and turned on or off may change the cell fate.
unraveling the hmis for both commensals and pathogens can elucidate how they repurpose the host signaling pathways and help develop new therapeutic approaches. hmis have complex and dynamic profiles. studies often focus on individual protein interactions and try to explain the pathogenicity of a microorganism with a single interaction. however, considering host-microbiota interactions one at a time may not reflect the virulence scheme [18] . for instance, replication of vaccinia virus necessitates the establishment of a complex protein interaction network [19] , and hence focusing on only one hmi is incomplete and may be misleading. at any given time, hundreds of different species reside in the gut. different microbial compositions, and hence effector protein combinations from these microbial species, may have additive (cross-activation) or subtractive (cross-inhibition) [4] impacts on the host pathways, which lead to signal amplification or inhibition, respectively (fig 1) . since numerous bacteria will be sensed by the host immune system at any given time, more than one signaling cascade will be active in a cell. communication and crosstalk among active, or active and inhibited, pathways determine the ultimate cellular outcome [4] : to survive, die, or elicit immune responses. the combinatorial ramifications of all active (or suppressed) host pathways and hmis will be integrated to shape the type and magnitude of the response, and thus the cell state. to tackle the pathogenicity challenge, it is reasonable to concomitantly consider all host pathways and hmis. the transkingdom (metaorganism) network analysis is a robust research framework that considers host and microbiota as a whole [1] . systems biology approaches that integrate the hmis with host endogenous protein interaction networks reveal the systematic trends in virulence strategies of pathogens. here we ask how interspecies (superorganism) networks can facilitate the understanding of the role of microbiota in disease and health. we focus on host-microbiota protein interaction networks since many bacteria or virus-induced pathological processes require physical interactions of host and microbial proteins [20] . the availability of genome-wide high throughput omics data makes it possible to associate microbiota with certain host phenotypes at multiple levels and construct host-pathogen interaction networks at the transcriptome [21], proteome [22], and metabolome levels [23] . (fig 1: combinatorial effects of microbial effectors and the active host pathways determine the cell response. (a) composition1 has certain microorganisms that secrete effector protein combination1. these effectors activate pathway1 in the host, which produces pro-inflammatory cytokines. (b) composition2 secretes effector combination2 and activates pathway2 in addition to pathway1. additive effects of these two pathways amplify the signal and promote inflammation (cross-activation). (c) microbial composition3 utilizes effector combination3 to activate both pathways 1 and 3, which have opposing outcomes. subtractive effects of these pathways result in no inflammation (cross-inhibition).)
within this framework we highlight molecular mimicry, a common strategy that microorganisms exploit to bind to host proteins and perturb its physiological signaling. mimicry of interactions of critical regulatory nodes in core network modules in the immune system, may be a major way through which pathogens adversely subvert-and commensal microbiota may beneficially modulate-the host cell. microbiota developed several strategies to interact with host proteins and modulate its pathways. one efficient way is molecular mimicry, which has been extensively reviewed in our recent study [9] . molecular mimicry can take place at four levels: mimicking (i) both sequence and 3d structure of a protein, (ii) only structure without sequence similarity, (iii) sequence of a short motif-motif mimicry, and (iv) structure of a binding surface without sequence similarity-interface mimicry. interface mimicry (protein binding surface similarity) seems to be the most common type of molecular mimicry. global structural similarity is much rarer than interface similarity both within and across species. thus, employing interface mimicry instead of full-length sequence or structural homology allows microbes to target more host proteins. molecular mimicry follows the principle suggested over two decades ago that proteins with different global structures can interact in similar ways [30] [31] [32] . interface mimicry is frequently observed within intra-[33-35] and inter-species [18, 36] (fig 2) (intra-species interface mimicry: distinct proteins from the same species having the same/similar interfaces; inter-species interface mimicry: proteins from different species hijack the same interface architectures). interface similarity allows proteins to compete to bind to a shared target. if an interface is formed between proteins from the same species, it is an 'endogenous interface'. if it is formed by proteins from two different species, it is an 'exogenous interface' [18, 36] . endogenous (intra-species) interfaces mimic each other [33] [34] [35] , and exogenous (inter-species) interfaces mimic endogenous interfaces (fig 2) [18, 36]. by mimicking endogenous interfaces, exogenous interfaces enable pathogenic proteins to compete with their host counterparts and hence rewire host signaling pathways for their own advantage [9] . they can either inhibit or activate a host pathway. for example, the helicobacter pylori secreted protein caga interacts with human tumor suppressor tp53bp2, inhibits apoptosis and allows survival of infected host cells [37] . however, map protein of e. coli and sope protein of salmonella bacteria bind and activate human cdc42, a rho gtpase, and trigger actin reorganization in the host cell, facilitating bacterial entry into the host [38]. one of the most significant pattern recognition receptor families in the innate immune system is the tlr family. its members detect diverse bacterial compounds, like peptidoglycan, lipopolysaccharide, and nucleic acids of bacteria and viruses. they induce pro-inflammatory or anti-viral responses. once activated, they recruit other tir-containing proteins such as mal and myd88 or tram and trif through their cytoplasmic tir domains, forming the myd88-and trif-dependent tir domain signalosomes, respectively [39]. myd88 also assembles into a myddosome structure through its death domain together with irak4 and irak1/2 death domains. 
the myddosome then recruits e3 ubiquitin ligases, either traf6 or traf3, to catalyze the addition of k63-linked ubiquitin chains to themselves, which serve as a docking platform for other proteins to bind, such as tak1. subsequently, nf-κb and mapk pathways are activated. in the nf-κb pathway, tak1 phosphorylates and activates ikk. activated ikk in turn phosphorylates iκb, which is the inhibitor of nf-κb. phosphorylated iκb is then ubiquitylated by other e3 ubiquitin ligases (k48-linked ubiquitin chain) and targeted for proteasomal degradation. this liberates the p65 subunit of nf-κb to translocate to the nucleus and initiate transcription. in the mapk pathway, tak1 serves as a map3k that activates the erk1/2, p38 and jnk pathways. the trif-dependent downstream path of tlrs recruits traf3 and leads to activation of interferon regulatory factors (irfs) and production of key antiviral cytokines, interferons (ifns). the tlr pathway is regulated by several endogenous negative regulators to prevent excess inflammation [40] . since this is one of the major immune pathways, its signaling is targeted by diverse microorganisms at various steps (fig 3) . some microbial proteins compete with endogenous tir-containing proteins, interfere with the assembly of the tir-domain signalosome and prevent downstream signaling. since these microbial proteins do not enzymatically modify the endogenous proteins, elucidation of their inhibition mechanism requires structural information. the availability of the structures of their complexes with the orchestrators of the tlr pathway can clarify how they inhibit downstream signaling. microbial proteases prevent both tlr-induced mapk and nf-κb signaling and lead to proteasomal degradation of the key orchestrators in these pathways: nled of e. coli cleaves jnk and p38, inhibiting the mapk pathway; and nlec cleaves p65, inhibiting nf-κb [46] . there are also bacterial acetyltransferases [57, 58] that inhibit components of this pathway to limit ifn production [59] . here, we listed only a couple of microbial proteins targeting the tlr pathway as examples. there are many others. the tlr pathway does not constitute the whole innate immune system; other immune pathways also need to be considered, as well as how these microbial proteins affect them as a whole. this can help foresee what kind of responses the coordinated actions of these pathways together with tlrs would generate. most cellular processes are elicited by proteins and their interactions. graph representations of ppi networks, where proteins are the nodes and their interactions are edges, are helpful for delineating the global behavior of the network. topological features of networks, such as degree (number of edges), betweenness-centrality (how a node affects the communication between two nodes), lethality-centrality, hubs (proteins with high node-degree, i.e. several interaction partners), non-hubs (with only a few partners), and bottlenecks (nodes with high betweenness-centrality) help characterize the importance of the nodes, i.e. the contribution of the node to network integrity [60, 61] .
(fig 2: a, b, c, d are host proteins and p is a pathogenic protein. protein a has two interfaces: through the blue interface it binds to b and through the grey interface it binds to c and d. c and d employ similar interfaces to bind to a, so endogenous interfaces mimic each other. pathogenic protein p has a similar interface as b and competes to bind to the blue interface on a; in this case, an exogenous interface mimics an endogenous interface. (b) the f1l protein of variola virus interacts with human bid protein (5ajj:ab.pdb) and inhibits apoptosis in the host cell by hijacking the interface between human bid-bclxl (4qve:ab.pdb): an exogenous interface mimicking an endogenous one. human mcl1 protein binds to human bid (5c3f:ab.pdb) in a very similar fashion as bclxl does: endogenous interfaces mimicking each other.) early on, hubs were classified as either party or date hubs. while party hubs interact with many partners at the same time since they use distinct interfaces, date hubs interact with their partners one at a time due to their overlapping interfaces. to infer whether a hub is a party or a date hub, structural information (interface residues) [62] or gene expression data (co-expressed proteins have higher chances of interacting with each other) [63] were used. later on, this definition was questioned. among the reasons were the many examples where a protein node can serve concomitantly as a party and date hub. large assemblies typically fall into this category. biological networks are often scale-free, with many non-hubs and fewer hubs [64, 65] . not all nodes have the same effect on the network: random node attacks do not harm the network as much as removing hubs from scale-free networks [66] . degree and betweenness-centrality are measures of the contribution of nodes to network integrity. there are also "essential" nodes, knock-out of which leads to lethality: a feature also known as "lethality-centrality". attack of a hub by microbiota is likely to influence the cell, either resulting in lethality or in beneficial modulation. thus, integrated superorganism interaction networks may suggest candidate host and microbial node targets. structural interspecies networks and their topological features can shed light on how microbiota alter the host signaling and what the outcome will be in different settings. available hmi networks demonstrate that different bacteria often hijack the same host pathway in distinct ways [12] , like the tlr pathway subversion by numerous microbial species (fig 3) . however, importantly, the same host pathway is often targeted at several nodes, which was suggested to guarantee modulation of cellular function [12] . although there are a number of examples of constructed networks of host-pathogen superorganism interactions [12, 19, [67] [68] [69] [70] [71] [72] [73] [74] [75] , there are many fewer attempts at integrating 3d structural data with the hmi networks [18] . traditional network representation has low resolution, missing important details. however, structural interaction networks provide a higher resolution with mechanistic insights. they can decipher and resolve details that are not obvious in binary interaction networks [36] . the potential of structural networks in unraveling signaling pathways was demonstrated earlier [39, 40, 76, 77] . they are essential to fully grasp the mechanisms exerted by pathogens to divert the host cell signaling and attenuate immune responses. fig 4 displays an example of a structural hmi network, showing how host ppis can be affected by hmis. structures can detail which endogenous host ppis are disrupted by the hmis, possible consequences of mutations on either host proteins or pathogenic proteins, and whether variants of a virulence factor in different strains of the same species have distinct hmis.
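a short python/networkx sketch of the topological measures mentioned above (degree, betweenness-centrality, hubs, bottlenecks) and of the robustness contrast for scale-free networks; the graph size and the number of removed nodes are arbitrary illustrative choices, not values from the paper.

```python
# Topological measures on a synthetic scale-free graph, and the effect of
# removing hubs versus removing random nodes (illustration only).
import random
import networkx as nx

G = nx.barabasi_albert_graph(300, 2, seed=3)

degree = dict(G.degree())
betweenness = nx.betweenness_centrality(G)

hubs = sorted(degree, key=degree.get, reverse=True)[:10]
bottlenecks = sorted(betweenness, key=betweenness.get, reverse=True)[:10]

def largest_component_after_removal(graph, nodes_to_remove):
    """Size of the largest connected component after deleting the given nodes."""
    g = graph.copy()
    g.remove_nodes_from(nodes_to_remove)
    return max((len(c) for c in nx.connected_components(g)), default=0)

random.seed(3)
random_nodes = random.sample(list(G.nodes()), 10)

print("top bottlenecks:", bottlenecks[:3])
print("after removing hubs:        ",
      largest_component_after_removal(G, hubs))
print("after removing random nodes:",
      largest_component_after_removal(G, random_nodes))
```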
for instance, the pro-35 residue on hiv accessory protein vpr is at the interface with human cypa and its mutation to alanine abrogates the interaction [78] . the structure of the cypa-vpr complex shows that pro-35 is at the interface. if the structure of the vpr-cypa complex was unknown, it would have been difficult to understand why, or how, this mutation disrupts the ppi. previously built structural hmi networks demonstrated that endogenous interfaces that are hijacked by pathogens are involved in multiple transient interactions [18, 36] . these endogenous interfaces exhibit 'date-like' features, i.e. they are involved in interactions with several endogenous proteins at different times [18, 36] . hub and bottleneck proteins at the crossroads of several host pathways were suggested to be the major targets of viral and bacterial proteins [26, 28] and interface mimics allow transient interactions with the hub [79] . this allows them to interfere with multiple endogenous ppis. it was proposed that microorganisms causing acute infections, which are dramatic for the host, are likely to interfere with the hubs, whereas others that lead to persistent infections tend to target non-hubs [80] . during acute infection, pathogens replicate very quickly and are transmitted to new hosts. however, during chronic infections, they adapt to the host environment, which allows them to reside there for a long period of time. thus, how microbiota target certain proteins and pathways at the molecular level is of paramount importance. detecting the hmis, mapping them onto networks and determining their 3d structures as a complex are the major steps to construct structural hmi networks. despite the progress in experimental techniques, it is still challenging to determine structures of ppi complexes, particularly hmis. since large-scale experimental characterization of host-pathogen ppis is difficult, time consuming, and costly, experimentally verified hmi data is scarce. it is important to note that available endogenous protein structures are biased towards permanent, rather than transient interactions. if majority of the hmis are transient, this presents another hurdle since they will be under-represented in the structural space. several hmi databases have been developed, such as phisto [81] , hpidb [82] , proteopathogen [83] , patric [84] , phi-base [85] , phidias [86] , hopaci-db [87] , virhostnet [88] , virbase [89] , virusmentha [90] , hcvpro [91] , and likely some others as well. however, these databases cover only a limited number of pathogens and their interactions. given that thousands of species residing in the host, thousands of hmis are yet to be identified. computational approaches are becoming increasingly important in prioritizing putative hmis and complementing experiments. hence, construction of comprehensive metaorganism networks and increasing the coverage of the host-microbiota interactome will still mostly rely on computational models in the near future [92] . computational modeling of intra-species interactions is a well-established area; detection of inter-species interactions is relatively new. available computational tools to predict host-pathogen interactions have been recently reviewed by nourani et al. [93] . current methods mostly depend on global sequence and structure homology. sequence-based methods focus only on orthologs of host proteins. 
however, sequence by itself is insufficient to detect the targets of pathogenic proteins because several virulence factors do not have any sequence homologs in human. for instance, the vaca protein of helicobacter pylori, the most dominant species in gastric microbiota, has a unique sequence that does not resemble any human protein [94] . still, it alters several host pathways [95] . with sequence-based methods, it is impossible to find hmis for vaca. as noted above, global structural mimicry is much rarer than interface mimicry. hence, utilizing interface similarity, rather than global structural similarity in a computational approach would generate a more enriched set of hmi data together with atomic details [9] . several studies suggested that the available interface structures are diverse enough to cover most human ppis [96] [97] [98] [99] . therefore, success of template-based methods for prediction of human ppis is very high [34] . since exogenous interfaces mimic endogenous ones, both available endogenous and exogenous interface structures can be used as templates to detect novel hmis. thanks to the rapid increase in the number of resolved 3d structures of human-pathogen ppis in recent years [100] and advances in structural and computational biology, the performance of interface-based methods is expected to increase. both experimental and computational approaches have false-positives and false-negatives with varying rates depending on the approach. although the coverage of interface-based methods is higher, their false-positive rate is also higher. despite this, attempts to complete the host-microbiota interactome will improve our knowledge of microbiota and their roles in health and disease. advances in host-microbiota research will revolutionize the understanding of the connection between health and a broad range of diseases. building the rewired host-microbiota multiorganism interaction network, along with its structural details, is vital for figuring out the molecular mechanisms underlying host immune modulation by microbiota. topological features of such networks can reveal the selection of host targets by the microbiota. structural details are essential to fully grasp the mechanisms exerted by microbiota to subvert the host immunity. identification of the hmis will also help drug discovery and integrated superorganism networks would suggest how inhibition of an hmi can influence the whole system. here we highlighted the importance of building structural hmi networks. however, not only hmis are important; although to date data are scant, crosstalk among microorganisms is also emerging as critical. alterations in their population dynamics may lead to dysbiosis. signals from gut microbiota resulting from population shifts can affect profoundly several tissues, including the central nervous system. dysbiosis of microbiota is involved in several diseases, such as inflammatory bowel disease [101] , autoimmune diseases (e.g. multiple sclerosis) [102] , neurodegenerative diseases (e.g. parkinson's) [103] , and cancer [104, 105] . identifying bacterial effectors, or effector combinations, which are responsible for specific phenotypes, is challenging. in line with this, recently, parkinson's disease (pd) patients are found to have altered gut microbiota composition [106, 107] . transplanted microbiota from pd patients, but not from healthy controls, induce motor dysfunction and trigger pd in mice. 
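to illustrate the interface-mimicry-based prediction idea discussed above in the simplest possible terms, the toy sketch below matches hypothetical "interface signatures" against a small set of endogenous templates; every identifier is invented, and no real structural data, database, or software api is used here.

```python
# Toy template lookup: if a microbial protein carries an interface similar to
# the one host protein B uses to bind host protein A, it becomes a candidate
# competitor for A. All names and signatures below are invented placeholders.
endogenous_templates = {
    # (host target, host partner): interface signature (hypothetical)
    ("BID", "BCLXL"): "IF-alpha",
    ("BID", "MCL1"):  "IF-alpha",
    ("CDC42", "GEF1"): "IF-beta",
}
microbial_interfaces = {
    "F1L-like effector": "IF-alpha",
    "SopE-like effector": "IF-beta",
    "unrelated effector": "IF-gamma",
}

def candidate_hmis(templates, effectors):
    """Return (effector, host target, note) triples for matching signatures."""
    hits = []
    for effector, signature in effectors.items():
        for (target, partner), template in templates.items():
            if signature == template:
                hits.append((effector, target, f"mimics {partner} interface"))
    return hits

for hit in candidate_hmis(endogenous_templates, microbial_interfaces):
    print(hit)
```

real interface-based pipelines replace the string comparison above with structural alignment of binding surfaces, but the competition logic is the same.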
it is not clear, however, whether dysbiosis triggers pd or arises as a consequence of the disease [103] . the role of microbiota in host health and disease might be even more complex than thought: commensals that were once benign can convert to disease-causing pathogens; different compositions of microbial communities trigger different phenotypes; more than one host pathway is targeted by more than one effector; the same microbial effector/antigen is sensed by several pattern recognition receptors (a back-up mechanism, compensatory microbial sensing [4] ); and genetic variation in hosts results in different responses (i.e. some commensals transition to pathogens only in "susceptible" individuals). current knowledge on microbiota and their interactions with the host is still in its infancy, but given the advances accomplished so far and the attention this field has started to attract, it is likely that many open questions will be resolved soon. references: investigating a holobiont: microbiota perturbations and transkingdom networks; cellular hijacking: a common strategy for microbial infection; diet, microbiota and autoimmune diseases; integration of innate immune signaling; self or non-self? the multifaceted role of the microbiota in immune-mediated diseases; differential expression and roles of staphylococcus aureus virulence determinants during colonization and disease; from commensal to pathogen: stage- and tissue-specific gene expression of candida albicans; a review on computational systems biology of pathogen-host interactions; pathogen mimicry of host protein-protein interfaces modulates immunity; network representations of immune system complexity; anti-immunology: evasion of the host immune system by bacterial and viral pathogens; manipulation of host-cell pathways by bacterial pathogens; structural mimicry in bacterial virulence; structural microengineers: pathogenic escherichia coli redesigns the actin cytoskeleton in host cells; human papillomavirus oncoproteins: pathways to transformation; the human papillomavirus 16 e6 protein binds to tumor necrosis factor (tnf) r1 and protects cells from tnf-induced apoptosis; chronic helicobacter pylori infection induces an apoptosis-resistant phenotype associated with decreased expression of; sars coronavirus papain-like protease inhibits the type i interferon signaling pathway through interaction with the sting-traf3-tbk1 complex; the ny-1 hantavirus gn cytoplasmic tail coprecipitates traf3 and inhibits cellular interferon responses by disrupting tbk1-traf3 complex formation; hantaviral proteins: structure, functions, and role in hantavirus infection; competitive binding and evolvability of adaptive viral molecular mimicry. biochimica et biophysica acta; the importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics; network biology: understanding the cell's functional organization; relating three-dimensional structures to protein networks provides evolutionary insights; evidence for dynamically organized modularity in the yeast protein-protein interaction network; topological properties of protein interaction networks from a structural perspective; scale-free networks in cell biology; lethality and centrality in protein networks; herpesviral protein networks and their interaction with the human proteome; the protein network of hiv budding; epstein-barr virus and virus human protein interaction maps; hepatitis c virus infection protein network.
molecular systems biology a physical and regulatory map of host-influenza interactions reveals pathways in h1n1 infection a physical interaction network of dengue virus and human proteins global landscape of hiv-human protein complexes viral immune modulators perturb the human molecular network by common and unique strategies interpreting cancer genomes using systematic host network perturbations by tumour virus proteins the structural network of interleukin-10 and its implications in inflammation and cancer the structural network of inflammation and cancer: merits and challenges. seminars in cancer biology the host-pathogen interaction of human cyclophilin a and hiv-1 vpr requires specific n-terminal and novel c-terminal domains use of host-like peptide motifs in viral proteins is a prevalent strategy in host-virus interactions targeting of immune signalling networks by bacterial pathogens phisto: pathogen-host interaction search tool hpidb-a unified resource for host-pathogen interactions proteopathogen, a protein database for studying candida albicans-host interaction patric, the bacterial bioinformatics database and analysis resource the pathogen-host interactions database (phi-base): additions and future developments phidias: a pathogen-host interaction data integration and analysis system hopaci-db: host-pseudomonas and coxiella interaction database virhostnet 2.0: surfing on the web of virus/host molecular interactions data virbase: a resource for virus-host ncrna-associated interactions virusmentha: a new resource for virus-host protein interactions hcvpro: hepatitis c virus protein interaction database computational analysis of interactomes: current and future perspectives for bioinformatics approaches to model the host-pathogen interaction space computational approaches for prediction of pathogen-host proteinprotein interactions a tale of two toxins: helicobacter pylori caga and vaca modulate host pathways that impact disease the helicobacter pylori's protein vaca has direct effects on the regulation of cell cycle and apoptosis in gastric epithelial cells structure-based prediction of protein-protein interactions on a genome-wide scale proceedings of the national academy of sciences of the united states of america structural space of protein-protein interfaces is degenerate, close to complete, and highly connected templates are available to model nearly all complexes of structurally characterized proteins structural models for host-pathogen protein-protein interactions: assessing coverage and bias the microbiome in inflammatory bowel disease: current status and the future ahead alterations of the human gut microbiome in multiple sclerosis gut microbiota regulate motor deficits and neuroinflammation in a model of parkinson's disease commensal bacteria control cancer response to therapy by modulating the tumor microenvironment gastrointestinal cancers: influence of gut microbiota, probiotics and prebiotics colonic bacterial composition in parkinson's disease gut microbiota are related to parkinson's disease and clinical phenotype key: cord-155440-7l8tatwq authors: malinovskaya, anna; otto, philipp title: online network monitoring date: 2020-10-19 journal: nan doi: nan sha: doc_id: 155440 cord_uid: 7l8tatwq the application of network analysis has found great success in a wide variety of disciplines; however, the popularity of these approaches has revealed the difficulty in handling networks whose complexity scales rapidly. 
one of the main interests in network analysis is the online detection of anomalous behaviour. to overcome the curse of dimensionality, we introduce a network surveillance method bringing together network modelling and statistical process control. our approach is to apply multivariate control charts based on exponential smoothing and cumulative sums in order to monitor networks determined by temporal exponential random graph models (tergm). this allows us to account for potential temporal dependence, while simultaneously reducing the number of parameters to be monitored. the performance of the proposed charts is evaluated by calculating the average run length for both simulated and real data. to prove the appropriateness of the tergm to describe network data some measures of goodness of fit are inspected. we demonstrate the effectiveness of the proposed approach by an empirical application, monitoring daily flights in the united states to detect anomalous patterns. the digital information revolution offers a rich opportunity for scientific progress; however, the amount and variety of data available requires new analysis techniques for data mining, interpretation and application of results to deal with the growing complexity. as a consequence, these requirements have influenced the development of networks, bringing their analysis beyond the traditional sociological scope into many other disciplines, as varied as are physics, biology and statistics (cf. amaral et al. 2000; simpson et al. 2013; chen et al. 2019 ). one of the main interests in network study is the detection of anomalous behaviour. there are two types of network monitoring, differing in the treatment of nodes and links: fixed and random network surveillance (cf. leitch et al. 2019) . we concentrate on the modelling and monitoring of networks with randomly generated edges across time, describing a surveillance method of the second type. when talking about anomalies in temporal networks, the major interest is to find the point of time when a significant change happened and, if appropriate, to identify the vertices, edges or graph subsets which considerably contributed to the change (cf. akoglu et al. 2014 ). further differentiating depends on at least two factors: characteristics of the network data and available time granularity. hence, given a particular network to monitor it is worth first defining what is classified as "anomalous". to analyse the network data effectively and plausibly, it is important to account for its complex structure and the possibly high computational costs. our approach to mitigate these issues and simultaneously reflect the stochastic and dynamic nature of networks is to model them applying a temporal random graph model. we consider a general class of exponential random graph models (ergm) (cf. frank and strauss 1986; robins et al. 2007; schweinberger et al. 2020) , which was originally designed for modelling cross-sectional networks. this class includes many prominent random network configurations such as dyadic independence models and markov random graphs, enabling the ergm to be generally applicable to many types of complex networks. hanneke et al. (2010) developed a powerful dynamic extension based on ergm, namely the temporal exponential random graph model (tergm). these models contain the overall functionality of the ergm, additionally enabling time-dependent covariates. 
thus, our monitoring procedure for this class of models allows for many applications in different disciplines which are interested in analysing networks of medium size, such as sociology, political science, engineering, economics and psychology (cf. carrington et al. 2005; ward et al. 2011; das et al. 2013; jackson 2015; fonseca-pedrero 2018) . in the field of change detection, according to basseville et al. (1993) there are three classes of problems: online detection of a change, off-line hypotheses testing and off-line estimation of the change time. our method belongs to the first class, meaning that the change point should be detected as soon as possible after it has occurred. in this case, real-time monitoring of complex structures becomes necessary: for instance, if the network is observed every minute, the monitoring procedure should be faster than one minute. to perform online surveillance for real-time detection, an efficient way is to use tools from the field of statistical process control (spc). spc corresponds to an ensemble of analytical tools originally developed for industrial purposes, which are applied to achieve process stability and to reduce variability (e.g., montgomery 2012) . the leading spc tool for analysis is a control chart, which exists in various forms in terms of the number of variables, the data type and the statistics of interest. for example, some studies monitor network topology statistics applying the cumulative sum (cusum) chart and illustrate its effectiveness, while others present a comparative study of univariate and multivariate ewma for social network monitoring. an overview of further studies is provided by noorossana et al. (2018) . in this paper, we present an online monitoring procedure based on the spc concept, which enables one to detect significant changes in the network structure in real time. the foundations of this approach, together with the description of the selected network model and multivariate control charts, are discussed in section 2. section 3 outlines the simulation study and includes the performance evaluation of the designed control charts. in section 4 we monitor daily flights in the united states and explain the detected anomalies. we conclude with a discussion of outcomes and present several directions for future research. network monitoring is a form of online surveillance procedure to detect deviations from a so-called in-control state, i.e., the state when no unaccountable variation of the process is present. this is done by sequential hypothesis testing over time, which has a strong connection to control charts. in other words, the purpose of control charting is to identify occurrences of unusual deviation of the observed process from a prespecified target (or in-control) process, distinguishing common from special causes of variation (cf. johnson and wichern 2007) . to be precise, the aim is to test the null hypothesis h 0,t : the network observed at time point t is in its in-control state against the alternative h 1,t : the network observed at time point t deviates from its in-control state. in this paper, we concentrate on the monitoring of networks which are modelled by the tergm that is briefly described below. the network (also interchangeably called "graph") is represented by its adjacency matrix $Y := (y_{ij})_{i,j=1,\ldots,n}$, where n represents the total number of nodes. two vertices (or nodes) i, j are adjacent if they are connected by an edge (also called a tie or link). in this case, $y_{ij} = 1$, otherwise $y_{ij} = 0$.
in case of an undirected network, y is symmetric. connections of a node with itself are not applicable to the majority of networks; therefore, we assume that $y_{ii} = 0$ for all $i = 1, \ldots, n$. formally, we define a network model as a collection $\{P_\theta(y), y \in \mathcal{Y} : \theta \in \Theta\}$, where $\mathcal{Y}$ denotes the ensemble of possible networks, $P_\theta$ is a probability distribution on $\mathcal{Y}$, and $\theta$ is a vector of parameters ranging over possible values in the real-valued space $\Theta \subseteq \mathbb{R}^p$ with $p \in \mathbb{N}$ (kolaczyk, 2009 ). this stochastic mechanism determines which of the $n(n-1)$ edges (in the case of directed labelled graphs) emerge, i.e., it assigns probabilities to each of the $2^{n(n-1)}$ graphs (see cannings and penman, 2003) . the ergm functional representation is given by $$P_\theta(y) = \frac{\exp\{\theta^\top s(y)\}}{\kappa(\theta)}, \qquad \kappa(\theta) = \sum_{y' \in \mathcal{Y}} \exp\{\theta^\top s(y')\},$$ where y is the adjacency matrix of an observed graph and $s : \mathcal{Y} \to \mathbb{R}^p$ is a p-dimensional statistic describing the essential properties of the network based on y (cf. frank, 1991; wasserman and pattison, 1996) . there are several types of network terms, including dyadic dependent terms, for example, a statistic capturing transitivity, and dyadic independent terms, for instance, a term describing graph density (morris et al., 2008) . the parameters θ can be defined as the respective coefficients of s(y), which are of considerable interest in understanding the structural properties of a network. they reflect, on the network level, the tendency of a graph to exhibit certain sub-structures relative to what would be expected from a model by chance, or, on the tie level, the probability of observing a specific edge, given the rest of the graph (block et al., 2018) . the latter interpretation follows from the representation of the problem as a log-odds ratio. the normalising constant $\kappa(\theta)$ in the denominator ensures that the sum of probabilities is equal to one, meaning that it includes all possible network configurations. in dynamic network modelling, a random sequence of $Y_t$ for $t = 1, 2, \ldots$ with $Y_t \in \mathcal{Y}$ defines a stochastic process for all t. it is possible that the dimensions of $Y_t$ differ across the time stamps. to conduct surveillance over $Y_t$, we propose to consider only the dynamically estimated parameters of a random graph model in order to reduce computational complexity and to allow for real-time monitoring. in most cases, dynamic network models serve as an extension of well-known static models. similarly, the discrete temporal expansion of the ergm is known as the tergm (cf. hanneke et al., 2010) and can be seen as a further advancement of a family of network models proposed by robins and pattison (2001) . the tergm defines the probability of a network at the discrete time point t both as a function of counted subgraphs in t and by including network terms based on the previous graph observations up to the particular time point t − v, that is $$P_\theta(Y_t \mid Y_{t-1}, \ldots, Y_{t-v}) = \frac{\exp\{\theta^\top s(Y_t, Y_{t-1}, \ldots, Y_{t-v})\}}{\kappa(\theta, Y_{t-1}, \ldots, Y_{t-v})},$$ where v represents the maximum temporal lag, capturing the networks which are incorporated into the θ estimation at t and hence defining the complete temporal dependence of $Y_t$. we assume a markov structure between the observations, meaning $Y_t \perp\!\!\!\perp \{Y_1, \ldots, Y_{t-2}\} \mid Y_{t-1}$ (hanneke et al., 2010) . in this case, the network statistics s(·) include "memory terms" such as dyadic stability or reciprocity (leifeld et al., 2018) . the creation of a meaningful configuration of sufficient network statistics s(y) determines the model's ability to represent and reproduce the observed network close to reality.
its dimension can differ over time; however, we assume that at each time stamp t we have the same network statistics s(·). in general, the selection of terms extensively depends on the field and context, although statistical modelling standards such as the avoidance of linear dependencies among the terms should also be considered (morris, handcock, and hunter, 2008 ). an improper selection can often lead to a degenerate model, i.e., one for which the algorithm does not converge consistently (cf. handcock, 2003; schweinberger, 2011) . in this case, besides fine-tuning the configuration of statistics, one can modify some settings of the estimation procedure for the model parameters, for example, the run time, the sample size or the step length (morris et al., 2008) . currently, there are two widely used estimation approaches: markov chain monte carlo (mcmc) ml estimation and bootstrapped maximum pseudolikelihood (mple) estimation (leifeld et al., 2018) . another possibility would be to add some robust statistics such as geometrically-weighted edgewise shared partnerships (gwesp) (snijders et al., 2006) . however, the tergm is less prone to degeneracy issues, as ascertained by leifeld and cranmer (2019) and hanneke et al. (2010) . regarding the selection of network terms, we assume that most network surveillance studies can reliably estimate beforehand the type of anomalies that may occur. this assumption guides the choice of terms in the models throughout the paper. let p be the number of network statistics which describe the in-control state and can reflect the deviations in the out-of-control state. thus, there are p variables $\hat{\theta}_t = (\hat{\theta}_{1t}, \ldots, \hat{\theta}_{pt})'$, namely the estimates of the network parameters θ at time point t. that is, we apply a moving window approach, where the coefficients are estimated at each time point t using the current and the past z observed networks. moreover, let $F_{\theta_0, \Sigma}$ be the target distribution of these estimates, with $\theta_0 = E_0(\hat{\theta}_1, \ldots, \hat{\theta}_p)'$ being the expected value and $\Sigma$ the respective p × p variance-covariance matrix (montgomery, 2012) . we also assume that the temporal dependence is fully captured by the past z observed networks. thus, $$\hat{\theta}_t \sim \begin{cases} F_{\theta_0, \Sigma}, & t < \tau, \\ F_{\theta, \Sigma}, & t \ge \tau, \end{cases}$$ where τ denotes a change point to be detected and $\theta \ne \theta_0$. if τ = ∞ the network is said to be in control, whereas it is out of control in the case of τ ≤ t < ∞. furthermore, we assume that the estimation precision of the parameters does not change across t, i.e., $\Sigma$ is constant for the in-control and out-of-control state. hence, the monitoring procedure is based on the expected values of $\hat{\theta}_t$. in fact, we can specify the above-mentioned hypotheses as follows: $$H_{0,t}: \; E(\hat{\theta}_t) = \theta_0 \quad \text{against} \quad H_{1,t}: \; E(\hat{\theta}_t) \ne \theta_0.$$ typically, a multivariate control chart consists of the control statistic, depending on one or more characteristic quantities and plotted in time order, and a horizontal line, called the upper control limit (ucl), that indicates the amount of acceptable variation. the hypothesis h 0 is rejected if the control statistic is equal to or exceeds the value of the ucl. hence, to perform monitoring, a suitable control statistic and ucl are needed. subsequently, we discuss several control statistics and present a method to determine the respective ucls. the strength of the multivariate control chart over the univariate control chart is the ability to monitor several interrelated process variables. it implies that the corresponding test statistic should take into account the correlations of the data and be dimensionless and scale-invariant, as the process variables can differ considerably from each other.
the squared mahalanobis distance, which represents the general form of the control statistic, fulfils these criteria and is defined as $$D_t^{(1)} = (\hat{\theta}_t - \theta_0)^\top \Sigma^{-1} (\hat{\theta}_t - \theta_0),$$ being part of the respective "data depth" expression, the mahalanobis depth, that measures a deviation from an in-control distribution (cf. liu, 1995) . hence, $D_t^{(1)}$ maps the p-dimensional characteristic quantity $\hat{\theta}_t$ to a one-dimensional measure. it is important to note that the characteristic quantity at time point t is usually the mean of several samples at t, but in our case, we only observe one network at each instant of time. thus, the characteristic quantity $\hat{\theta}_t$ is the value of the obtained estimates and not the average of several samples. firstly, multivariate cusum (mcusum) charts (cf. woodall and ncube, 1985; joseph et al. (1990) ; ngai and zhang, 2001 ) may be used for network monitoring. one of the widely used versions was proposed by crosier (1988) and is defined as follows: $$C_t = \left[(R_{t-1} + \hat{\theta}_t - \theta_0)^\top \Sigma^{-1} (R_{t-1} + \hat{\theta}_t - \theta_0)\right]^{1/2},$$ $$R_t = \begin{cases} 0, & \text{if } C_t \le k, \\ (R_{t-1} + \hat{\theta}_t - \theta_0)\left(1 - k/C_t\right), & \text{if } C_t > k, \end{cases}$$ given that $R_0 = 0$ and $k > 0$. the respective chart statistic is $$D_t^{(2)} = \left(R_t^\top \Sigma^{-1} R_t\right)^{1/2},$$ and it signals if $D_t^{(2)}$ is greater than or equal to the ucl. certainly, the values of k and the ucl considerably influence the performance of the chart. the parameter k, also known as the reference value or allowance, reflects the variation tolerance, taking into consideration δ, the deviation from the mean (measured in standard deviation units) that we aim to detect. according to page (1954) and crosier (1988) , the chart is approximately optimal if k = δ/2. secondly, we consider multivariate charts based on exponential smoothing (ewma). lowry et al. (1992) proposed a multivariate extension of the ewma control chart (mewma), which is defined as follows: $$L_t = \lambda(\hat{\theta}_t - \theta_0) + (1 - \lambda)L_{t-1},$$ with $0 < \lambda \le 1$ and $L_0 = 0$ (cf. montgomery, 2012) . the corresponding chart statistic is $$D_t^{(3)} = L_t^\top \Sigma_{L_t}^{-1} L_t,$$ where the covariance matrix is defined as $$\Sigma_{L_t} = \frac{\lambda}{2-\lambda}\left[1 - (1-\lambda)^{2t}\right]\Sigma.$$ together with the mcusum, the mewma is an advisable approach for detecting relatively small but persistent changes. however, the detection of large shifts is also possible by setting k or λ high. for instance, in the case of the mewma with λ = 1, the chart statistic coincides with $D_t^{(1)}$. thus, it is equivalent to hotelling's t² control procedure, which is suitable for the detection of substantial deviations. it is worth mentioning that the discussed methods are directionally invariant; therefore, the investigation of the data at the signal time point is necessary if the change direction is of particular interest. if the control statistic is equal to or exceeds the ucl, the chart signals a change. to determine the ucls, one typically assumes that the chart has a predefined (low) probability of false alarms, i.e., signals when the process is in control, or a prescribed in-control average run length arl 0 , i.e., the expected number of time steps until the first signal. to compute the ucls corresponding to arl 0 , most multivariate control charts require a normally distributed target process (cf. johnson and wichern 2007; porzio and ragozini, 2008; montgomery, 2012) . in our case, this assumption would need to be valid for the estimates of the network model parameters. however, while there are some studies on the distributions of particular network statistics (cf. yan and xu, 2013; yan et al., 2016; sambale and sinulis, 2018) , only a few results have been obtained about the distribution of the parameter estimates. primarily, the difficulty in determining the distribution is that the assumption of i.i.d. (independent and identically distributed) data is violated in the ergm case.
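for illustration only, the control statistics defined above can be written as a short numpy sketch. this is not the implementation used in the study; the class names and the streaming interface are our own, and the in-control mean and covariance are assumed to be supplied by the user.

```python
import numpy as np

def hotelling_statistic(theta_hat, theta0, sigma_inv):
    """Squared Mahalanobis distance D1_t (Hotelling-type statistic)."""
    d = theta_hat - theta0
    return float(d @ sigma_inv @ d)

class MCUSUM:
    """Crosier-type multivariate CUSUM with allowance k (sketch)."""
    def __init__(self, theta0, sigma_inv, k):
        self.theta0, self.sigma_inv, self.k = theta0, sigma_inv, k
        self.r = np.zeros_like(theta0, dtype=float)

    def update(self, theta_hat):
        s = self.r + theta_hat - self.theta0
        c = np.sqrt(s @ self.sigma_inv @ s)
        # reset if the cumulative deviation is within the allowance, shrink otherwise
        self.r = np.zeros_like(s) if c <= self.k else s * (1.0 - self.k / c)
        return float(np.sqrt(self.r @ self.sigma_inv @ self.r))   # D2_t

class MEWMA:
    """Multivariate EWMA (Lowry et al. style) with smoothing parameter lam (sketch)."""
    def __init__(self, theta0, sigma, lam):
        self.theta0, self.sigma, self.lam = theta0, sigma, lam
        self.l = np.zeros_like(theta0, dtype=float)
        self.t = 0

    def update(self, theta_hat):
        self.t += 1
        self.l = self.lam * (theta_hat - self.theta0) + (1.0 - self.lam) * self.l
        cov = self.lam / (2.0 - self.lam) * (1.0 - (1.0 - self.lam) ** (2 * self.t)) * self.sigma
        return float(self.l @ np.linalg.inv(cov) @ self.l)        # D3_t
```

each update consumes one vector of parameter estimates and returns the current chart value, which would then be compared against the calibrated ucl.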
in addition, the parameters depend on the choice of the model terms and the network size (he and zheng, 2015) . kolaczyk and krivitsky (2015) proved asymptotic normality for the ml estimates in a simplified context of the ergm, pointing out the necessity to establish a deeper understanding of the distributional properties of parameter estimates. thus, we do not rely on any distributional assumption, but determine the ucls via monte carlo simulations in section 3.2. to verify the applicability and effectiveness of the discussed approach, we design a simulation study followed by the surveillance of real-world data with the goal of obtaining some insights into its temporal development. in practice, the in-control parameters $\theta_0$ and $\Sigma$ are usually unknown and therefore have to be estimated. thus, one subdivides the sequence of networks into phase i and phase ii. in phase i, the process must coincide with the in-control state. thus, the true in-control parameters $\theta_0$ and $\Sigma$ can be estimated by the sample mean vector $\bar{\theta}$ and the sample covariance matrix S of the estimated parameters $\hat{\theta}_t$ in phase i. using these estimates, the ucl is determined via simulations of the in-control networks, as we will show in the following part. it is important that phase i replicates the natural behaviour of a network, so that if the network constantly grows, it is vital to consider this aspect in phase i. similarly, if the type of network tends to remain unchanged in terms of added connections or topological structure, this fact should be captured in phase i for reliable estimation and later network surveillance. after the necessary estimators of $\theta_0$, $\Sigma$ and the ucl are obtained, the calibrated control chart is applied to the actual data in phase ii. in the specific case of constantly growing or topologically changing networks, we recommend recalibrating the control chart after the length of arl 0 to guarantee a trustworthy detection of outliers. to be able to compute $\bar{\theta}$ and S, we need a certain number of in-control networks. for this purpose, we generate 2300 temporal graph sequences of desired length t < τ, where each graph consists of n = 100 nodes. the parameter τ defines the time stamp when an anomalous change is implemented. the simulation of synthetic networks is based on the markov chain principle: in the beginning, a network called the "base network" is simulated by applying an ergm with predefined network terms, so that it is possible to control the "network creation" indirectly. in our case, we select three network statistics, namely an edge term, a triangle term and a parameter that defines asymmetric dyads. subsequently, a fraction φ of the elements of the adjacency matrix is randomly selected, and their values are resampled according to transition probabilities $m_{ij,0}$, where $m_{ij,0}$ denotes the probability of a transition from state i to state j in the in-control state. next, we need to guarantee that the generated samples of networks behave according to the requirements of phase i, i.e., capturing only the usual variation of the target process. for this purpose, we can exploit markov chain properties and calculate the steady-state equilibrium vector π, as the expected proportions of non-edges and edges are given by π. using eigenvector decomposition, we find the steady state to be π = (0.8, 0.2). consequently, the expected number of edges in the graph in its steady state is 1980. there are several possibilities to guarantee the generation of appropriate networks.
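as an aside, the three statistics chosen for the base network (edges, triangles and asymmetric dyads) can be computed directly from a binary adjacency matrix. note that ergm software defines its terms in its own specific way, so the following python sketch is only a simplified stand-in for intuition, not the statistics used for estimation in the paper.

```python
import numpy as np

def network_statistics(y):
    """Edge, triangle and dyad counts of a directed binary adjacency matrix."""
    y = np.array(y, dtype=int)                      # copy so the caller's matrix is untouched
    np.fill_diagonal(y, 0)                          # y_ii = 0, no self-loops
    edges = int(y.sum())                            # number of directed ties
    mutual = int((y * y.T).sum() // 2)              # reciprocated dyads
    asymmetric = int(((y + y.T) == 1).sum() // 2)   # dyads linked in one direction only
    u = ((y + y.T) > 0).astype(int)                 # underlying undirected graph
    triangles = int(np.trace(np.linalg.matrix_power(u, 3)) // 6)
    return {"edges": edges, "triangles": triangles,
            "mutual": mutual, "asymmetric": asymmetric}
```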
in our case, we simulate 400 networks in a burn-in period, such that the in-control state of phase i starts at t = 401. nevertheless, the network density is only one of the aspects defining the in-control process, as the temporal development and the topology are also involved in the network creation. each network $Y_t$ is simulated from the network $Y_{t-1}$ by repeating the steps described above. after the generation stage, the coefficients of the network statistics and of an additional term which describes the stability of both edges and non-edges over time with v = 1 are estimated by applying a tergm with a certain window size z. the chosen estimation method is the bootstrap mple, which is appropriate for handling a relatively large number of nodes and time points (leifeld et al., 2018) . eventually, we calibrate the different control charts by computing $\bar{\theta}$, S, and the respective ucl via the bisection method; table 1 summarises the resulting calibration for the two window sizes z ∈ {7, 14}. in the next step, we analyse the performance of the proposed charts in terms of their detection speed. for this reason, we generate samples from phase ii, where t ≥ τ. the focus is on the detection of mean shifts, which are driven by an anomalous change in the following three parameters: the vector of coefficients related to the network terms $\hat{\theta}_t$, the fraction of randomly selected adjacency matrix entries φ, and the transition matrix M. hence, we subdivide these scenarios into three different anomaly types which are briefly described in the flow chart presented in figure 1 . we define a type 1 anomaly as a persistent change in the values of M, that is, there is a transition matrix $M_1 \ne M_0$ when t ≥ τ. furthermore, we consider anomalies of type 2 by introducing a new value φ 1 in the generation process when t ≥ τ. anomalies of type 3 differ from the previous two as they represent a "point change": the abnormal behaviour occurs only at a single point of time, but its outcome affects the further development of the network. we recreate this type of anomaly by converting a fraction ζ of asymmetric edges into mutual links. this process happens at time point τ only. afterwards, the new networks are created similarly to phase i by applying M 0 and φ 0 up until the anomaly is detected. all cases of different magnitude are summarised in table 3 . as a performance measure, we calculate the conditional expected delay (ced) of detection, conditional on a false signal not having occurred before the change point. for the anomalies of types 1 and 2, the reference/smoothing parameter should be chosen according to the expected shift size, so that larger shifts (cases 2.2 and 2.3) are detected quickly. for changes of the proportion of mutual edges, the anomalies of type 3, the charts behave differently. first of all, the mewma chart performs best in all cases except 3.1 and 3.2 with z = 14. however, hotelling's chart performs clearly worse in the first two cases when the window size is shorter. thus, we would recommend choosing λ = 0.1 if the change in the network topology is relatively small, as in case 3.1. in the opposite case of a larger change, λ could be chosen higher depending on the expected size of the shift, so that the control statistic also incorporates previous values. the disadvantage of both approaches is that small and persistent changes are not detected quickly when the parameters k or λ are not optimally chosen. for example, in figure 2 , we can notice that the ced slightly exceeds the arl 0 , reflecting poor performance.
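a minimal sketch of the ucl calibration described above, assuming a user-supplied function simulate_chart() that returns one simulated in-control sequence of chart statistics (for example, mewma values computed on networks generated as in phase i); the bisection exploits the fact that the in-control arl increases with the ucl. the function names and the stopping rule are illustrative assumptions.

```python
import numpy as np

def run_length(chart_values, ucl):
    """Time of the first signal in one simulated in-control sequence."""
    for t, d in enumerate(chart_values, start=1):
        if d >= ucl:
            return t
    return len(chart_values)                 # censored: no signal observed

def average_run_length(simulate_chart, ucl, reps=500):
    return float(np.mean([run_length(simulate_chart(), ucl) for _ in range(reps)]))

def calibrate_ucl(simulate_chart, target_arl0, lo, hi, reps=500, iters=20):
    """Bisection on the UCL until the simulated in-control ARL matches ARL0."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if average_run_length(simulate_chart, mid, reps) < target_arl0:
            lo = mid                         # too many false alarms -> raise the limit
        else:
            hi = mid
    return 0.5 * (lo + hi)
```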
however, a careful selection of the parameters and the window size can overcome this problem of slow detection. to summarise, the effectiveness of the presented charts in detecting structural changes depends significantly on the accurate estimation of the anomaly size one aims to detect. thus, if the information on the possible change is not available or not reliable, it can be effective to apply paired charts and benefit from the strengths of each of them to detect varying types and sizes of anomalies, ensuring that no anomalies are missed. to demonstrate the applicability of the described method, we monitor the daily flight data of the united states. due to the travel restrictions during the observation period, many journeys are only possible through territories which allow travelling. that means, instead of having a direct journey from one geographical point to another, the route currently passes through several locations, which can be interpreted as nodes. thus, the topology of the graph has changed: instead of directed mutual links, the number of intransitive triads and asymmetric links starts to increase significantly. we can incorporate both terms, together with the edge term and a memory term (v = 1), and expect the estimates of the respective coefficients belonging to the first two statistics to be close to zero or strongly negative in the in-control case. initially, we need to decide which data are suitable to define observations coming from phase i. from the estimates $\hat{\theta}_t$ of the tergm, described by a series of boxplots in figure 6 , we can observe extreme changes in the values. before proceeding with the analysis, it is important to evaluate whether a tergm fits the data well. for each of the years, we randomly selected one period of length z and simulated 500 networks based on the parameter estimates from each of the corresponding networks. to select appropriate control charts, we need to take into consideration the specifics of the flight network data. firstly, it is common to have 3-4 travel peaks per year around holidays, which are not explicitly modelled, so that we can detect these changes as verifiable anomalous patterns. it is worth noting that one could account for such seasonality by including nodal or edge covariates. secondly, as we aim to detect considerable deviations from the in-control state, we are more interested in sequences of signals (in the corresponding control charts, the horizontal red line corresponds to the upper control limit and the red points to the signals that occurred). thus, we have chosen k = 1.5 for the mcusum and λ = 0.9 for the mewma chart. the target arl 0 is set to 100 days; therefore, we could expect roughly 3.65 in-control signals per year by construction of the charts. to identify smaller and more specific changes in the daily flight data of the us, one could also integrate nodal and edge covariates which would refer to further aspects of the network. alternatively, control charts with smaller k and λ can be applied. statistical methods can be remarkably powerful for the surveillance of networks. however, due to the complex structure and possibly large size of the adjacency matrix, traditional tools for multivariate process control cannot be applied directly; the network's complexity must be reduced first. for instance, this can be done by statistical modelling of the network. the choice of the model is crucial as it determines the constraints and simplifications of the network, which later influence the types of changes we are able to detect. in this paper, we show how multivariate control charts can be used to detect changes in tergm networks. the proposed methods can be applied in real time.
this general approach is applicable to various types of networks in terms of edge direction and topology, and also allows for the integration of nodal and edge covariates. additionally, we make no distributional assumptions and account for temporal dependence. the performance of our procedure is evaluated for different anomalous scenarios by comparing the ced of the calibrated control charts. according to the classification and explanation of anomalies provided by ranshous et al. (2015) , the surveillance method presented in this paper is applicable to event and change point detection in temporal networks. the difference between these problems lies in the duration of the abnormal behaviour: while change points indicate a time point from which the anomaly persists until the next change point, events indicate short-term incidents, after which the network returns to its natural state. eventually, we illustrated the applicability of our approach by monitoring daily flights in the united states. both control charts were able to detect the beginning of the lock-down period due to the covid-19 pandemic. the mewma chart signalled a change just two days after a level 4 "no travel" warning was issued. despite the benefits of the tergm, such as the incorporation of the temporal dimension and the representation of the network in terms of its sufficient statistics, there are several considerable drawbacks. besides the difficulty of determining a suitable combination of the network terms, the model is not suitable for networks of large size (block et al., 2018) . furthermore, the temporal dependency statistics in the tergm depend on the selected temporal lag and the size of the time window over which the data are modelled (leifeld and cranmer, 2019) . thus, the accurate modelling of the network strongly relies on the analyst's knowledge about its nature. a helpful extension of the approach would be the implementation of the separable temporal exponential random graph model (stergm), which subdivides the network changes into two distinct streams (cf. krivitsky and handcock, 2014; fritz et al., 2020) . in this case, it could be possible to monitor the dissolution and formation of links separately, so that the interpretation of changes in the network would become clearer. regarding the multivariate control charts, there are also some aspects to consider. referring to montgomery (2012) , multivariate control charts perform well if the number of process variables is not too large, usually up to 10. also, a possible extension of the procedure is to design a monitoring process in which the values of σ can vary between the in-control and out-of-control states. whether this factor would beneficially enrich the surveillance remains open for future research. in our case, we did not rely on any distributional assumptions about the parameters, but used simulation methods to calibrate the charts. hence, the further development of adaptive control charts with different characteristics is interesting, as they could remarkably improve the performance of the anomaly detection (cf. sparks and wilson, 2019).
references (titles only):
graph based anomaly detection and description: a survey
classes of small-world networks
detection of abrupt changes: theory and application
change we can believe in: comparing longitudinal network models on consistency, interpretability and predictive power
models of random graphs and their applications
models and methods in social network analysis
tail event driven networks of sifis
multivariate generalizations of cumulative sum quality-control schemes
the topological structure of the odisha power grid: a complex network analysis
a statistical approach to social network monitoring
network analysis in psychology
statistical analysis of change in networks
markov graphs
tempus volat, hora fugit: a survey of tie-oriented dynamic network models in discrete and continuous time
assessing degeneracy in statistical models of social networks
discrete temporal models of social networks
glmle: graph-limit enabled fast computation for fitting exponential random graph models to large social networks
performance evaluation of ewma and cusum control charts to detect anomalies in social networks using average and standard deviation of degree measures
goodness of fit of social network models
the past and future of network analysis in economics
applied multivariate statistical analysis, 6th edn
comparisons of multivariate cusum charts
on assessing the performance of sequential procedures for detecting a change
statistical analysis of network data
on the question of effective sample size in network modeling: an asymptotic inquiry. statistical science
a separable model for dynamic networks
a theoretical and empirical comparison of the temporal exponential random graph model and the stochastic actor-oriented model
temporal exponential random graph models with btergm: estimation and bootstrap confidence intervals
toward epidemic thresholds on temporal networks: a review and open questions
control charts for multivariate processes
analyzing dynamic change in social network based on distribution-free multivariate process control method
a multivariate exponentially weighted moving average control chart
detecting change in longitudinal social networks
statistical quality control
specification of exponential-family random graph models: terms and computational aspects
multivariate cumulative sum control charts based on projection pursuit
an overview of dynamic anomaly detection in social networks via control charts
continuous inspection schemes
multivariate control charts from a data mining perspective. recent advances in data mining of enterprise data
anomaly detection in dynamic networks: a survey
random graph models for temporal processes in social networks
an introduction to exponential random graph (p*) models for social networks
monitoring of social network and change detection by applying statistical process: ergm
change point detection in social networks using a multivariate exponentially weighted moving average chart
logarithmic sobolev inequalities for finite spin systems and applications
instability, sensitivity, and degeneracy of discrete exponential families
exponential-family models of random graphs: inference in finite-, super-, and infinite population scenarios
analyzing complex functional brain networks: fusing statistics and network science to understand the brain
new specifications for exponential random graph models
monitoring communication outbreaks among an unknown team of actors in dynamic networks
network analysis and political science
logit models and logistic regressions for social networks: i. an introduction to markov graphs and p*
modeling and detecting change in temporal networks via the degree corrected stochastic block model
multivariate cusum quality-control procedures
a central limit theorem in the β-model for undirected random graphs with a diverging number of vertices
asymptotics in directed exponential random graph models with an increasing bi-degree sequence
key: cord-253711-a0prku2k authors: mao, liang; yang, yan title: coupling infectious diseases, human preventive behavior, and networks – a conceptual framework for epidemic modeling date: 2011-11-26 journal: soc sci med doi: 10.1016/j.socscimed.2011.10.012 sha: doc_id: 253711 cord_uid: a0prku2k human-disease interactions involve the transmission of infectious diseases among individuals and the practice of preventive behavior by individuals. both infectious diseases and preventive behavior diffuse simultaneously through human networks and interact with one another, but few existing models have coupled them together. this article proposes a conceptual framework to fill this knowledge gap and illustrates the model establishment. the conceptual model consists of two networks and two diffusion processes. the two networks include: an infection network that transmits diseases and a communication network that channels inter-personal influence regarding preventive behavior. both networks are composed of the same individuals but different types of interactions. this article further introduces modeling approaches to formulize such a framework, including the individual-based modeling approach, network theory, disease transmission models and behavioral models. an illustrative model was implemented to simulate a coupled-diffusion process during an influenza epidemic. the simulation outcomes suggest that the transmission probability of a disease and the structure of the infection network have profound effects on the dynamics of coupled-diffusion. the results imply that current models may underestimate disease transmissibility parameters, because human preventive behavior has not been considered. this issue calls for a new interdisciplinary study that incorporates theories from epidemiology, social science, behavioral science, and health psychology. despite outstanding advances in medical science, infectious diseases remain a major cause of death in the world, claiming millions of lives every year (who, 2002).
particularly in the past decade, emerging infectious diseases have obtained remarkable attention due to worldwide pandemics of severe acute respiratory syndrome (sars), bird flu and new h1n1 flu. although vaccination is a principal strategy to protect individuals from infection, new vaccines often need a long time to develop, test, and manufacture (stohr & esveld, 2004) . before sufficient vaccines are available, the best protection for individuals is to adopt preventive behavior, such as wearing facemasks, washing hands frequently, taking pharmaceutical drugs, and avoiding contact with sick people, etc. (centers for disease control and prevention, 2008) . it has been widely recognized that both infectious diseases and human behaviors can diffuse through human networks (keeling & eames, 2005; valente, 1996) . infectious diseases often spread through direct or indirect human contacts, which form infection networks. for example, influenza spreads through droplet/physical contacts among individuals, and malaria transmits via mosquitoes between human hosts. human behavior also propagates through inter-personal influence that fashions communication networks. this is commonly known as the 'social learning' or 'social contagion' effect in behavioral science, i.e., people can learn by observing behaviors of others and the outcomes of those behaviors (hill, rand, nowak, christakis, & bergstrom, 2010; rosenstock, strecher, & becker, 1988 ). in the current literature, models of disease transmission and behavioral diffusion have been developed separately for decades, both based on human networks (deffuant, huet, & amblard, 2005; keeling & eames, 2005; valente, 1996; watts & strogatz, 1998) . few efforts, however, have been devoted to integrating infectious diseases and human behaviors together. in reality, when a disease breaks out in a population, it is natural that individuals may voluntarily adopt some preventive behavior to respond, which in turn limits the spread of disease. failing to consider these two interactive processes, current epidemic models may under-represent human-disease interactions and bias policy making in public health. this article aims to propose a conceptual framework that integrates infectious diseases, human preventive behavior, and networks together. the focus of this article is on issues that arise in establishing a conceptual framework, including basic principles, assumptions, and approaches for model formulization. the following section (section 2) describes the conceptual framework and basic assumptions, which abstract essential aspects of a disease epidemic. the third section discusses approaches to formulize the model framework into a design. the fourth illustrates a computing model upon various human network structures and compares the simulation results. the last section concludes the article with implications. the conceptual model consists of two networks and two diffusion processes (fig. 1) . the two networks include an infection network that transmits disease agents (dark dash lines), and a communication network that channels inter-personal influence regarding preventive behavior (gray dash lines). both networks are composed of same individuals but different types of interactions. these two networks could be non-overlapping, partially or completely overlapping with one another. the two diffusion processes refer to the diffusion of infectious diseases (dark arrows) and that of preventive behavior (gray arrows) through the respective network. as illustrated in fig. 
1 , if individual #1 is initially infected, the disease can be transmitted to individual #2 and #3, and then to individual #4, following the routes of infection network. meanwhile, individual #2 may perceive the risk of being infected from individual #1, and then voluntarily adopt preventive behavior for protection, known as the effects of 'perceived risks' (becker, 1976) . further, the preventive behavior of individual #2 may be perceived as a 'social standard' by individual #4 and motivate him/her toward adoption, i.e., the 'social contagion'. in such a manner, the preventive behavior diffuses on the communication network through inter-personal influence. during an epidemic, these two diffusion processes take place simultaneously and interact in opposite directions. the diffusion of diseases motivates individuals to adopt preventive behavior, which, in turn, limits the diffusion of diseases. this two-network two diffusion framework is dubbed as a 'coupled diffusion' in the subsequent discussion. the conceptual framework entails five assumptions. first, individuals differ in their characteristics and behaviors, such as their infection status, adoption status, and individualized interactions. second, both infection and communication networks are formed by interactions among individuals. third, the development of infectious diseases follows disease natural history, such as the incubation, latent, and infectious periods. fourth, individuals voluntarily adopt preventive behavior, dependent on their own personality, experiences, and inter-personal influence from family members, colleagues, as well as friends (glanz, rimer, & lewis, 2002) . fifth and lastly, the infection status of surrounding people or their behavior may motivate individuals to adopt preventive behavior, which then reduces the likelihood of infection. of the five assumptions, the first two provide networks as a basis for modeling. the third and fourth assumptions are relevant to the two diffusion processes, respectively. the last assumption represents the interactions between the two processes. corresponding to the five assumptions, this article introduces a number of approaches to represent individuals, networks, infectious diseases, and preventive behavior, as four model components, and depicts the relationships between the four. the first model assumption requires a representation of discrete individuals, their unique characteristics and behaviors. this requirement can be well addressed by an individual-based modeling approach. in the last decade, this modeling approach has gained momentum in the research community of both epidemiology and behavioral science (judson, 1994; koopman & lynch, 1999) . specifically, the individual-based approach views a population as discrete individuals, i.e., every individual is a basic modeling unit and has a number of characteristics and behaviors. the characteristics indicate states of individuals, e.g., the infection status, adoption status, and the number of contacts, while the behaviors change these states, e.g., receiving infection and adopting preventive behavior. by simulating at an individual level, this approach allows to understand how the population characteristics, such as the total number of infections and adopters, emerge from collective behaviors of individuals (grimm & railsback, 2005) . 
from an implementation perspective, the characteristics and behaviors of individuals can be easily accommodated by object-oriented languages, a mainstream paradigm of programming technologies. various tools are also available to facilitate the design and implementation of the individual-based approach, such as netlogo and repast (robertson, 2005) . with regard to the second assumption, both the infection and communication networks can be abstracted as a finite number of nodes and links. nodes represent individuals and links represent interactions among individuals. the network structure is compatible with the aforementioned individual-based approach, in that the individual nodes directly correspond to the basic modeling units, while links can be treated as a characteristic of individuals. interactions between individuals (through links) can be represented as behaviors of individuals. to be realistic in modeling, both networks can be generated to fit observed characteristics and structures of real-world networks. important characteristics of networks include: the number of links attached to a node (the node degree), the minimum number of links between any pair of nodes (the path length), the ratio between the existing number of links and the maximum possible number of links among certain nodes (the level of clustering), and so on (scott, 2000) . particularly for human networks of social contacts, empirical studies showed that the average node degree often varies from 10 to 20, dependent on occupation, race, geography, etc. (edmunds, kafatos, wallinga, & mossong, 2006; fu, 2005) . the average path length was estimated to be around 6, popularly known as the 'six degrees of separation' (milgram, 1967) . the level of clustering has typical values in the range of 0.1–0.5 (girvan & newman, 2002) . besides these characteristics, studies on human networks have also disclosed two generic structures: "small-world" and "scale-free" structures. the "small-world" structure is named after the 'small-world' phenomenon that people are all connected by short chains of acquaintances (travers & milgram, 1969) . theoretically, the small-world structure is a transition state between regular networks and random networks (watts & strogatz, 1998) . regular networks represent one extreme, in which all nodes are linked to their nearest neighbors, resulting in highly clustered networks. random networks are the other extreme, in which all nodes are randomly linked with each other regardless of their closeness, resulting in short path lengths. a typical small-world structure has characteristics from both extremes, i.e., most nodes are directly linked to others nearby (highly clustered), but can be indirectly connected to any distant node through a few links (short path lengths). the "scale-free" structure has also been commonly observed in social, biological, disease, and computer networks, etc. (cohen, erez, ben-avraham, & havlin, 2001; jeong, tombor, albert, oltvai, & barabási, 2000; liljeros, edling, amaral, stanley, & aaberg, 2001) . it depicts a network with highly heterogeneous node degrees, whose distribution follows a power-law decay function, $P(k) \sim k^{-\gamma}$ (k denotes the node degree and empirically 2 < γ < 3). in other words, a few individuals have a significantly large number of links, while the rest of the individuals only have a few (albert, jeong, & barabasi, 2000) .
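for illustration, the two generic structures and the characteristics discussed above can be generated and inspected with networkx in python. the parameter values below (n = 5000, mean degree 12, rewiring probability 0.05) mirror the illustrative model later in the article; the scale-free generator shown here yields an exponent close to 3, so other exponents would require a configuration-model approach that is not shown. computing the average path length on 5000 nodes may take a while.

```python
import networkx as nx

n, k = 5000, 12                                   # individuals and mean number of contacts

# small-world: ring lattice with rewiring probability p (kept connected)
sw = nx.connected_watts_strogatz_graph(n, k, p=0.05)
# scale-free via preferential attachment (mean degree roughly 2*m)
sf = nx.barabasi_albert_graph(n, k // 2)

for name, g in (("small-world", sw), ("scale-free", sf)):
    degrees = [d for _, d in g.degree()]
    print(name,
          round(sum(degrees) / n, 2),                      # average node degree
          round(nx.average_clustering(g), 3),              # level of clustering
          round(nx.average_shortest_path_length(g), 2))    # average path length
```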
all of these observed characteristics and structures can be used to calibrate the modeled networks, which then serve as a reliable basis to simulate the coupled-diffusion process. in epidemiology, the development of infectious diseases has been characterized by a series of infection statuses, events, and periods, often referred to as the natural history of diseases (gordis, 2000) . the progress of an infectious disease often starts with a susceptible individual. after having contact with an infectious individual, this susceptible individual may receive disease agents and develop infection based on a transmission probability. the receipt of infection triggers a latent period, during which the disease agents develop internally in the body and are not emitted. the end of the latent period initiates an infectious period, in which this individual is able to infect other susceptible contacts and may manifest disease symptoms. after the infectious period, this individual either recovers or dies from the disease. among these disease characteristics, the transmission probability is critical for bridging infectious diseases to the other model components: individuals, networks, and preventive behavior. this probability controls the chance that the disease agents can be transmitted between individuals through network links. meanwhile, the reduction of transmission probability reflects the efficacy of preventive behavior. the individual-based modeling approach enables the representation of disease progress for each individual. the infection statuses, periods, and transmission probability per contact can be associated with individuals as their characteristics, while infection events (e.g., receipt of infection and emission of agents) can be modeled as behaviors of individuals. each individual has one of four infection statuses at a time point, either susceptible, latent, infectious, or recovered (kermack & mckendrick, 1927) . the infection status changes when infection events are triggered by behaviors of this individual or surrounding individuals. the simulation of disease transmission often starts with an introduction of a few infectious individuals (infectious seeds) into a susceptible population. then, the first generation of infections can be identified by searching susceptible contacts of these seeds. stochastic methods, such as the monte carlo method, could be used to determine who will be infected or not. subsequently, the first generation of infections may further infect their contacts, and over time leads to a cascade diffusion of disease over the network. to parameterize the simulation, the transmission probability of a disease, the lengths of latent period and infectious period can be derived from the established literature or from observational disease records. like other human behaviors, the adoption of preventive behavior depends on the individual's own characteristics (e.g., knowledge, experience, and personal traits) and inter-personal influence from surrounding individuals (e.g., family supports and role model effects) (glanz et al., 2002) . because individuals vary in their willingness to adopt, human behaviors often diffuse from a few early adopters to the early majority, and then over time throughout the social networks (rogers, 1995) . 
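returning to the disease-transmission component described above (infection statuses, latent and infectious periods, and a per-contact transmission probability), the following python sketch advances one day of a stochastic simulation. the function, data structures and parameter values are placeholders for illustration, not taken from the article.

```python
import random

LATENT_DAYS, INFECTIOUS_DAYS, P_TRANSMIT = 1, 4, 0.05   # placeholder parameters

def daily_transmission(status, days_in_state, contacts):
    """One daily time step: per-contact Monte Carlo transmission and state progression."""
    new_infections = []
    for i, s in status.items():
        if s != "infectious":
            continue
        for j in contacts[i]:
            if status[j] == "susceptible" and random.random() < P_TRANSMIT:
                new_infections.append(j)                  # Monte Carlo draw per contact
    for j in new_infections:
        status[j], days_in_state[j] = "latent", 0
    # progress latent -> infectious -> recovered according to the natural history
    for i, s in status.items():
        if s == "latent" and days_in_state[i] >= LATENT_DAYS:
            status[i], days_in_state[i] = "infectious", 0
        elif s == "infectious" and days_in_state[i] >= INFECTIOUS_DAYS:
            status[i], days_in_state[i] = "recovered", 0
        else:
            days_in_state[i] += 1
```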
a number of individual-based models have been developed by sociologists and geographers to represent such behavioral diffusion processes, e.g., the mean-information-field (mif) model (hägerstrand, 1967) , the threshold model (granovetter, 1978) , the relative agreement model (deffuant et al., 2005) , etc. the mif model populates individuals on a regular network (or a grid), and assumes that a behavior diffuses through the 'word-of-mouth' communication between an adopter and his/ her neighbors. the mif is a moving window that defines the size of neighborhood and the likelihood of human communications to every adopter. the simulation centers the mif on every adopter and uses the monte carlo method to identify a new generation of adopters (hägerstrand, 1967) . the threshold model assumes that individuals observe their surroundings and adopt a behavior based on a threshold effect (granovetter, 1978; valente, 1996) . the threshold is the proportion of adopters in an individual's social contacts necessary to convince this individual to adopt. the behavioral diffusion begins with a small number of adopters, and spreads from the low-threshold population to the high-threshold population. the recently proposed relative agreement model assumes that every individual holds an initial attitude, which is a value range specified by a mean value, maximum and minimum. based on the value ranges, individuals' attitudes are categorized as positive, neutral, and negative. individuals communicate through a social network, and influence their attitudes (value ranges) reciprocally according to mathematical rules of relative agreement. if individuals can hold positive attitudes for a certain time period, they will decide to adopt a behavior (deffuant et al., 2005) . due to the individual-based nature of all these models, they can be easily incorporated under the proposed conceptual framework. to further discuss the individual-based design of behavioral models, this research chose the threshold model for illustrations. in terms of complexity, the threshold model lies midway between the mif model and the relative agreement model, and its parameters can be feasibly estimated through social surveys. the mif model has been criticized for its simplicity in that it assumes an immediate adoption after a communication and oversimplifies the decision process of individuals (shannon, bashshur, & metzner, 1971) . by contrast, the relative agreement model is too sophisticated: many parameters are difficult to estimate, for example, the ranges of individual attitudes. the threshold model can be formulized as follows so as to become an integral part of the coupled-diffusion framework. first, individuals are assumed to spontaneously evaluate the proportion of adopters among their contacts, and perceive the pressure of adoption. once the perceived pressure reaches a threshold (hereinafter called the threshold of adoption pressure), an individual will decide to adopt preventive behavior. second, in order to relate the preventive behavior to infectious diseases, individuals also evaluate the proportion of infected individuals (with disease symptoms) among their contacts, and perceive the risks of infection. once the perceived risk reaches another threshold (hereinafter called the threshold of infection risk), an individual will also adopt preventive behavior. these two threshold effects can be further formulized as three characteristics and two behaviors of individuals. 
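the two threshold effects formulated above can be sketched in a few lines of python. the function below performs a synchronous update: it first collects all non-adopters whose perceived infection risk or adoption pressure reaches their own threshold, and only then marks them as adopters, so that adoptions within the same time step do not influence each other. names and data structures are illustrative assumptions, not the authors' code.

```python
def update_adoption(adopter, symptomatic, contacts, risk_threshold, pressure_threshold):
    """One update of the threshold model over the communication network."""
    new_adopters = []
    for i, already in adopter.items():
        if already or not contacts[i]:
            continue
        n = len(contacts[i])
        perceived_risk = sum(symptomatic[j] for j in contacts[i]) / n       # infected contacts
        perceived_pressure = sum(adopter[j] for j in contacts[i]) / n       # adopting contacts
        if (perceived_risk >= risk_threshold[i]
                or perceived_pressure >= pressure_threshold[i]):
            new_adopters.append(i)
    for i in new_adopters:
        adopter[i] = True
    return new_adopters
```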
the three characteristics include an adoption status (adopter or non-adopter) and two individualized thresholds toward adoption. the two behaviors represent the individual's evaluation of adoption pressure and infection risk from surrounding contacts, which in turn determine the adoption status. the individualized thresholds toward adoption reflect personal characteristics of individuals, while the evaluation behaviors represent the inter-personal influence between individuals. to build a working model, the individualized thresholds toward adoption are best estimated by health behavior surveys, as illustrated below. based on the discussion above, the conceptual framework (fig. 1) can be transformed into a formative design with four model components and their relationships (fig. 2). individuals are the building blocks of the proposed model, and their interactions compose the networks at the core of the model. through the infection network, individuals may receive infection from others and have their infection status changed, propelling the diffusion of the disease. meanwhile, individuals may perceive risks and pressure through the communication network and gradually adopt preventive behavior, resulting in the behavioral diffusion. the adoption of preventive behavior reduces the disease transmission probability, thus controlling and preventing disease transmission. in this manner, the diffusion of the disease and of preventive behavior in a population are coupled together. to illustrate the proposed coupled-diffusion model, an influenza epidemic was simulated in a hypothetical population of 5000 individuals (n = 5000), each with the characteristics and behaviors described in fig. 2. influenza was chosen because it is common and readily transmissible between individuals. the simulation simply assumes that the population is closed, i.e., no births, deaths, or migrations. with regard to the network component, the average number of links per individual was set to 12, reasonably assuming that an individual on average has contact with 2 family members and 10 colleagues. for the purpose of sensitivity analysis, the illustrative model allowed the disease and communication networks to take either a small-world (sw) structure or a scale-free (sf) structure. the generation of sw structures started with a regular network in which all individuals were linked to their nearest neighbors; each individual's existing links were then rewired with a given probability to randomly selected individuals (watts & strogatz, 1998). the rewiring probability p ranged from 0 to 1 and governed the clustering level and average path lengths of the resultant networks (fig. 3a). the sf structures were created by a preferential attachment algorithm, which linked each new individual preferentially to those who already have a large number of contacts (pastor-satorras & vespignani, 2001). this algorithm produces a power-law degree distribution, p(k) ∼ k^(−γ) (k is the node degree), with various exponents γ (fig. 3b). based on fig. 3a and b, the rewiring probabilities p were set to 0.005, 0.05, and 0.5 to represent typical regular, small-world, and random networks, respectively (fig. 3c-e). the exponent γ was set to 3, 5, and 7 to represent three scale-free networks with high, medium, and low levels of node heterogeneity (fig. 3f-h). a sensitivity analysis was performed to examine every possible pairing of infection-network and communication-network structures
as network combinations (3 parameter values for the infection network × 3 for the communication network × 4 structure pairings = 36 combinations in total), where the first element of each combination indicates the structure of the infection network and the second specifies the structure of the communication network. to simulate the diffusion of influenza, the latent period and infectious period were specified as 1 day and 4 days, respectively, based on published estimates (heymann, 2004). the transmission probability per contact was varied from 0.01 to 0.1 (in increments of 0.01) to test its effects on the coupled-diffusion processes. 50% of infected individuals were assumed to manifest symptoms, following the assumption made by ferguson et al. (2006); only these symptomatic individuals could be perceived by their surrounding individuals as infection risks. recovered individuals were deemed immune to further infection for the rest of the epidemic. with respect to the diffusion of preventive behavior, the use of flu antiviral drugs (e.g., tamiflu and relenza) was taken as a typical example because its efficacy is more conclusive than that of other preventive behaviors, such as hand washing and facemask wearing. for symptomatic individuals, the probability of taking antiviral drugs was set to 75% (mcisaac, levine, & goel, 1998; stoller, forster, & portugal, 1993), and the consequent probability of infecting others was set to be reduced by 40% (longini, halloran, nizam, & yang, 2004). susceptible individuals may also take antiviral drugs because of perceived infection risk or adoption pressure; if they use antiviral drugs, their probability of being infected was set to be reduced by 70% (hayden, 2001). the key to simulating the diffusion of preventive behavior was to estimate the thresholds of infection risk and adoption pressure for individuals. a health behavior survey was conducted online for one month (march 12 to april 12, 2010) to recruit participants. voluntary participants were invited to answer two questions: 1) "suppose you have 10 close contacts, including household members, colleagues, and close friends. after how many of them get influenza would you consider using flu drugs?", and 2) "suppose you have 10 close contacts, including household members, colleagues, and close friends. after how many of them start to use flu drugs would you consider using flu drugs, too?". the first question was designed to estimate the threshold of infection risk, while the second was for the threshold of adoption pressure. the survey ended up with 262 respondents out of 273 participants (a 96% response rate), and their answers were summarized into two threshold-frequency distributions (fig. 4). the monte carlo method was then used to assign threshold values to the 5000 modeled individuals based on the two distributions. this survey was approved by the irb at the university at buffalo. to initialize the simulation, all 5000 individuals were set to be non-adopters and susceptible to influenza. one individual was randomly chosen to be infectious on the first day. the model took a daily time step and simulated the two diffusion processes simultaneously over 200 days. the simulation results were presented as disease attack rates (total percent of symptomatic individuals in the population) and adoption rates (total percent of adopters in the population). another two characteristics were derived to indicate the speed of the coupled-diffusion: the epidemic slope and the adoption slope.
the former is defined as the total number of symptomatic individuals divided by the duration of the epidemic (in days); similarly, the latter is defined as the total number of adopters divided by the duration of the behavioral diffusion (in days). they are called slopes because graphically they approximate the slopes of the cumulative diffusion curves. a higher slope implies a faster diffusion, because there are more infections/adoptions (the numerator) in a shorter time period (the denominator). all simulation results were averaged over 50 model realizations to average out the randomness. simulation results were presented in two parts. first, the coupled-diffusion process under various transmission probabilities was analyzed and compared to an influenza-only process of the kind widely seen in the literature.
fig. 3. (a) standardized network properties (average path length and clustering coefficient) as a function of the rewiring probability p from 0 to 1, given n = 5000; (b) the power-law degree distributions for γ = 3, 5, and 7, given n = 5000; (c-e) an illustration of generated sw networks for the three p values, with n = 100 for figure clarity; (f-h) an illustration of sf networks for the three γ values, with n = 100.
the influenza-only process was simulated with the same parameters as the coupled-diffusion process except that individual preventive behavior was not considered. for ease of comparison, a typical "small-world" network (p = 0.05) was chosen for both the infection and communication networks, assuming the two overlap. the second part examined the dynamics of the coupled-diffusion under various structures of infection and communication networks, i.e., the 36 network combinations described above, while fixing the influenza transmission probability at 0.05 (resultant basic reproductive number r 0 = 1-1.3). fig. 5a indicates that the diffusion of influenza with and without the preventive behavior differs significantly, particularly for medium transmission probabilities (0.04-0.06). for the influenza-only process (the black curve with triangles), the disease attack rate rises dramatically as the transmission probability exceeds 0.03 and reaches a plateau of 50% when the probability increases to 0.07. the coupled-diffusion process (the black curve with squares) produces lower attack rates, which climb slowly to a maximum of 45%. this is because individuals gradually adopt preventive behavior, thereby inhibiting disease transmission from infectious individuals to the susceptible. meanwhile, the adoption rate (the gray curve with squares) also increases with the transmission probability and reaches a maximum of 65% of the population. this is not surprising: the more individuals get infected, the greater the risks and pressure other individuals may perceive, motivating them to adopt preventive behavior. individuals who never adopt may have extremely high thresholds of adoption (see fig. 4) and thus resist adopting preventive behavior. fig. 5b displays an example of the coupled-diffusion process (transmission probability = 0.05), ending up with nearly 2000 symptomatic cases and approximately 3000 adopters of flu antiviral drugs. despite differences in magnitude, the two diffusion curves exhibit a similar trend that follows the 5-phase s-shaped curve of innovation diffusion (rogers, 1995).
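the epidemic and adoption slopes defined at the beginning of this section can be computed directly from the daily simulation output. the fragment below is an illustrative sketch with placeholder data rather than the original analysis code; here the duration is approximated by the span of days with new cases.

```python
# illustrative computation of the two summary slopes (not the original analysis code);
# the input lists are hypothetical daily counts from repeated simulation runs.
def epidemic_slope(daily_new_symptomatic):
    """total symptomatic cases divided by the epidemic duration in days."""
    total = sum(daily_new_symptomatic)
    active_days = [d for d, n in enumerate(daily_new_symptomatic) if n > 0]
    duration = (active_days[-1] - active_days[0] + 1) if active_days else 1
    return total / duration

def adoption_slope(daily_new_adopters):
    """total adopters divided by the duration of the behavioral diffusion in days."""
    return epidemic_slope(daily_new_adopters)   # same definition, different series

# average over 50 realizations to smooth out stochastic variation (placeholder data)
runs = [[0, 1, 3, 7, 12, 9, 4, 1, 0, 0]] * 50
mean_epidemic_slope = sum(epidemic_slope(r) for r in runs) / len(runs)
print(mean_epidemic_slope)
```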
the 'innovation' phase occurs from the beginning to day 30, followed by the 'early acceptance' phase (days 31-50), 'early majority' (days 51-70), 'late majority' (days 71-90), and 'laggards' (after day 90). this simulated similarity in temporal trend is consistent with many empirical studies of flu infection and flu drug usage. for example, das et al. (2005) and magruder (2003) compared the temporal variation of influenza incidence and over-the-counter flu drug sales in new york city and the washington dc metropolitan area, respectively. both studies reported a high correlation between over-the-counter drug sales and cases of diagnosed influenza, and thus suggested that over-the-counter drug sales could be a possible early detector of disease outbreaks. the consistency with these observations, to some extent, reflects the validity of the proposed model. in addition to the transmission probability, the coupled-diffusion process is also sensitive to the various combinations of network structures, i.e., the 36 network combinations (fig. 6). the z axis represents either the epidemic or the adoption slope, and a greater value indicates a faster diffusion process.
fig. 6. the sensitivity of coupled-diffusion processes to various network structures: sw(infection)-sw(communication), sf(infection)-sf(communication), sw(infection)-sf(communication), and sf(infection)-sw(communication), each displayed in one row from top to bottom. sw and sf denote the network structure, while the labels in parentheses indicate the network function. parameter p is the rewiring probability of a sw network, taking values (0.005, 0.05, 0.5), while parameter γ is the exponent of a sf network, taking values (3, 5, 7). the z axis denotes epidemic slopes (left column) and adoption slopes (right column); a greater z value indicates a faster diffusion process.
in general, both the epidemic and adoption slopes change dramatically with the structure of the infection network, while they are less sensitive to variation in the communication network. given a small-world infection network (fig. 6a-b and e-f), the epidemic and adoption slopes increase quickly as the rewiring probability p rises from 0.005 to 0.5. when p = 0.005 (a regular network), almost all individuals are linked to their nearest neighbors, and influenza transmission between two distant individuals needs to pass through a large number of intermediate individuals. the slow spread of influenza induces a low perception of infection risk among individuals, thereby decelerating the dissemination of preventive behavior. as p increases to 0.5 (a random network), a large number of shortcuts exist in the network, and the transmission of influenza is greatly accelerated by these shortcuts. as a result, the diffusion of preventive behavior is also accelerated, because individuals perceive more risk of infection and take action quickly. likewise, given a scale-free infection network (fig. 6c-d and g-h), both influenza and preventive behavior diffuse much faster in a highly heterogeneous network (γ = 3) than in a relatively homogeneous network (γ = 7). this is because a highly heterogeneous network has a few super-spreaders who have numerous direct contacts. super-spreaders act as hubs, directly distributing the influenza virus to a large number of susceptible individuals and thus speeding the disease diffusion.
as individuals perceived more risks of infection in their surroundings, they will adopt preventive behavior faster. human networks, infectious diseases, and human preventive behavior are intrinsically inter-related, but little attention has been paid to simulating the three together. this article proposes a conceptual framework to fill this knowledge gap and offer a more comprehensive representation of the disease system. this twonetwork two diffusion framework is composed of four components, including individuals, networks, infectious diseases, and preventive behavior of individuals. the individual-based modeling approach can be employed to represent discrete individuals, while network structures support the formulization of individual interactions, including infection and communication. disease transmission models and behavioral models can be embedded into the network structures, and simulate disease infection and adoptive behavior, respectively. the collective changes in individuals' infection and adoption status represent the coupled-diffusion process at the population level. compared to the widely used influenza-only models, the proposed model produces a lower percent of infection, because preventive behavior protects certain individuals from being infected. sensitivity analysis identifies that the structure of infection network is a dominant factor in the coupled-diffusion, while the variation of communication network produces fewer effects. this research implies that current predictions about disease impacts might be under-estimating the transmissibility of the disease, e.g., the transmission probability per contact. modelers fit to observed data in which populations are presumably performing preventive behavior, while the models they create do not account for the preventive behavior. when they match their modeled infection levels to those in these populations, the disease transmissibility needs to be lower than its true value so as to compensate for the effects of preventive behavior. this issue has been mentioned in a number of recent research, such as ferguson et al. (2006) , but the literature contains few in-depth studies. this article moves the issue towards its solution, and stresses the importance of understanding human preventive behavior before policy making. the study raises an additional research question concerning social-distancing interventions for disease control, such as the household quarantine and workplace/school closure. admittedly, these interventions decompose the infection network for disease transmission, but they may also break down the communication network and limit the propagation of preventive behavior. the costs and benefits of these interventions remain unclear and a comprehensive evaluation is needed. the proposed framework also suggests several directions for future research. first, although the illustrative model is based on a hypothetical population, the representation principles outlined in this article can be applied to a real population. more realistic models can be established based on the census data, workplace data, and health survey data. second, the proposed framework focuses on inter-personal influence on human behavior, but has not included the effects of mass media, another channel of behavioral diffusion. the reason is that the effects of mass media remain inconclusive and difficult to quantify, while the effects of interpersonal influence have been extensively studied before. 
third, the proposed framework has not considered the 'risk compensation' effect, i.e., that individuals behave less cautiously in situations where they feel safer or more protected (cassell, halperin, shelton, & stanton, 2006). in the context of infectious diseases, risk compensation can be interpreted as individuals being less cautious about the disease once they have taken antiviral drugs, which may facilitate disease transmission. this health-psychological effect could also be incorporated to refine the framework. to summarize, this article proposes a synergy between epidemiology, social sciences, and human behavioral sciences. for a broader view, the conceptual framework could easily be expanded to include more theories, for instance from communications, psychology, and public health, thus forming a new interdisciplinary area. further exploration in this area would offer a better understanding of complex human-disease systems. the knowledge acquired would be of great significance given that vaccines and manpower may be insufficient to combat emerging infectious diseases.
error and attack tolerance of complex networks
the health belief model and personal health behavior
hiv and risk behaviour: risk compensation: the achilles' heel of innovations in hiv prevention?
breakdown of the internet under intentional attack
monitoring over-the-counter medication sales for early detection of disease outbreaks - new york city
an individual-based model of innovation diffusion mixing social value and individual benefit
mixing patterns and the spread of close-contact infectious diseases
strategies for mitigating an influenza pandemic
measuring personal networks with daily contacts: a single-item survey question and the contact diary
community structure in social and biological networks
health behavior and health education: theory, research, and practice
epidemiology. philadelphia: wb saunders
threshold models of collective behavior
individual-based modeling and ecology
on monte carlo simulation of diffusion
perspectives on antiviral use during pandemic influenza
control of communicable diseases manual
infectious disease modeling of social contagion in networks
the large-scale organization of metabolic networks
the rise of the individual-based model in ecology
networks and epidemic models
a contribution to the mathematical theory of epidemics
individual causal models and population system models in epidemiology
the web of human sexual contacts
containing pandemic influenza with antiviral agents
evaluation of over-the-counter pharmaceutical sales as a possible early warning indicator of human disease
visits by adults to family physicians for the common cold
epidemic spreading in scale-free networks
agent-based modeling toolkits netlogo, repast, and swarm
diffusion of innovations
social learning theory and the health belief model
social network analysis: a handbook
the spatial diffusion of an innovative health care plan
will vaccines be available for the next influenza pandemic?
self-care responses to symptoms by older people. a health diary study of illness behavior
an experimental study of the small world problem
social network thresholds in the diffusion of innovations
collective dynamics of small-world networks
world health organization report on infectious diseases. world health organization
the authors are thankful for insightful comments from the editor and two reviewers.
key: cord-336747-8m7n5r85 authors: grossmann, g.; backenkoehler, m.; wolf, v.
title: importance of interaction structure and stochasticity for epidemic spreading: a covid-19 case study date: 2020-05-08 journal: nan doi: 10.1101/2020.05.05.20091736 sha: doc_id: 336747 cord_uid: 8m7n5r85 in the recent covid-19 pandemic, computer simulations are used to predict the evolution of the virus propagation and to evaluate the prospective effectiveness of non-pharmaceutical interventions. as such, the corresponding mathematical models and their simulations are central tools to guide political decision-making. typically, ode-based models are considered, in which fractions of infected and healthy individuals change deterministically and continuously over time. in this work, we translate an ode-based covid-19 spreading model from literature to a stochastic multi-agent system and use a contact network to mimic complex interaction structures. we observe a large dependency of the epidemic's dynamics on the structure of the underlying contact graph, which is not adequately captured by existing ode-models. for instance, existence of super-spreaders leads to a higher infection peak but a lower death toll compared to interaction structures without super-spreaders. overall, we observe that the interaction structure has a crucial impact on the spreading dynamics, which exceeds the effects of other parameters such as the basic reproduction number r0. we conclude that deterministic models fitted to covid-19 outbreak data have limited predictive power or may even lead to wrong conclusions while stochastic models taking interaction structure into account offer different and probably more realistic epidemiological insights. on march 11th, 2020, the world health organization (who) officially declared the outbreak of the coronavirus disease 2019 (covid-19) to be a pandemic. by this date at the latest, curbing the spread of the virus then became a major worldwide concern. given the lack of a vaccine, the international community relied on non-pharmaceutical interventions (npis) such as social distancing, mandatory quarantines, or border closures. such intervention strategies, however, inflict high costs on society. hence, for political decision-making it is crucial to forecast the spreading dynamics and to estimate the effectiveness of different interventions. mathematical and computational modeling of epidemics is a long-established research field with the goal of predicting and controlling epidemics. it has developed epidemic spreading models of many different types: data-driven and mechanistic as well as deterministic and stochastic approaches, ranging over many different temporal and spatial scales (see [49, 15] for an overview). computational models have been calibrated to predict the spreading dynamics of the covid-19 pandemic and influenced public discourse. most models and in particular those with high impact are based on ordinary differential equations (odes). in these equations, the fractions of individuals in certain compartments (e.g., infected and healthy) change continuously and deterministically over time, and interventions can be modeled by adjusting parameters. in this paper, we compare the results of covid-19 spreading models that are based on odes to results obtained from a different class of models: stochastic spreading processes on contact networks. we argue that virus spreading models taking into account the interaction structure of individuals and reflecting the stochasticity of the spreading process yield a more realistic view on the epidemic's dynamic. 
if an underlying interaction structure is considered, not all individuals of a population meet equally likely as assumed for ode-based models. a wellestablished way to model such structures is to simulate the spreading on a network structure that represents the individuals of a population and their social contacts. effects of the network structure are largely related to the epidemic threshold which describes the minimal infection rate needed for a pathogen to be able to spread over a network [37] . in the network-free paradigm the basic reproduction number (r 0 ), which describes the (mean) number of susceptible individuals infected by patient zero, determines the evolution of the spreading process. the value r 0 depends on both, the connectivity of the society and the infectiousness of the pathogen. in contrast, in the network-based paradigm the interaction structure (given by the network) and the infectiousness (given by the infection rate) are decoupled. here, we focus on contact networks as they provide a universal way of encoding real-world interaction characteristics like super-spreaders, grouping of different parts of the population (e.g. senior citizens or children with different contact patterns), as well as restrictions due to spatial conditions and mobility, and household structures. moreover, models based on contact networks can be used to predict the efficiency of interventions [38, 34, 5] . here, we analyze in detail a network-based stochastic model for the spreading of covid-19 with respect to its differences to existing ode-based models and the sensitivity of the spreading dynamics on particular network features. we calibrate both, ode-models and stochastic models with interaction structure to the same basic reproduction number r 0 or to the same infection peak and compare the corresponding results. in particular, we analyze the changes in the effective reproduction number over time. for instance, early exposure of superspreaders leads to a sharp increase of the reproduction number, which results in a strong increase of infected individuals. we compare the times at which the number of infected individuals is maximal for different network structures as well as the death toll. our results show that the interaction structure has a major impact on the spreading dynamics and, in particular, important characteristic values deviate strongly from those of the ode model. in the last decade, research focused largely on epidemic spreading, where interactions were constrained by contact networks, i.e. a graph representing the individuals (as nodes) and their connectivity (as edges). many generalizations, e.g. to weighted, adaptive, temporal, and multi-layer networks exist [31, 44] . here, we focus on simple contact networks without such extensions. spreading characteristics on different contact networks based on the susceptible-infected-susceptible (sis) or susceptible-infected-recovered (sir) compartment model have been investigated intensively. in such models, each individual (node) successively passes through the individual stages (compartments). for an overview, we refer the reader to [35] . qualitative and quantitative differences between network structures and network-free models have been investigated in [22, 2] . in contrast, this work considers a specific covid-19 spreading model and focuses on those characteristics that are most relevant for covid-19 and which have, to the best of our knowledge, not been analyzed in previous work. 
sis-type models require knowledge of the spreading parameters (infection strength, recovery rate, etc.) and the contact network, which can partially be inferred from real-world observations. currently for covid-19, inferred data seems to be of very poor quality [24] . however, while the spreading parameters are subject to a broad scientific discussion, publicly available data, which could be used for inferring a realistic contact network, practically does not exist. therefore real-world data on contact networks are rare [30, 45, 23, 32, 43] and not available for large-scale populations. a reasonable approach is to generate the data synthetically, for instance by using mobility and population data based on geographical diffusion [46, 17, 36, 3] . for instance, this has been applied to the influenza virus [33] . due to the major challenge of inferring a realistic contact network, most of these works, however, focus on how specific network features shape the spreading dynamics. literature abounds with proposed models of the covid-19 spreading dynamics. very influential is the work of neil ferguson and his research group that regularly publishes reports on the outbreak (e.g. [11] ). they study the effects of different interventions on the outbreak dynamics. the computational modeling is based on a model of influenza outbreaks [19, 12] . they present a very high-resolution spatial analysis based on movement-data, air-traffic networks etc. and perform sensitivity analysis on the spreading parameters, but to the best of our knowledge not on the interaction data. interaction data were also inferred locally at the beginning of the outbreak in wuhan [4] or in singapore [40] and chicago [13] . models based on community structures, however, consider isolated (parts of) cities and are of limited significance for large-scale model-based analysis of the outbreak dynamic. another work focusing on interaction structure is the modeling of outbreak dynamics in germany and poland done by bock et al. [6] . the interaction structure within households is modeled based on census data. inter-household interactions are expressed as a single variable and are inferred from data. they then generated "representative households" by re-sampling but remain vague on many details of the method. in particular, they only use a single value to express the rich types of relationships between individuals of different households. a more rigorous model of stochastic propagation of the virus is proposed by arenas et al. [1] . they take the interaction structure and heterogeneity of the population into account by using demographic and mobility data. they analyze the model by deriving a mean-field equation. mean-field equations are more suitable to express the mean of a stochastic process than other ode-based methods but tend to be inaccurate for complex interaction structures. moreover, the relationship between networked-constrained interactions and mobility data remains unclear. other notable approaches use sir-type methods, but cluster individuals into age-groups [39, 28] , which increases the model's accuracy. rader et al. [41] combined spatial-, urbanization-, and census-data and observed that the crowding structure of densely populated cities strongly shaped the epidemics intensity and duration. in a similar way, a meta-population model for a more realistic interaction structure has been developed [8] without considering an explicit network structure. 
the majority of research, however, is based on deterministic, network-free sir-based ode-models. for instance, the work of josé lourenço et al. [29] infers epidemiological parameters based on a standard sir model. similarly, dehning et al. [9] use an sir-based ode-model, but the infection rate may change over time. they use their model to predict a suitable time point to loosen npis in germany. khailaie et al. analyze how changes in the reproduction number ("mimicking npis") affect changes in the epidemic dynamics using epidemic simulations [25] , where a variant of the deterministic, network-free sir-model is used and modified to include states (compartments) for hospitalized, deceased, and asymptotic patients. otherwise, the method is conceptually very similar to [29, 9] and the authors argue against a relaxation of npis in germany. another popular work is the online simulator covidsim 1 . the underlying method is also based on a network-free sir-approach [50, 51] . however, the role of an interaction structure is not discussed and the authors explicitly state that they believe that the stochastic effects are only relevant in the early stages of the outbreak. a very similar method has been developed at the german robert-koch-institut (rki) [7] . jianxi luo et al. proposed an ode-based sir-model 1 available at covidsim.eu all rights reserved. no reuse allowed without permission. was not certified by peer review) is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. the copyright holder for this preprint (which this version posted may 8, 2020. to predict the end of the covid-19 pandemic 2 , which is regressed with daily updated data. ode-models have also been used to project the epidemic dynamics into the "postpandemic" future by kissler et al. [27] . some groups also reside to branching processes, which are inherently stochastic but not based on a complex interaction structure [21, 42] . a very popular class of epidemic models is based on the assumption that during an epidemic individuals are either susceptible (s), infected (i), or recovered/removed (r). the mean number of individuals in each compartment evolves according to the following system of ordinary differential equations where n denotes the total population size, λ ode and β are the infection and recovery rates. typically, one assumes that n = 1 in which case the equation refers to fractions of the population, leading to the invariance s(t)+i(t)+r(t) = 1 for all t. it is trivial to extent the compartments and transitions. a stochastic network-based spreading model is a continuous-time stochastic process on a discrete state space. the underlying structure is given by a graph, where 2 available at ddi.sutd.edu.sg all rights reserved. no reuse allowed without permission. was not certified by peer review) is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. the copyright holder for this preprint (which this version posted may 8, 2020. . https://doi.org/10.1101/2020.05.05.20091736 doi: medrxiv preprint each node represents one individual (or other entities of interest). at each point in time, each node occupies a compartment, for instance: s, i, or r. moreover, nodes can only receive or transmit infections from neighboring nodes (according to the edges of the graph). for the general case with m possible compartments, this yields a state space of size m n , where n is the number of nodes. 
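for reference, the network-free sir dynamics referred to above can be written in the standard textbook form, shown here as a sketch in the notation of this section (λ ode is the infection rate, β the recovery rate, and n the population size):

\frac{ds}{dt} = -\frac{\lambda_{ode}}{n}\, s\, i, \qquad \frac{di}{dt} = \frac{\lambda_{ode}}{n}\, s\, i - \beta\, i, \qquad \frac{dr}{dt} = \beta\, i .

setting n = 1 gives the fractional form with the invariance s(t) + i(t) + r(t) = 1 mentioned above.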
the jump times until events happen are typically assumed to follow an exponential distribution. note that in the ode model, residual residence times in the compartments are not tracked, which naturally corresponds to the exponential distribution in the network model. hence, the underlying stochastic process is a continuous-time markov chain (ctmc) [26] . the extension to non-markovian semantics is trivial. we illustrate the three-compartment case in fig. 1 . the transition rates of the ctmc are such that an infected node transmits infections at rate λ. hence, the rate at which a susceptible node is infected is λ·#neigh(i), where #neigh(i) is the number of its infected direct neighbours. spontaneous recovery of a node occurs at rate β. the size of the state space renders a full solution of the model infeasible and approximations of the mean-field [14] or monte-carlo simulations are common ways to analyze the process. general differences to the ode model. this aforementioned formalism yields some fundamental differences to network-free ode-based approaches. the most distinct difference to network-free ode-based approaches is the decoupling of infectiousness and interaction structure. the infectiousness λ (i.e. the infection rate) is assumed to be a parameter expressing how contagious a pathogen inherently is. it encodes the probability of a virus transmission if two people meet. that is, it is independent from the social interactions of individuals (it might however depend on hygiene, masks, etc.). the influence of social contacts is expressed in the (potentially time-varying) connectivity of the graph. loosely speaking, it encodes the possibility that two individuals meet. in the ode-approach both are combined in the basic reproduction number. note that, throughout this manuscript, we use λ to denote the infectiousness of covid-19 (as an instantaneous transmission rate). another important difference is that ode-models consider fractions of individuals in each compartment. in the network-based paradigm, we model absolute numbers of entities in each compartment and extinction of the epidemic may happen with positive probability. while ode-models are agnostic to the actual population size, in network-based models, increasing the population by adding more nodes inevitably changes the dynamics. another important connection between the two paradigms is that if the network topology is a complete graph (resp. clique) then the ode-model gives an accurate approximation of the expected fractions of the network-based model. in systems biology this assumption is often referred to as well-stirredness. in the limit of an infinite graph size, the approximation approaches the true mean. all rights reserved. no reuse allowed without permission. was not certified by peer review) is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. to transform an ode-model to a network-based model, one can simply keep rates relating to spontaneous transitions between compartments as these transitions do not depend on interactions (e.g., recovery at rate β or an exposed node becoming infected). translating the infection rate is more complicated. in odemodels, one typically has given an infection rate and assumes that each infected individual can infect susceptible ones. to make the model invariant to the actual number of individuals, one typically divides the rate by the population size (or assumes the population size is one and the odes express fractions). 
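as a concrete illustration of these ctmc semantics, the following python fragment sketches a gillespie-style simulation of a networked sir process in which an infectious node transmits to each susceptible neighbor at rate λ and recovers at rate β. it is a minimal sketch for illustration, not the event-driven implementation used in this work; the graph and rate values are hypothetical.

```python
# illustrative gillespie-style simulation of a networked sir ctmc (a sketch, not the
# authors' event-driven implementation); an infected node transmits to each
# susceptible neighbor at rate lam and recovers at rate beta.
import random
import networkx as nx

def simulate_sir_ctmc(graph, lam, beta, patient_zero, t_max=200.0):
    status = {n: 'S' for n in graph.nodes}
    status[patient_zero] = 'I'
    infected = {patient_zero}
    t = 0.0
    history = [(t, 1)]
    while infected and t < t_max:
        # total rate = infection pressure on all S-I edges + recovery of infected nodes
        si_edges = [(i, s) for i in infected for s in graph.neighbors(i)
                    if status[s] == 'S']
        total_rate = lam * len(si_edges) + beta * len(infected)
        t += random.expovariate(total_rate)          # exponential jump time
        if random.random() < lam * len(si_edges) / total_rate:
            _, s = random.choice(si_edges)            # an infection event fires
            status[s] = 'I'
            infected.add(s)
        else:
            i = random.choice(list(infected))         # a recovery event fires
            status[i] = 'R'
            infected.remove(i)
        history.append((t, len(infected)))
    return history

# usage on a small-world contact graph with hypothetical rates
g = nx.watts_strogatz_graph(1000, 6, 0.05)
trace = simulate_sir_ctmc(g, lam=0.3, beta=1.0, patient_zero=0)
```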
naturally, in a contact network we do not work with fractions; each node corresponds to one entity. here, we propose to choose an infection rate such that the network-based model yields the same basic reproduction number r 0 as the ode-model. the basic reproduction number describes the (expected) number of individuals that an infected person infects in a completely susceptible population. we calibrate our model to this starting point of the spreading process, where there is a single infected node (patient zero). we assume that r 0 is either explicitly given or can implicitly be derived from an ode-based model specification. hence, when we pick a random node as patient zero, we want it to infect on average r 0 susceptible neighbors (all neighbors are susceptible at that point in time) before it recovers or dies. let us assume that, as in the aforementioned sir-model, infectious nodes infect each of their susceptible neighbors with rate λ and that an infectious node loses its infectiousness (by dying, recovering, or quarantining) with rate β. according to the underlying ctmc semantics of the network model, each susceptible neighbor gets infected with probability λ/(β+λ) [26]. note that we only take direct infections from patient zero into account and, for simplicity, assume all neighbors are infected only by patient zero. hence, when patient zero has k neighbors, the expected number of neighbors it infects is k·λ/(β+λ). since the mean degree of the network is k mean , the expected number of nodes infected by patient zero is k mean ·λ/(β+λ). now we can calibrate λ to relate to any desired r 0 , that is, λ = r 0 ·β/(k mean − r 0 ). note that we generally assume that r 0 > 1 and that no isolates (nodes with no neighbors) exist in the graph, which implies k mean ≥ 1. hence, by construction, it is not possible to have an r 0 that is larger than (or equal to) the average number of neighbors in the network. in contrast, in the deterministic paradigm this relationship is given by the equation r 0 = λ ode /β (cf. [29, 9]). note that the recovery rate β is identical in the ode- and network-model. we can translate the infection rate of an ode-model to a corresponding network-based stochastic model with the equation λ = λ ode /(k mean − λ ode /β) while keeping r 0 fixed. in the limit of an infinite complete network, this yields lim n→∞ λ = λ ode /n, which is equivalent to the effective infection rate λ ode /n in the ode-model for population size n (cf. eq. (1)). example. consider a network where each node has exactly 5 neighbors (a 5-regular graph) and let r 0 = 2. we also assume that the recovery rate is β = 1, which then yields λ ode = 2. the probability that a random neighbor of patient zero becomes infected is 2/5 = λ/(β+λ), which gives λ = 2/3. it is trivial to extend the compartments and transitions, for instance by including an exposed compartment for the time period in which an individual is infected but not yet infectious. the derivation of r 0 remains the same; the only requirement is the existence of distinct infection and recovery rates. in the next section, we discuss a more complex case.
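the calibration just derived can be written as a small helper function; the sketch below is illustrative (not the authors' code) and reproduces the 5-regular example from the text.

```python
# sketch of the calibration above (illustrative, not the authors' code):
# choose the network infection rate lam so that patient zero infects r0 neighbors
# on average, given recovery rate beta and mean degree k_mean.
def network_infection_rate(r0, beta, k_mean):
    if not (0 < r0 < k_mean):
        raise ValueError("requires 0 < r0 < mean degree")
    # from r0 = k_mean * lam / (beta + lam)  =>  lam = r0 * beta / (k_mean - r0)
    return r0 * beta / (k_mean - r0)

# the 5-regular example from the text: r0 = 2, beta = 1  ->  lam = 2/3
lam = network_infection_rate(2.0, 1.0, 5.0)
print(lam)               # 0.666...
# per-neighbor infection probability lam / (beta + lam) = 2/5
print(lam / (1.0 + lam))  # 0.4
```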
we denote the compartments by c = {s, e, c, i, h, u, r, d}, where each node can be susceptible(s), exposed (e), a carrier (c), infected (i), hospitalized (h), in the (intensive care unit (u), dead (d), or recovered (r). exposed agents are already infected but symptom-free and not infectious. carriers are also symptomfree but already infectious. infected nodes show symptoms and are infectious. therefore, we assume that their infectiousness is reduced by a factor of γ (γ ≤ 1, sick people will reduce their social activity). individuals that are hospitalized (or in the icu) are assumed to be properly quarantined and cannot infect others. note that accurate spreading parameters are very difficult to infer in general and the high number of undetected cases complicate the problem further in the current pandemic. here, we choose values that are within the ranges listed in [25] , where the ranges are rigorously discussed and justified. we document them in table 1 . we remark that there is a high amount of uncertainty in the spreading parameters. however, our goal is not a rigorous fit to data but rather a comparison of network-free ode-models to stochastic models with an underlying network structure. note that the mean number of days in a compartment is the inverse of the cumulative instantaneous rate to leave that compartment. for instance, the mean residence time in compartment h is as a consequence of the race condition of the exponential distribution [47] , r h modulates the probability of entering the successor compartment. that is, with probability r h , the successor compartment will be r and not u. inferring the infection rate λ for a fixed r 0 is somewhat more complex than in the previous section because this model admits two compartments for infectious agents. we first consider the expected number of nodes that a randomly chosen patient zero infects, while being in state c. we denote the corresponding basic reproduction number by r 0 . we calibrate the only unknown parameter λ accordingly (the relationships from the previous section remain valid). we explain the relation to r 0 when taking c and i into account in appendix a. substituting β by µ c gives naturally, it is extremely challenging to reconstruct large-scale contact-networks based on data. here, we test different types of contact networks with different features, which are likely to resemble important real-world characteristics. the contact networks are specific realizations (i.e. variates) of random graph models. different graph models highlight different (potential) features of the real-world interaction structure. the number of nodes ranges from 100 to 10 5 . we only use strongly connected networks (where each node is reachable from all other nodes). we refer to [10] or the networkx [18] documentation for further information about the network models discussed in the sequel. we provide a schematic visualization in fig. 3 . we consider erdős-rényi (er) random graphs as a baseline, where each pair of nodes is connected with a certain (fixed) probability. we also compute results for watts-strogatz (ws) random networks. they are based on a ring topology with random re-wiring. the re-wiring yields to a small-world property of the network. colloquially, this means that one can reach each node from each other node with a small number of steps (even when the number of nodes increases). 
we further consider geometric random networks (gn), where nodes are randomly sampled in an euclidean space and randomly connected such that nodes closer to each other have a higher connection probability. we also consider barabási-albert (ba) random graphs that are generated using a preferential attachment mechanism among nodes and graphs generated using the configuration model (cm-pl) which are-except from being constrained all rights reserved. no reuse allowed without permission. was not certified by peer review) is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. the copyright holder for this preprint (which this version posted may 8, 2020. . https://doi.org/10.1101/2020.05.05.20091736 doi: medrxiv preprint λ · #neigh(c) + λγ · #neigh(i) rate of transitioning from e to c rc 0.08 recovery probability when node is a carrier µc rate of leaving c ri 0.8 recovery probability when node is infected µi 1 5 rate of leaving i r h 0.74 recovery probability when node is hospitalized µ h rate of leaving h ru 0.46 recovery probability when node is in the icu µu 1 8 rate of leaving u all rights reserved. no reuse allowed without permission. was not certified by peer review) is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. on having power-law degree distribution-completely random. both models contain a very small number of nodes with very high degree, which act as superspreaders. we also test a synthetically generated household (hh) network that was loosely inspired by [2] . each household is a clique, the edges between households represent connections stemming from work, education, shopping, leisure, etc. we use a configuration model to generate the global inter-household structure that follows a power-law distribution. we also use a complete graph (cg) as a sanity check. it allows the extinction of the epidemic, but otherwise similar results to those of the ode are expected. we are interested in the relationship between the contact network structure, r 0 , the height and time point of the infection-peak, and the number of individuals ultimately affected by the epidemic. therefore, we run different network models with different r 0 (which is equivalent to fixing the corresponding values for λ or for r 0 ). for one series of experiments, we fix r 0 = 1.8 and derive the corresponding infection rate λ and the value for λ ode in the ode model. in the second experiments, calibrate λ and λ ode such that all infection peaks lie on the same level. in the sequel, we do not explicitly model npis. however, we note that the network-based paradigm makes it intuitive to distinguish between npis related to the probability that people meet (by changing the contact network) and npis all rights reserved. no reuse allowed without permission. was not certified by peer review) is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. the copyright holder for this preprint (which this version posted may 8, 2020. . https://doi.org/10.1101/2020.05.05.20091736 doi: medrxiv preprint related to the probability of a transmission happening when two people meet (by changing the infection rate λ). political decision-making is faced with the challenge of transforming a network structure which inherently supports covid-19 spreading to one which tends to suppress it. here, we investigate how changes in λ affect the dynamics of the epidemic in section 5 (experiment 3). 
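for illustration, the candidate contact networks described above can be instantiated with standard networkx generators. the fragment below is a sketch only, with placeholder parameter values; the household network is a simplified stand-in for the generator described in the text.

```python
# illustrative construction of the candidate contact networks with networkx
# (a sketch; parameter values are placeholders, and the household network below
# is a simplified stand-in for the generator described in the text).
import random
import networkx as nx

n = 1000
er = nx.erdos_renyi_graph(n, 6 / (n - 1))              # mean degree ~ 6
ws = nx.watts_strogatz_graph(n, 4, 0.2)                # ring topology + random re-wiring
gn = nx.random_geometric_graph(n, 0.1)                 # spatial proximity graph
ba = nx.barabasi_albert_graph(n, 2)                    # preferential attachment

# configuration model with a power-law-like degree sequence (super-spreaders possible)
degrees = [min(int(random.paretovariate(2.0)) + 1, n - 1) for _ in range(n)]
if sum(degrees) % 2:                                   # degree sum must be even
    degrees[0] += 1
cm_pl = nx.Graph(nx.configuration_model(degrees))      # drop parallel edges
cm_pl.remove_edges_from(nx.selfloop_edges(cm_pl))      # drop self-loops

# toy household network: cliques of size 4, plus random links between households
hh = nx.Graph()
households = [list(range(i, i + 4)) for i in range(0, n, 4)]
for h in households:
    hh.add_edges_from((u, v) for u in h for v in h if u < v)
for _ in range(n):                                     # sparse inter-household contacts
    a, b = random.sample(range(n), 2)
    hh.add_edge(a, b)
```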
we compare the solution of the ode model (using numerical integration) with the solution of the corresponding stochastic network-based model (using monte-carlo simulations). code will be made available at github.com/gerritgr/stochasticnetworkedcovid19. we investigate the evolution of mean fractions in each compartment over time, the evolution of the so-called effective reproduction number, and the influence of the infectiousness λ. setup. we used contact networks with n = 1000 nodes (except for the complete graph, where we used 100 nodes). to generate samples of the stochastic spreading process, we utilized event-driven simulation (similar to the rejection-free version in [16]). the simulation started with three random seed nodes in compartment c (and with an initial fraction of 3/1000 for the ode model). one thousand simulation runs were performed on a fixed variate of a random graph. we remark that results for other variates were very similar; hence, for better comparability, we refrained from taking an average over the random graphs. the parameters to generate a graph are: er: k mean = 6; ws: k = 4 (number of neighbors), p = 0.2 (re-wiring probability); gn: r = 0.1 (radius); ba: m = 2 (number of nodes for attachment); cm-pl: γ = 2.0 (power-law parameter), k min = 2; hh: household size 4, global network cm-pl with γ = 2.0, k min = 3. experiment 1: results with homogeneous r 0 . in our first experiment, we compare the epidemic's evolution (cf. fig. 4) while λ is calibrated such that all networks admit an r 0 of 1.8, with λ set (w.r.t. the mean degree) according to eq. (6). thereby, we analyze how well certain network structures generally support the spread of covid-19. the evolution of the mean fraction of nodes in each compartment is illustrated in fig. 4 and fig. 5. based on the monte-carlo simulations, we analyzed how the number r t of neighbors that an infectious node infects changes over time (cf. fig. 6). hence, r t is the effective reproduction number at day t (conditioned on the survival of the epidemic until t). for t = 0, the estimated effective reproduction number always starts around the same value and matches the theoretical prediction: independent of the network, the calibrated value of 1.8 yields r 0 ≈ 2.05 (cf. appendix a). in fig. 6 we see that the evolution of r t differs tremendously for different contact networks. unsurprisingly, r t decreases on the complete graph (cg), as nodes that become infectious later will not infect more of their neighbors. this also happens for gn- and ws-networks, but these cause a much slower decline of r t , which stays around 1 for most of the time (the sharp decrease at the end stems from the end of the simulation being reached). this indicates that the epidemic slowly "burns" through the network. in contrast, in networks that admit super-spreaders (cm-pl, hh, and also ba), it is in principle possible for r t to increase. for the cm-pl network, we observe a very early and intense infection peak, while the number of individuals ultimately affected by the virus (and consequently the death toll) remains comparably small (when we remove the super-spreaders from the network while keeping the same r 0 , the death toll and the time point of the peak increase; plot not shown). note that the high value of r t in fig.
6c during the first days results from the fact that super-spreaders become exposed early and later infect a large number of individuals. as there are very few super-spreaders, they are unlikely to be part of the seeds; however, due to their high centrality, they are likely to be among the first exposed nodes, leading to an "explosion" of the epidemic. in hh-networks this effect is much more subtle but follows the same principle. (in the r t plots, the x-axis is the day at which a node becomes exposed and the y-axis the mean number of neighbors this node infects while being a carrier or infected; results at later time points are noisier as the number of samples decreases, and the first data point, the simulation-based estimate of r 0 , is shown as a blue square.) experiment 2: calibrating r 0 to a fixed peak. next, we calibrate λ such that each network admits an infection peak (regarding i total ) of the same height (0.2). results are shown in fig. 7. they emphasize that there is no direct relationship between the number of individuals affected by the epidemic and the height of the infection peak, which is particularly relevant in light of limited icu capacities. they also show that vastly different infection rates and basic reproduction numbers are acceptable when the aim is to keep the peak below a certain threshold. in a third experiment, we examined how the height of the infection peak depends on the infectiousness λ; for the network models, r 0 (given by eq. (7)) is shown as a scatter plot in fig. 8 (note the different scales on the x- and y-axes). noticeably, the relationship is concave for most network models but almost linear for the ode model. this indicates that the network models are more sensitive to small changes of λ (and r 0 ). this suggests that the use of ode models might lead to a misleading sense of confidence because, roughly speaking, they tend to yield similar results when some noise is added to λ; that makes them seemingly robust to uncertainty in the parameters, while in reality the process is much less robust. assuming that ba-networks resemble some important features of real social networks, the non-linear relationship between infection peak and infectiousness indicates that small changes of λ, which could be achieved through proper hand-washing, wearing masks in public, and keeping distance to others, can significantly "flatten the curve". in this series of experiments, we tested how various network types influence an epidemic's dynamics. the network types highlight different potential features of real-world social networks. most results are in line with real-world observations. for instance, we found that better hygiene and the truncation of super-spreaders will likely reduce the peak of an epidemic by a large amount. we also observed that, even when r 0 is fixed, the evolution of r t largely depends on the network structure. for certain networks, in particular those admitting super-spreaders, it can even increase. an increasing reproduction number can be seen in many countries, for instance in germany [20]. how much of this can be attributed to super-spreaders is still being researched.
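the effective reproduction number r t discussed above can be estimated directly from simulation traces. the following sketch is illustrative rather than the authors' code, and assumes that each run logs, for every node, the day it became exposed and the node that infected it (both record names are hypothetical).

```python
# illustrative estimation of the effective reproduction number r_t from logged
# infection events (a sketch, not the authors' code). assumes two hypothetical
# records per run: exposure_day[node] and infector[node].
from collections import defaultdict

def effective_reproduction_number(exposure_day, infector):
    """mean number of secondary infections caused by nodes exposed on day t."""
    secondary = defaultdict(int)
    for node, source in infector.items():
        if source is not None:
            secondary[source] += 1
    per_day = defaultdict(list)
    for node, day in exposure_day.items():
        per_day[day].append(secondary[node])
    return {day: sum(counts) / len(counts) for day, counts in sorted(per_day.items())}

# toy example: node 0 (exposed on day 0) infected nodes 1 and 2; node 1 infected node 3
exposure_day = {0: 0, 1: 2, 2: 3, 3: 6}
infector = {0: None, 1: 0, 2: 0, 3: 1}
print(effective_reproduction_number(exposure_day, infector))
# {0: 2.0, 2: 1.0, 3: 0.0, 6: 0.0}
```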
note that superspreaders do not necessarily have to correspond to certain individuals. it can also, on a more abstract level, refer to a type of events. we also observed that cm-pl networks have a very early and very intense infection peak. however, the number of people ultimately affected (and therefore also the death toll) remain comparably small. this is somewhat surprising and requires further research. we speculate that the fragmentation in the network makes it difficult for the virus to "reach every corner" of the graph while it "burns out" relatively quickly for the high-degree nodes. we presented results for a covid-19 case study that is based on the translation of an ode model to a stochastic network-based setting. we compared several interaction structures using contact graphs where one was (a finite version of the) the implicit underlying structure of the ode model, the complete graph. we found that inhomogeneity in the interaction structure significantly shapes the epidemic's dynamic. this indicates that fitting deterministic ode models to real-world data might lead to qualitatively and quantitatively wrong results. the interaction structure should be included into computational models and should undergo the same rigorous scientific discussion as other model parameters. contact graphs have the advantage of encoding various types of interaction structures (spatial, social, etc.) and they decouple the infectiousness from the connectivity. we found that the choice of the network structure has a significant impact and it is very likely that this is also the case for the inhomogeneous interaction structure among humans. specifically, networks containing super-spreaders consistently lead to the emergence of an earlier and higher peak of the infection. moreover, the almost linear relationship between r 0 , λ ode , and the peak intensity in ode-models might also lead to misplaced confidence in the results. regarding the network structure in general, we find that super-spreaders can lead to a very early "explosion" of the epidemic. small-worldness, by itself, does not admit this property. generally, it seems that-unsurprisingly-a geometric network is best at containing a pandemic. this would imply evidence for corresponding mobility restrictions. surprisingly, we found a trade-off between the height of the infection peak and the fraction of individuals affected by the epidemic in total. for future work, it would be interesting to investigate the influence of non-markovian dynamics. ode-models naturally correspond to an exponentially distributed residence times in each compartment [48, 16] . moreover, it would be interesting to reconstruct more realistic contact networks. they would allow to investigate the effect of npis in the network-based paradigm and to have a well-founded scientific discussion about their efficiency. from a risk-assessment perspective, it would also be interesting to focus more explicitly on worst-case trajectories (taking the model's inherent stochasticity into account). this is especially relevant because the costs to society do not scale linearly with the characteristic values of an epidemic. for instance, when icu capacities are reached, a small additional number of severe cases might lead to dramatic consequences. all rights reserved. no reuse allowed without permission. was not certified by peer review) is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. 
the reachability probability from s − c to e − c is related to r 0 . it expresses the probability that an infected node (patient zero, in c) infects a specific random (susceptible) neighbor. the infection can happen via two paths. furthermore, we assume that this happens for all edges/neighbors of patient zero independently. assume a randomly chosen patient zero that is in compartment c. we are interested in r 0 in the model given in fig. 2 , assuming γ > 0. again, we consider each neighbor independently and multiply by k mean . moreover, we have to consider the likelihood that patient zero infects a neighbor while being in compartment c, as well as the possibility of transitioning to i and then transmitting the virus. this can be expressed as a reachability probability (cf. fig. 9 ) and gives rise to an equation of the form r 0 = k mean ( λ/(λ + μ c ) + μ c /(λ + μ c ) · λ/(λ + μ i ) ), writing μ c and μ i for the rates at which patient zero leaves compartments c and i. in the brackets, the first part of the sum expresses the probability that patient zero infects a random neighbor as long as it is in c. in the second part of the sum, the first factor expresses the probability that patient zero transitions to i before infecting a random neighbor. the second factor is then the probability of infecting a random neighbor as long as it is in i. note that, as we consider a fixed random neighbor, we need to condition the second part of the sum on the fact that the neighbor was not already infected in the first step.

derivation of the effective reproduction number r for covid-19 in relation to mobility restrictions and confinement analysis of a stochastic sir epidemic on a random network incorporating household structure generation and analysis of large synthetic social contact networks epidemiology and transmission of covid-19 in shenzhen china: analysis of 391 cases and 1,286 of their close contacts controlling contact network topology to prevent measles outbreaks mitigation and herd immunity strategy for covid-19 is likely to fail modellierung von beispielszenarien der sars-cov-2-ausbreitung und schwere in deutschland the effect of travel restrictions on the spread of the 2019 novel coronavirus inferring covid-19 spreading rates and potential change points for case number forecasts a first course in network theory strategies for mitigating an influenza pandemic community transmission of sars-cov-2 at two family gatherings-chicago, illinois binary-state dynamics on complex networks: pair approximation and beyond mathematical models of infectious disease transmission rejection-based simulation of nonmarkovian agents on complex networks epidemic spreading in urban areas using agent-based transportation models exploring network structure, dynamics, and function using networkx modeling targeted layered containment of an influenza pandemic in the united states schätzung der aktuellen entwicklung der sars-cov-2-epidemie in deutschland-nowcasting feasibility of controlling covid-19 outbreaks by isolation of cases and contacts representations of human contact patterns and outbreak diversity in sir epidemics insights into the transmission of respiratory infectious diseases through empirical human contact networks coronavirus disease 2019: the harms of
exaggerated information and non-evidence-based measures estimate of the development of the epidemic reproduction number rt from coronavirus sars-cov-2 case data and implications for political measures based on prognostics mathematics of epidemics on networks projecting the transmission dynamics of sars-cov-2 through the post-pandemic period contacts in context: large-scale setting-specific social mixing matrices from the bbc pandemic project fundamental principles of epidemic spread highlight the immediate need for large-scale serological surveys to assess the stage of the sars-cov-2 epidemic an infectious disease model on empirical networks of human contact: bridging the gap between dynamic network data and contact matrices temporal network epidemiology comparison of three methods for ascertainment of contact information relevant to respiratory pathogen transmission in encounter networks a small community model for the transmission of infectious diseases: comparison of school closure as an intervention in individual-based models of an influenza pandemic analysis and control of epidemics: a survey of spreading processes on complex networks epidemic processes in complex networks an agent-based approach for modeling dynamics of contagious disease spread threshold conditions for arbitrary cascade models on arbitrary networks optimal vaccine allocation to control epidemic outbreaks in arbitrary networks the effect of control strategies to reduce social mixing on outcomes of the covid-19 epidemic in wuhan, china: a modelling study investigation of three clusters of covid-19 in singapore: implications for surveillance and response measures crowding and the epidemic intensity of covid-19 transmission. medrxiv pattern of early human-to-human transmission of wuhan 2019 novel coronavirus (2019-ncov) a high-resolution human contact network for infectious disease transmission spreading processes in multilayer networks interaction data from the copenhagen networks study impact of temporal scales and recurrent mobility patterns on the unfolding of epidemics probability, markov chains, queues, and simulation: the mathematical basis of performance modeling non-markovian infection spread dramatically alters the susceptible-infected-susceptible epidemic threshold in networks an introduction to infectious disease modelling modelling the potential health impact of the covid-19 pandemic on a hypothetical european country all rights reserved. no reuse allowed without permission. key: cord-350646-7soxjnnk authors: becker, sara; chaple, michael; freese, tom; hagle, holly; henry, maxine; koutsenok, igor; krom, laurie; martin, rosemarie; molfenter, todd; powell, kristen; roget, nancy; saunders, laura; velez, isa; yanez, ruth title: virtual reality for behavioral health workforce development in the era of covid-19 date: 2020-10-09 journal: j subst abuse treat doi: 10.1016/j.jsat.2020.108157 sha: doc_id: 350646 cord_uid: 7soxjnnk the coronavirus 2019 disease (covid-19) pandemic emerged at a time of substantial investment in the united states substance use service infrastructure. a key component of this fiscal investment was funding for training and technical assistance (ta) from the substance abuse and mental health services administration (samhsa) to newly configured technology transfer centers (ttcs), including the addiction ttcs (attc network), prevention ttcs (pttc network), and the mental health ttcs (mhttc network). 
samhsa charges ttcs with building the capacity of the behavioral health workforce to provide evidence-based interventions via locally and culturally responsive training and ta. this commentary describes how, in the wake of the covid-19 pandemic, ttcs rapidly adapted to ensure that the behavioral health workforce had continuous access to remote training and technical assistance. ttcs use a conceptual framework that differentiates among three types of technical assistance: basic, targeted, and intensive. we define each of these types of ta and provide case examples to describe novel strategies that the ttcs used to shift an entire continuum of capacity building activities to remote platforms. examples of innovations include online listening sessions, virtual process walkthroughs, and remote “live” supervision. ongoing evaluation is needed to determine whether virtual ta delivery is as effective as face-to-face delivery or whether a mix of virtual and face-to-face delivery is optimal. the ttcs will need to carefully balance the benefits and challenges associated with rapid virtualization of ta services to design the ideal hybrid delivery model following the pandemic. the coronavirus 2019 disease pandemic emerged at a time of substantial investment in the united states substance use service infrastructure. between 2017 and 2019, congress released $3.3 billion dollars in grants to scale up substance use prevention, treatment, and recovery efforts in an attempt to curtail the overdose epidemic (goodnough, 2019) . a key component of this fiscal investment was funding for training and technical assistance (ta) from the substance abuse and mental health services administration (samhsa) to newly configured technology transfer centers (ttcs), including the addiction ttcs (attc network), prevention ttcs (pttc network), and mental health ttcs (mhttc network). to ensure the modernization of the behavioral health service system, samhsa charges ttcs with building the capacity of the behavioral health workforce to provide evidence-based interventions via locally and culturally responsive training and ta (katz, 2018) . in march 2020, the covid-19 pandemic upended the united states healthcare system, and challenged the behavioral health workforce in unprecedented ways. to meet the needs of the workforce, ttcs had to rapidly innovate to provide training and ta without service disruption. ttcs apply different ta strategies based on circumstances, need, and appropriateness (powell, 2015) and consider training (i.e., conducting educational meetings) as a discrete activity that can be provided as part of any ta effort. ttcs are guided by extensive evidence that strategies beyond training are required for practice implementation and organizational change (edmunds et al., 2013) , underscoring the critical need for virtual ta in the wake of the covid-19 pandemic. in may 2020, we surveyed all 39 u.s.-based ttcs to identify example innovations in each layer of the ta pyramid that the covid-19 necessitated. thirty-five ttcs (90%) across three networks (pttc n=13; attc n=13; mhttc n=9) responded, representing both regional and national ttcs. consultations. tccs typically deliver basic ta to large audiences and focus on building awareness and knowledge. common basic ta activities for untargeted audiences include conferences, brief consultation, and web-based lectures (i.e., webinars). 
ttcs reported a surge in requests for basic ta during the covid-19 pandemic and responded with a significant increase in dissemination of information (i.e., best practice guidelines), as well as brief consultations to support interpretation of such information. ttcs emphasized virtual content curation, organizing content to enhance usability. additionally, ttcs employed novel delivery channels, such as live streaming, pre-recorded videos, podcasts, and webinars with live transcription, to reach wide audiences. another practice innovation was online listening sessions in which health professionals convened around a priority topic. for instance, two national ttcs co-hosted a j o u r n a l p r e -p r o o f journal pre-proof virtual workforce development 6 series of listening sessions titled "emerging issues around covid-19 and social determinants of health" that experimented with "flipping the typical script" by first having participants engage in conversation and then having expert presenters address emergent topics via brief didactics. this series, which was not sequential or interconnected, built knowledge and awareness around evolving workforce needs. targeted ta is the provision of directed training or support to specific groups (e.g., clinical supervisors) or organizations (e.g., prevention coalitions) focused on building skill and promoting behavior change. targeted ta encompasses activities customized for specific recipients such as didactic workshop trainings, learning communities, and communities-of-practice. due to the focus on provider skill-building, targeted ta often relies on experiential learning activities such as role plays and behavioral rehearsal (edmunds et al., 2013) . to transition targeted ta online, ttcs reduced didactic material to the minimum necessary; spread content over several sessions; and leveraged technology to foster interaction among small groups. for example, one regional ttc transformed a face-to-face, multi-day motivational interviewing skills-building series by moving the delivery to a multi-week virtual learning series. this ttc kept participants engaged by limiting the time for each session to 1-2 hours, utilizing the full capabilities of videoconferencing platforms (e.g., small breakout rooms and interactive polling), and extending learning through sms text messages containing reminders of core skills. covid-net: a weekly summary of u.s. hospitalization data coronavirus disease 2019 (covid-19): cases in the the dynamic sustainability framework: addressing the paradox of sustainment amid ongoing change dissemination and implementation of evidence-based practices: training and consultation as implementation strategies implementation: the missing link between research and practice states are making progress on opioids. now the money that is helping them may dry up drug overdose deaths drop in u.s. 
for the first time since 1990 the substance abuse and mental health services administration key: cord-338588-rc1h4drd authors: li, xuanyi; sigworth, elizabeth a.; wu, adrianne h.; behrens, jess; etemad, shervin a.; nagpal, seema; go, ronald s.; wuichet, kristin; chen, eddy j.; rubinstein, samuel m.; venepalli, neeta k.; tillman, benjamin f.; cowan, andrew j.; schoen, martin w.; malty, andrew; greer, john p.; fernandes, hermina d.; seifter, ari; chen, qingxia; chowdhery, rozina a.; mohan, sanjay r.; dewdney, summer b.; osterman, travis; ambinder, edward p.; buchbinder, elizabeth i.; schwartz, candice; abraham, ivy; rioth, matthew j.; singh, naina; sharma, sanjai; gibson, michael k.; yang, peter c.; warner, jeremy l. title: seven decades of chemotherapy clinical trials: a pan-cancer social network analysis date: 2020-10-16 journal: sci rep doi: 10.1038/s41598-020-73466-6 sha: doc_id: 338588 cord_uid: rc1h4drd clinical trials establish the standard of cancer care, yet the evolution and characteristics of the social dynamics between the people conducting this work remain understudied. we performed a social network analysis of authors publishing chemotherapy-based prospective trials from 1946 to 2018 to understand how social influences, including the role of gender, have influenced the growth and development of this network, which has expanded exponentially from fewer than 50 authors in 1946 to 29,197 in 2018. while 99.4% of authors were directly or indirectly connected by 2018, our results indicate a tendency to predominantly connect with others in the same or similar fields, as well as an increasing disparity in author impact and number of connections. scale-free effects were evident, with small numbers of individuals having disproportionate impact. women were under-represented and likelier to have lower impact, shorter productive periods (p < 0.001 for both comparisons), less centrality, and a greater proportion of co-authors in their same subspecialty. the past 30 years were characterized by a trend towards increased authorship by women, with new author parity anticipated in 2032. the network of cancer clinical trialists is best characterized as strategic or mixed-motive, with cooperative and competitive elements influencing its appearance. network effects such as low centrality, which may limit access to high-profile individuals, likely contribute to the observed disparities. the modern era of chemotherapy began in 1946, with publications describing therapeutic uses of nitrogen mustard 1, 2 . over the next 70 years, the repertoire of available cancer treatments has expanded at an ever-increasing pace. chemotherapeutics have a notably low therapeutic index, i.e., the difference between a harmful and beneficial dose or combination is often quite small 3 . consequently, a complex international clinical trial apparatus emerged in the 1970s to study chemotherapeutics in controlled settings, and prospective clinical trials remain the gold standard by which standard of care treatments are established 4, 5 . discoveries made by successive generations have led to overall improvement in the prognosis of most cancers 6 . while social network analysis has been used to study patterns of co-authorship in scientific settings 7, 8 , the social component of clinical trial research is not well characterized. little is known about how social factors have shaped the progress of the field, as cancer care has become increasingly subspecialized, and how social network baseline characteristics. 
n = 5599 of 6301 reviewed publications with an aggregate of n = 29,197 authors met the inclusion criteria (consort figure s1 ). cumulatively, most authors in the network (n = 22,761, 78%) published at least one randomized trial, with n = 15,340 (52.5%) participating in the publication of a "positive" trial (table s2 ). most of the included authors (n = 28,087, 96.2%) participated in the primary publication of a clinical trial, while a smaller subgroup (n = 6,773, 23.2%) participated in the publication of updates. the most common venues for publication were high-impact clinical journals: the journal of clinical oncology (n = 1595, 28 .5%), the lancet family (n = 710, 12.7%), the new england journal of medicine (n = 495, 8.8%), and the blood family (n = 495, 8.8%). co-authorship has changed in a non-linear fashion over time: the median number of authors per publication increased from n = 6 in 1946 to n = 20 (iqr [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] in 2018 ( figure s2 ). across subspecialties, the median number of co-authors per publication varied somewhat, from a low of n = 10 (iqr 7-15) in gynecologic oncology to a high of n = 16 (iqr [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] in dermatologic oncology. median longevity is < 1 year at all times, although the number of authors with multiple years in the field grows substantially over time ( figure s3 ). a small number of individuals maintained the highest impact over time-nearly 20 years each in the case of chemotherapy pioneers sidney farber and james f. holland ( figure s4 ). in any given year, most authors had a betweenness centrality of < 1% of the maximum; conversely, a very small number of authors had an exceptionally high score, with 1% of authors accounting for 100% of the total in recent years ( figure s5 ). accordingly, an increasingly smaller proportion of authors were both very highly connected and highly impactful; in 1970, the 10% highest-impact authors (n = 20) account for 21.4% of links and 54.9% of impact; in 2018, the same proportion (n = 2920) account for 37.1% of links and 62.3% of impact. first/last authorship has also become concentrated; in 2018 publications, 10% of authors had at least one such role, whereas prior to 1980 it was on average > 25% ( figure s6 ). the structure of the network changes considerably over time, from relatively dense and connected to sparse and modular (fig. 1b) . the final network is very sparse (0.16% of possible links are present); nevertheless, n = 29,029 (99.4%) authors are in a single connected component; the next-largest component comprises 14 authors. each of the 13 cancer subspecialties developed at different rates, with clear influence of seminal events in several subspecialties, e.g., the introduction of adjuvant therapy and tamoxifen for breast cancer, completely new classes of drugs for plasma cell disorders, and immunotherapy for melanoma (fig. 1c) 25-31 . network visualization and cumulative metrics. the final cumulative network visualization is shown in figs. 2 & s7. the impact score of authors is unevenly distributed, median 0.0532 (range 0-14.31); however, the log-transformed impact scores approximate a normal distribution ( figure s8 ). 
authors with longevity ≥ 1 year who changed primary subspecialty at least once (n = 2330) had nearly twice the median impact and longevity of those who remained in one subspecialty (n = 10,276), 0.25 (iqr 0.11-0.6) versus 0.14 (iqr 0.07-0.35) and 13 years (iqr 6-19) versus 7 years (iqr 3-12), respectively (p < 0.001 for both comparisons). cumulatively, subspecialized authors with calculable homophily (n = 24,560) have a median proportion of co-authors sharing the same subspecialty of 88% (iqr 76-95%); 945,167 (71.4%) of these authors' outlinks are within-subspecialty. this is reflected by a high assortativity by subspecialty since the mid-1960s (fig. 1b) modularity follows a sigmoid pattern with a period of linear increase between 1960-80 followed by a plateau at high modularity; assortativity rapidly increases in early decades; median normalized pagerank decreases to a low plateau from the 1970s onward; (c) subspecialties develop at different but broadly parallel rates, with seminal events apparently preceding accelerations of individual subspecialties, e.g.,: (1) in the four years after 1973, combination therapy (ac 25 ), adjuvant therapy 26 , and tamoxifen 27 were introduced in breast cancer; (2) thalidomide 28 and bortezomib 29 were reported to be efficacious for multiple myeloma; and (3) immunotherapy (ipilimumab 30, 31 ) was introduced in the treatment of melanoma. www.nature.com/scientificreports/ sensitivity analysis. normalized score distributions did not change significantly, although modulation of the trial design coefficient led to a bimodal peak ( figure s11 ). correlation of assortativity and modularity was high, ranging from 0.815-0.999 for the former and 0.981-0.999 for the latter (table s3 ; figure s12 ). the remarkable gains in the fields of hematology and oncology can be ascribed to the tireless work of numerous trialists and the generosity of countless patient participants. as a result, systemic antineoplastics now stand beside surgery and radiotherapy as a pillar of cancer care. our analysis of clinical trialists as a social network, particularly with respect to the density distribution of pagerank, reveals a mixed-motive network that differs only authors assigned to a subspecialty are visualized; these account for 84% of all authors in the database. this figure highlights various clustering trends by subspecialty, such as the apparent sub-clusters of sarcoma research (yellow) and the two dominant clusters of breast cancer research (pink). it is clear as well that certain subspecialties are more cohesive than others, such as the tightly clustered dermatology (black) compared to the spread-out head and neck cancer authors (red). www.nature.com/scientificreports/ substantially from "collegial" and "friend-based" online social networks. while clinical trials are conducted towards a collaborative goal-improved outcomes for all cancer patients-there are significant competitive pressures. examples of these pressures include resource limitations (e.g., funding and patients available for accrual), the tension between prioritization of cooperative group versus industry-funded trials, personal motivations such as academic promotion or leadership opportunities, and institutional reputation. the emergence of formal and informal leaders in scientific networks has been shown to facilitate research as well as create clusters 32 . as fig. 
2 shows, there is a strong tendency for clustering based on subspecialty in the complete network, although some subspecialties (e.g., lymphoid and myeloid malignancies) have many more interconnections than others (e.g., sarcoma and neuro-oncology). many of these clusters appear to be organized around an individual or group of individuals who have high impact and centrality. as an organizational principle, these individuals appear to rarely be in direct competition, but their presence is a clear indicator of scale-free phenomena within the network. the facts that betweenness centrality follows a power law cumulative distribution bolsters this theory. scale-free phenomena, which are defined by a power law distribution of connectedness, are very common in strategic networks, especially when they become increasingly sparse, as this network does 33 . the two related theories for this network behavior are preferential attachment and fitness. the former observes that those with impact tend to attract more impact; the latter postulates that such gains for the "fittest" come at the expense of the "less fit" 34 . seminal events (fig. 1c) are likely a driver of preferential attachment 35 , and may the network is overwhelmingly dominated by men until 1980, when a trend towards increasing authorship by women begins to be seen; however, representation by women in first/last authorship remains low; gray shaded lines are 95% confidence intervals of the loess curves; (b) men tend on average to have a longer productive period and to achieve a higher author impact score than women (p < 0.001 for both comparisons); (c) men tend on average to be more central and have more collaborations outside of their subspecialty. note that the homophily calculation requires a subspecialty assignment, which explains the slightly lower numbers in (c) as compared to (b). www.nature.com/scientificreports/ partially explain why some authors change their primary subspecialty at least once over time (e.g., through a "bandwagon" effect driven by the diffusion of ideas 36 ). given that these authors were observed to have nearly twice the impact and longevity of their single subspecialty peers, this dynamic will be a focus of future study, including calculation of the q factor, a metric developed to quantify the ability of a scientist to "take advantage of the available knowledge in a way that enhances (q > 1) or diminishes (q < 1) the potential impact p of a paper" 37 . in the analysis of network dynamics (fig. 1b) , the field as a whole appears to emerge in the 1970s, which is also when medical oncology and hematology were formally recognized through board certification. measurements of field maturity are by their nature subjective, but the pessimism 38 of the late 1960s was captured by sidney farber: "…the anticancer chemicals, hormones, and the antibiotics…marked the beginning of the era of chemotherapy of cancer, which may be described after 20 years as disappointing because progress has not been more rapid…" 39 . these concerns prompted the us national cancer act of 1971, which was followed by the leveling of modularity at a very high level from 1976 onwards, suggesting that the subspecialties generated in the 1970s have remained stable. the assortativity by subspecialty has increased as well, with recent levels approximately twice those seen in a co-authorship network of physicists 20 . 
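the preferential-attachment mechanism invoked above, where "those with impact tend to attract more impact", can be illustrated with a standard toy model: grow a network in which each newcomer links to existing nodes with probability proportional to their current degree, and a small set of hubs ends up holding a disproportionate share of the links. the snippet below uses the generic barabási-albert generator in networkx and is purely illustrative; it is not a reconstruction of the trialist network.

```python
# Toy illustration of preferential attachment ("impact attracts impact"):
# grow a Barabasi-Albert graph and inspect the tail of its degree distribution.
import networkx as nx

G = nx.barabasi_albert_graph(n=10_000, m=3, seed=42)   # each newcomer makes 3 links
degrees = sorted((d for _, d in G.degree()), reverse=True)

total_links = 2 * G.number_of_edges()                  # each edge contributes two degrees
top_one_percent = degrees[: len(degrees) // 100]
print("max degree:", degrees[0], " median degree:", degrees[len(degrees) // 2])
print("share of links held by the top 1% of nodes:", sum(top_one_percent) / total_links)
```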
while median pagerank has decreased markedly, indicating decreasing influence for the average author, the distribution in 2018 is broadly right-skewed ( figure s13 ). these findings reveal a high level of increasing exclusivity, suggesting that it is becoming progressively more difficult to join the top echelon of the network. this has major implications for junior investigators' mobility, and potentially for the continued health of the network as a whole. while there is much to be applauded in the continued success of translating research findings into the clinic, we observed clear gender disparities within the cancer clinical trialist network: women have a statistically significantly lower final impact score, shorter productive period, less centrality, and less collaboration with those outside of their primary subspecialty. these findings are consistent with and build upon previous literature on www.nature.com/scientificreports/ the challenges facing women in pursuing and remaining in academic careers 10, 16, 19, 40 . they are also consistent with more recent gender disparity findings, such as those observed in research published on covid-19 41 . other studies investigating the basis for such a gender gap have identified several layers of barriers to the advancement of women in academic medicine. these include sexism in the academic environment, lack of mentorship, and inequity with regards to resource allocation, salary, space, and positions of influence 42, 43 . our study suggests that additional network factors such as relatively low centrality, which indicates a lack of access to other individuals of influence, and high homophily, which indicates a lack of access to new ideas and perspectives, also perpetuate the gender gap-corroborating recent findings from graduate school social networks 44 . it is somewhat encouraging that there has been a steady increase in the proportion of authorship by women since 1980 (fig. 3a) . this increase is observed approximately a decade after the passage of title ix of the us civil rights act in 1972. given that the majority of authors in this network are clinicians, a partial explanation could be that us-based women began to attend previously all-male medical schools in the early 1970s, completed their training, and began to contribute to the network as authors approximately 10 years later. if the nearly linear trend continues, we predict that gender parity for new authors entering the network will be reached by the year 2032, 26 years after us medical school enrollment approached parity 45 . however, the proportion of first/last authors who are women is growing much more slowly, and parity may not be reached for 50+ years, if at all. given that senior authorship is a traditional metric of scholarly productivity, it may be particularly difficult for clinical trialists who are women to obtain promotion under the current paradigm. one possible solution is to increase the role of joint senior authorship, which remains vanishingly rare in the clinical trials domain (furman et al. 2014 46 is one of very few examples that we are aware of)-although this is predicated on the acceptance of these roles by advancement and promotion committees. the field itself may also suffer from slow entry of new talent and a lack of broad perspectives. while the gender mapping algorithm and manual lookups are imperfect, our approach is consistent with prior work in this area 16, 47 . unisex names posed a particular challenge 48 . 
it should be noted that we could not account for all situations where an author changed their name (e.g., a person assumed their spouse's surname); this could have led to overestimation of representation by women and underestimation of impact, since this practice is more common with women. it is also possible that an individual's gender identity does not match the gender assignment of their given name. future work will include further analysis of gender disparities, factoring in institutional affiliation and highest degree(s) obtained, which are both likely to have significant influence on publication and senior authorship 49, 50 . there are several additional limitations to this work, starting with the fact that co-authorship is but one way to measure social network interactions and this study reports results from published trials, which induces publication bias. although hemonc.org aims to be the most comprehensive resource of its kind, non-randomized trials and randomized phase ii trials are intentionally underrepresented, given that findings at this stage of investigation infrequently translate to practice-changing results (e.g., approximately 70% of oncology drugs fail during phase ii) [51] [52] [53] . the effect of any biases introduced by this underrepresentation is unclear, given the confounding influence of publication bias, which may itself be subject to gender disparity 54 . some older literature which no longer has practice-changing implications may have been overlooked. during name disambiguation, some names could not be resolved, primarily because neither medline nor the primary journal site contained full names. this effect is non-random, since certain journals do not publish full names. the choice of coefficients and their relative weights was based on clinical intuition and consensus; given that the "worth" of metrics such as first/last authorship is fundamentally qualitative, there must be some degree of subjectivity when formulating a quantitative algorithm. while the sensitivity analysis demonstrated that neither normalized author impact score distribution, assortativity, nor modularity are majorly changed by variation in the trial design and author role coefficients, it remains possible that other combinations of coefficients and relative weightings could lead to different results. furthermore, our impact algorithm weighs heavily on first and last authorship, but the definition of senior authorship has changed over time. for example, in the 1946 article by goodman et al. 2 , the authors were listed in decreasing order of seniority (personal communication). in general, the impact score used in this paper, although similar to others proposed in the academic literature, is not validated and should be interpreted with caution. finally, the majority of authors in this database publish extensively, and their impact as measured here should not be misconstrued to reflect their contributions to the cancer field more broadly. in conclusion, we have described the first and most comprehensive social network analysis of the clinical trialists involved in chemotherapy trials. we found emergent properties of a strategic network and clear indications of gender disparities, albeit with improvement in representation in recent decades. the network has been highly modular and assortative for the past 40 years, with little collaboration across most subspecialties. 
as the field pivots from an anatomy-based to a precision oncology paradigm, it remains to be seen how the network will re-organize so that the incredible progress seen to date can continue. 1946-2018 and referenced on hemonc.org were considered for inclusion. hemonc.org is the largest collaborative wiki of chemotherapy drugs and regimens and has a formal curation process 55 . in order for a reference to be included on hemonc.org, it generally must include at least one regimen meeting the criteria outlined here: https ://hemon c.org/wiki/eligi bilit y_crite ria. as such, the majority of references on hemonc.org are randomized controlled trials (rcts) or non-randomized trials with at least 20 participants and/or practice-changing implications. one of the main goals of hemonc.org is creating a database of all standard of care systemic antineoplastic therapy regimens. this is difficult as there is no universally accepted definition of standard of care except in a www.nature.com/scientificreports/ legal capacity. for example, the state of washington, in its legislation on medical negligence, inversely defines the standard of care as "exercis[ing] that degree of skill, care, and learning possessed at that time by other persons in the same profession". we currently employ four separate definitions that meet the threshold of standard of care: 1. the control arm of a phase iii randomized controlled trial (rct). by implication, this means that all phase iii rcts with a control arm must eventually be included on the website. 2. the experimental arm(s) of a phase iii rct that provide(s) reasonable evidence (p-value less than 0.10) of superior efficacy for an intermediate surrogate endpoint (e.g., pfs) or a strong endpoint (e.g., os). 3. a non-randomized study that is either: 4. any study (including case series and retrospective studies) that is specifically recommended by a member of the hemonc.org editorial board. all section editors of the editorial board with direct oversight of diseasespecific pages are board-eligible or board-certified physicians. in order to identify new regimens and study references for inclusion on hemonc.org, we undertake several parallel screening methods: as part of the process of building hemonc.org, we have also systematically reviewed all lancet, jama, and new england journal of medicine tables of contents from 1946 to december 31, 2018. in addition, the citations of any included manuscript are hand-searched for additional citations. for any treatment regimen that has been subject to randomized comparison, we additionally seek to identify the first instance in which such a regimen was evaluated as an experimental arm; if no such determination can be made, we seek the earliest non-randomized description of the regimen for inclusion on the website. in order or prioritization, phase iii rcts are added first, then smaller rcts such as randomized phase ii, followed by non-randomized trials, followed by retrospective studies or case series identified by our editorial board as relevant to the practice of hematology/oncology. when a reference is added to hemonc.org, bibliographic information including authorship is recorded. the usually coincides with medline record details, although some older references in medline are capped at ten authors and are manually completed based upon the publication of record. 
for trials that do not list individual authors (e.g., the elderly lung cancer vinorelbine italian study group 56 ), the original manuscript and appendices are examined for a writing committee. if a writing committee is identified, the members of this committee are listed as authors in the order that they appeared in the manuscript. if no writing committee is identified, the chairperson(s) of the study group are listed as the first & last authors. if no chairpersons are listed, the corresponding author is listed as the sole author. www.nature.com/scientificreports/ publications solely consisting of the evaluation of drugs not yet approved by the fda or other international approval bodies were not included. trials that appeared in abstract form only, reviews, retrospective studies, meta-analyses, and case reports were excluded, as were trials reporting only on local interventions such as surgery, radiation therapy, and intralesional therapy. non-antineoplastic trials (table s1 ) and trials of supportive interventions (e.g., antiemesis; growth factor support) were also excluded. disambiguation of author names. for each included publication, author names were extracted and disambiguated. author names on hemonc.org are stored in the medline lastname_firstinitial (middleinitial) format, which can lead to two forms of ambiguity: (1) the short form, e.g., smith_j, can refer to two or more individuals, e.g., julian and jane smith; (2) two short forms can refer to the same individual, e.g., kantarjian_h and kantarjian_hm. additionally, names can be misspelled and individuals can change their name over time (e.g., a person assumes their spouse's surname). we undertook several steps to disambiguate names: (1) full first and middle names, when available, were programmatically accessed through the ncbi pubmed eutils 57 application programming interface; (2) when not available through medline, full first names were searched for on journal websites or through web search engines; (3) automatic rules were developed to merge likely duplicates; and (4) some names were manually merged (e.g., misspellings: benboubker_lofti and benboubker_lotfi; alternate forms: rigal-huguet_francoise and huguet-rigal_francoise; and subsumptions: baldotto_clarissa and serodio da rocha baldotto_clarissa). transformation algorithms are available upon request, and the full mapping table is provided in supplemental file 1. gender mapping. once the name disambiguation step was complete, we mapped authors with full name available to gender. we first mapped names to genders using us census data, which includes the relative frequencies of given names by gender in the population of us birth from 1880 to 2017. we calculated the gender ratio for names that appeared as both genders. for names with gender ratio > 0.9 for one gender (e.g., john, rebecca), we assigned the name to that gender. to expand gender mapping to include names that are more frequently seen internationally (e.g., jean, andreas), we used a program that searches from a dictionary containing gender information about names from most european countries as well as some asian and middle eastern countries 58 . for unmatched first names (e.g., dana, michele), we manually reviewed for potential gender assignment. for some names that are masculine in certain countries and feminine in others (e.g., andrea, daniele, and pascale are masculine in italy and feminine elsewhere), we mapped based on surnames. 
finally, we performed manual internet searches to look for photographs and pronouns used in online content such as faculty profiles, book biographies, and professional social media accounts for the remaining unmapped full names associated with a longevity of greater than one year. a total of 25,698 (88%) authors were assigned to the categories of woman (n = 8511; 29.2%) or man (n = 17,187; 58.9%). the gender of most of the people with unassigned names could not be determined because they only appeared with initials (n = 2716; 9.3%) in the primary publication and medline. the remaining n = 685 (2.3%) were ambiguously gendered names that could not be resolved through manual searching, and were excluded in the gender-specific analyses. the full mapping table is provided in supplemental file 2. author impact score. we considered existing metrics for measuring author impact 59-62 , but ultimately proceeded with our own formulation given some of the unique considerations of prospective clinical trials and their impact. every author was assigned an impact score, using an algorithm calculated per manuscript using four coefficients: (1) author role; (2) trial type; (3) citation score; (4) primary versus updated analysis. the coefficients are multiplied to arrive at the score, and the total author impact score is summed across all of their published manuscripts. author role: first and last author roles are assigned a coefficient of three; middle authors are assigned a coefficient of one. when joint authorship is denoted in a medline record, there is an additional attribute "equalcontrib" that is set to "y" (yes). we look for this during the parsing process and treat these authors as first or last authors when the attribute is detected. trial type: any prospective trial with randomization is denoted as randomized and the authors of any manuscript reporting on such a trial are assigned a coefficient of two. non-randomized trials are assigned a coefficient of one. for manuscripts that reported on more than one trial with mixed designs (i.e., one or more randomized and one or more non-randomized trials), the randomized coefficient was used. citation score: we programmatically obtained a snapshot of citation counts from google scholar from september 2019 and used unadjusted total citations as the citation score coefficient for the years 1946-2008. as more recent publications are still accruing citations, raw citation count is not an appropriate measure of their impact. therefore, we have calculated a blended citation score for articles published between 2009-2018, adding the phased in median citation count for the journal tier in which the article was published for the years 1946-2008 (see tables s4 & s5 and figure s14 ). the citations scores are normalized to the manuscript with the maximum number of citations (stupp et al. 2005 63 , with 13,341 citations), such that the maximum citation score is one. primary publications vs. updates: the baseline coefficient is one. for updates, this score is multiplied by a half-life decay coefficient; i.e., scores for the first update are multiplied by 50%; scores for the second update by 25%; and so forth. this rule is applied equally to updates and subgroup analyses. for manuscripts that reported on pooled updates of more than one trial, the score was multiplied by the half-life coefficient corresponding to the update that resulted in the maximum score. see examples in supplemental methods. 
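the per-manuscript scoring rules above translate directly into a small function. the sketch below follows the coefficients as stated (3 for first/last authors, 1 for middle authors; 2 for randomized trials, 1 otherwise; citations normalized to the most-cited manuscript, stupp et al. 2005 with 13,341 citations; and a half-life decay of 0.5^k for the k-th update), but the data structures and names are illustrative rather than taken from the authors' code.

```python
# Illustrative re-implementation of the per-manuscript author impact contribution
# described in the text; field and function names are our own.
from dataclasses import dataclass

MAX_CITATIONS = 13_341  # most-cited manuscript in the dataset (Stupp et al. 2005)

@dataclass
class Manuscript:
    randomized: bool        # reports at least one randomized trial
    citations: int          # citation score (already blended for 2009-2018 papers)
    update_number: int = 0  # 0 = primary publication, 1 = first update, ...

def author_impact(ms: Manuscript, is_first_or_last: bool) -> float:
    role = 3.0 if is_first_or_last else 1.0          # author-role coefficient
    design = 2.0 if ms.randomized else 1.0           # trial-type coefficient
    citation_score = ms.citations / MAX_CITATIONS    # normalized to [0, 1]
    decay = 0.5 ** ms.update_number                  # half-life decay for updates
    return role * design * citation_score * decay

def total_impact(publications):
    """Cumulative impact for one author: sum of per-manuscript contributions."""
    return sum(author_impact(ms, senior) for ms, senior in publications)

if __name__ == "__main__":
    primary = Manuscript(randomized=True, citations=5000, update_number=0)
    update1 = Manuscript(randomized=True, citations=800, update_number=1)
    print(total_impact([(primary, True), (update1, False)]))
```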
www.nature.com/scientificreports/ subspecialty designation of each publication. each publication was assigned to one of 13 diseasespecific cancer subspecialties based on the cancer(s) studied (table s1 ). the majority of publications report on a clinical trial carried out in one disease or several diseases mapping to the same subspecialty. for publications studying diseases that map to more than one subspecialty, each author's impact score for that publication was divided evenly across the subspecialties. several clinical trials employ a site-agnostic approach, e.g., to a "cancer of unknown primary" or to biomarker-defined subsets of cancers (e.g., a basket trial 64 ); for these, impact across subspecialties was split manually (table s6) . subspecialty designation based on authorship. authors were eligible for assignment to a primary subspecialty based on whether they were a first or last author at least once in the subspecialty, or whether they had a cumulative impact of at least one standard deviation below the mean of the author impact score of all authors in the subspecialty. authors who met either of these criteria were assigned to a primary subspecialty based on where the majority of their impact lay; if an author had equal impact in two or more subspecialties they were assigned equally to the subspecialties. this assignment was recalculated on an annual basis if the author had new publications, and primary subspecialty was re-assigned if a new subspecialty met either of the criteria and the impact in that subspecialty was higher than in the previous primary subspecialty. authors not meeting either of these criteria were assigned a primary subspecialty of "none" and were not included in the homophily analysis or the network visualization. social network construction and metrics. a dynamic social network was created with nodes representing authors and links representing co-authorship. the dynamic social network was discretized by year and the authors, scores, and links were cumulative (e.g., the 20 th network was cumulative from 1946-1965). therefore, once an author is added to the network, they remain in the network, with their impact score cumulatively increasing as they publish and remaining constant if publication activity ceases. 
the following temporal metrics were calculated: (1) network density (the number of actual connections/links present divided by the total number of potential connections); (2) modularity 65 by subspecialty (a measure of how strongly a network is divided into distinct communities, in this case subspecialties, defined as the number of edges that fall within a set of specified communities minus the number expected in a network with the same number of vertices and edges whose edges were placed randomly); (3) assortativity 66 by subspecialty (a measure of the preference of nodes in a network to attach to others that are similar in a defined way, in this case the same subspecialty; assortativity is positive if similar vertices tend to connect to each other, and negative if they tend to not connect to each other); (4) betweenness centrality 67 (a measure reflecting how important an author is in connecting other authors, calculated as the proportion of times that an author is a member of the bridge that forms the shortest path between any two other authors); (5) pagerank 68 (another measure of centrality, this time considering the connection patterns among each author's immediate neighbors; its value for each author is the probability that a person starting at any random author and randomly selecting links to other authors will arrive at the author); and (6) proportion of co-authors sharing either the same primary subspecialty designation or the same gender (hereafter referred to as homophily). network density, modularity, and assortativity are calculated at the network level, while betweenness centrality, pagerank, and homophily are calculated at the author (node) level. further definitions of these metrics are provided in the supplemental glossary. all metrics incorporated the weighted co-authorship score, which takes into account each co-author's impact modified by the number of authors of an individual publication. for each pairwise collaboration, as defined by co-authorship on the same manuscript, a co-authorship score was calculated and used as the edge weight; duplicated edges were allowed to reflect the fact that weights could be distributed in a non-even fashion (e.g., two co-authors could be middle authors on a lower-impact publication as well as senior authors on a separate high-impact publication). this score was first calculated by multiplying the individual authors' manuscriptspecific impact scores together. in order to acknowledge the role of middle authors in large multi-institutional studies, this preliminary score was divided by the total number of authors on the manuscript. this has the effect of decreasing the weight of any individual co-authorship relationship in a paper with many authors, while allowing the overall weight of the neighborhood consisting of all co-authorship connections to increase linearly with the number of authors (see examples in supplemental methods). in order to visualize the final cumulative network, layout was determined using the distributed recursive graph algorithm 69 . nodes were sized by author impact score rank and colored by primary subspecialty designation. edge width was determined by the weighted co-authorship score. statistical analysis. non-independent network metrics including growth, density, assortativity, modularity, and pagerank are reported descriptively with medians and interquartile ranges (iqr). 
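most of the metrics listed above have direct counterparts in standard network libraries. the sketch below shows one way to compute them with networkx for a cumulative, weighted co-authorship graph; the node attribute name ("subspecialty"), the toy edge weights, and the helper function are ours, and betweenness is computed unweighted here because networkx interprets edge weights as distances rather than collaboration strengths.

```python
# Sketch of the cumulative-network metrics described above, using networkx.
# The attribute name "subspecialty" and the toy edge weights are illustrative.
import networkx as nx
from networkx.algorithms.community import modularity

def network_metrics(G, attr="subspecialty"):
    # Group nodes into communities by subspecialty for the modularity computation.
    groups = {}
    for v, data in G.nodes(data=True):
        groups.setdefault(data[attr], set()).add(v)
    communities = list(groups.values())

    metrics = {
        "density": nx.density(G),
        "modularity": modularity(G, communities, weight="weight"),
        "assortativity": nx.attribute_assortativity_coefficient(G, attr),
        # Unweighted betweenness: networkx treats weights as shortest-path distances.
        "betweenness": nx.betweenness_centrality(G),
        "pagerank": nx.pagerank(G, weight="weight"),
    }
    # Homophily: proportion of each author's co-authors sharing their subspecialty.
    metrics["homophily"] = {
        v: sum(1 for u in G.neighbors(v) if G.nodes[u][attr] == G.nodes[v][attr])
           / max(G.degree(v), 1)
        for v in G
    }
    return metrics

if __name__ == "__main__":
    G = nx.Graph()
    # Co-authorship edge weight = product of the two impact contributions,
    # divided by the number of authors on the shared manuscript (as in the text).
    G.add_edge("author_a", "author_b", weight=0.30 * 0.20 / 4)
    G.add_edge("author_b", "author_c", weight=0.10 * 0.20 / 4)
    nx.set_node_attributes(
        G, {"author_a": "breast", "author_b": "breast", "author_c": "sarcoma"}, "subspecialty"
    )
    print(network_metrics(G))
```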
gender proportion over time was fit with locally estimated scatterplot smoothing (loess) regression using default settings of degree = 2 with smoothing parameter/span α = 0.75 70 . for the final cumulative network, the independent variables author impact score and longevity were compared (1) between genders and (2) by whether the author changed subspecialties over time; only those authors with longevity ≥ 1 year were included in the second comparison. these comparisons were made with the two-sided wilcoxon rank sum test; p value < 0.05 was considered statistically significant. www.nature.com/scientificreports/ sensitivity analysis. to determine whether the scoring algorithm was robust to modifications, we conducted a sensitivity analysis where the author role and trial design coefficients were varied by ± 67% and ± 50%, respectively. normalized density distributions for the final cumulative network under each permutation were calculated, and temporal assortativity and modularity were compared to baseline with pearson's correlation coefficient. a version of this manuscript is posted on the medrxiv preprint server, accessible here: https ://www.medrx iv.org/conte nt/10.1101/19010 603v1 . a very early version of the work was presented in poster format at the 2018 visual analytics in healthcare workshop (november 2018). there are no other prior presentations. the datasets generated and analyzed in this study are available at harvard dataverse 71 . received: 3 january 2020; accepted: 17 september 2020 scientific reports | (2020) 10:17536 | https://doi.org/10.1038/s41598-020-73466-6 www.nature.com/scientificreports/ the biological actions and therapeutic applications of the b-chloroethyl amines and sulfides nitrogen mustard therapy; use of methyl-bis (beta-chloroethyl) amine hydrochloride and tris (beta-chloroethyl) amine hydrochloride for hodgkin's disease, lymphosarcoma, leukemia and certain allied and miscellaneous disorders general principles of cancer chemotherapy historical and methodological developments in clinical trials at the national cancer institute a history of cancer chemotherapy cancer statistics associating co-authorship patterns with publications in high-impact journals breast cancer publication network: profile of co-authorship and co-organization nepotism and sexism in peer-review inequality quantified: mind the gender gap expectations of brilliance underlie gender distributions across academic disciplines gender contributes to personal research funding success in the netherlands comparison of national institutes of health grant amounts to first-time male and female principal investigators women and academic medicine: a review of the evidence on female representation the 'gender gap' in authorship of academic medical literature-a 35-year perspective bibliometrics: global gender disparities in science gender disparities in high-quality research revealed by nature index journals the gender gap in highest quality medical research-a scientometric analysis of the representation of female authors in highest impact medical journals historical comparison of gender inequality in scientific careers across countries and disciplines the structure of scientific collaboration networks strategic networks access to expertise as a form of social capital: an examination of race-and class-based disparities in network ties to experts broadening the science of broadening participation in stem through critical mixed methodologies and intersectionality frameworks the perils of intersectionality: 
racial and sexual harassment in medicine combination chemotherapy with adriamycin and cyclophosphamide for advanced breast cancer 1-phenylalanine mustard (l-pam) in the management of primary breast cancer. a report of early findings tamoxifen (antiestrogen) therapy in advanced breast cancer antitumor activity of thalidomide in refractory multiple myeloma phase i trial of the proteasome inhibitor ps-341 in patients with refractory hematologic malignancies phase i/ii study of ipilimumab for patients with metastatic melanoma improved survival with ipilimumab in patients with metastatic melanoma leadership in complex networks: the importance of network position and strategic action in a translational cancer research network a unified framework for the pareto law and matthew effect using scale-free networks experience versus talent shapes the structure of the web topology of evolving networks: local events and universality threshold models of collective behavior quantifying the evolution of individual scientific impact cancer chemotherapy-present status and prospects chemotherapy in the treatment of leukemia and wilms' tumor women in academic medicine leadership: has anything changed in 25 years? covid-19 amplifies gender disparities in research why aren't there more women leaders in academic medicine? the views of clinical department chairs the "gender gap" in authorship of academic medical literature-a 35-year perspective a network's gender composition and communication pattern predict women's leadership success distribution of medical school graduates by gender idelalisib and rituximab in relapsed chronic lymphocytic leukemia gender bias in scholarly peer review name-centric gender inference using data analytics research productivity in academia: a comparative study of the sciences, social sciences and humanities the gender gap in peer-reviewed publications by physical therapy faculty members: a productivity puzzle comparison of evidence of treatment effects in randomized and nonrandomized studies can the pharmaceutical industry reduce attrition rates? contradicted and initially stronger effects in highly cited clinical research double-blind peer review and gender publication bias org: a collaborative online knowledge platform for oncology professionals effects of vinorelbine on quality of life and survival of elderly patients with advanced non-small-cell lung cancer trying an authorship index measuring co-authorship and networking-adjusted scientific impact how has healthcare research performance been assessed? a systematic review a new index to use in conjunction with the h-index to account for an author's relative contribution to publications with high impact radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma new clinical trial designs in the era of precision medicine: an overview of definitions, strengths, weaknesses, and current use in oncology modularity and community structure in networks assortative mixing in networks a set of measures of centrality based on betweenness the anatomy of a large-scale hypertextual web search engine drl: distributed recursive (graph) layout locally weighted regression: an approach to regression analysis by local fitting replication data for: seven decades of chemotherapy clinical trials: a pan-cancer social network analysis vanderbilt university) conducted and are responsible for the data analysis. we declare the following interests gibson are members of the editorial board of hemonc.org. rozina a. chowdhery, ronald s. 
key: cord-333088-ygdau2px authors: roy, manojit; pascual, mercedes title: on representing network heterogeneities in the incidence rate of simple epidemic models date: 2006-03-31 journal: ecological complexity doi: 10.1016/j.ecocom.2005.09.001 sha: doc_id: 333088 cord_uid: ygdau2px abstract mean-field ecological models ignore space and other forms of contact structure. at the opposite extreme, high-dimensional models that are both individual-based and stochastic incorporate the distributed nature of ecological interactions. in between, moment approximations have been proposed that represent the effect of correlations on the dynamics of mean quantities. as an alternative closer to the typical temporal models used in ecology, we present here results on “modified mean-field equations” for infectious disease dynamics, in which only mean quantities are followed and the effect of heterogeneous mixing is incorporated implicitly. we specifically investigate the previously proposed empirical parameterization of heterogeneous mixing in which the bilinear incidence rate si is replaced by a nonlinear term k s^p i^q, for the case of stochastic sirs dynamics on different contact networks, from a regular lattice to a random structure via small-world configurations. we show that, for two distinct dynamical cases involving a stable equilibrium and a noisy endemic steady state, the modified mean-field model approximates successfully the steady state dynamics as well as the respective short and long transients of decaying cycles. this result demonstrates that early on in the transients an approximate power-law relationship is established between global (mean) quantities and the covariance structure in the network. the approach fails in the more complex case of persistent cycles observed within the narrow range of small-world configurations.
most population models of disease (anderson and may, 1992) assume complete homogeneous mixing, in which an individual can interact with all others in the population. in these well-mixed models, the disease incidence rate is typically represented by the term bsi that is bilinear in s and i, the number of susceptible and infective individuals (bailey, 1975), b being the transmission coefficient. with these models it has been possible to establish many important epidemiological results, including the existence of a population threshold for the spread of disease and the vaccination levels required for eradication (kermack and mckendrick, 1927; anderson and may, 1992; smith et al., 2005). however, individuals are discrete and not well-mixed; they usually interact with only a small subset of the population at any given time, thereby imposing a distinctive contact structure that cannot be represented in mean-field models. explicit interactions within discrete spatial and social neighborhoods have been incorporated into a variety of individual-based models on a spatial grid and on networks (bolker and grenfell, 1995; johansen, 2005). simplifications of these high-dimensional models have been developed to better understand their dynamics, make them more amenable to mathematical analysis and reduce computational complexity (keeling, 1999; eames and keeling, 2002; franc, 2004). these approximations are based on moment closure methods and add corrections to the mean-field model due to the influence of covariances, as well as equations for the dynamics of these second order moments (pacala and levin, 1997; bolker, 1999; brown and bolker, 2004).
we address here an alternative simplification approach closer to the original mean-field formulation, which retains the basic structure of the mean-field equations but incorporates the effects of heterogeneous mixing implicitly via modified functional forms (mccallum et al., 2001). specifically, the bilinear transmission term (si) in the well-mixed equations is replaced by a nonlinear term s^p i^q (severo, 1969), where the exponents p, q are known as ''heterogeneity parameters''. this formulation allows an implicit representation of distributed interactions when the details of individual-level processes are unavailable (as is often the case, see gibson, 1997), and when field data are collected in the form of a time series (e.g., koelle and pascual, 2004). we henceforth refer to these modified equations as the heterogeneous mixing, or ''hm'', model following maule and filipe (in preparation). the hm model is known to exhibit important properties not observed in standard mean-field models, such as the presence of multiple equilibria and periodic solutions (liu et al., 1986, 1987; hethcote and van den driessche, 1991; hochberg, 1991). this model has also been successfully fitted to the experimental time series data of lettuce fungal disease to explain its persistence (gubbins and gilligan, 1997). however, it is not well known whether these modified mean-field equations can indeed approximate the population dynamics that emerge from individual-level interactions. motivated by infectious diseases of plants, maule and filipe (in preparation) have recently compared the dynamics of the hm model to a stochastic susceptible-infective (si) model on a spatial lattice. in this paper, we implement a stochastic version of the susceptible-infective-recovered-susceptible (sirs) dynamics, to consider a broader range of dynamical behaviors including endemic equilibria and cycles (bailey, 1975; murray, 1993; johansen, 1996). recovery from disease leading to the development of temporary immunity is also relevant to many infectious diseases in humans, such as cholera (koelle and pascual, 2004). for the contact structure of individuals in the population we use a small-world algorithm, which is capable of generating an array of configurations ranging from a regular grid to a random network (watts and strogatz, 1998). theory on the structural properties of these networks is well developed (watts, 2003), and these properties are known to exist in many real interaction networks (dorogotsev and mendes, 2003). a small-world framework has also been used recently to model epidemic transmission processes of severe acute respiratory syndrome or sars (masuda et al., 2004; verdasca et al., 2005). we demonstrate that the hm model can accurately approximate the endemic steady states of the stochastic sirs system, including its short and long transients of damped cycles under two different parameter regimes, for all configurations between the regular and random networks. we show that this result implies the establishment early on in the transients of a double power-law scaling relationship between the covariance structure on the network and global (mean) quantities at the population level (the total numbers of susceptible and infective individuals).
we also demonstrate the existence of a complex dynamical behavior in the stochastic system within the narrow small-world region, consisting of persistent cycles with enhanced amplitude and a well-defined period that are not predicted by the equivalent homogeneous mean-field model. in this case, the hm model captures the mean infection level and the overall pattern of the decaying transient cycles, but not their phases. the model also fails to reproduce the persistence of the cycles. we conclude by discussing the potential significance and limitations of these observations.

the model. 2.1. stochastic formulation. the population structure, that is, the social contact pattern among individuals in the population, is modeled using a small-world framework as follows. we start with a spatial grid with the interaction neighborhood restricted to eight neighbors (fig. 1a) and periodic boundary condition, and randomly rewire a fraction f of the local connections (avoiding self and multiple connections) such that the average number of connections per individual is preserved at n_0 (= 8 in this case). we call f the ''short-cut'' parameter of the network. this is a two-dimensional extension of the algorithm described in watts and strogatz (1998). as pointed out by newman and watts (1999), a problem with these algorithms is the small but finite probability of the existence of isolated sub-networks. we consider only those configurations that are completely connected. for f = 0 we have a regular grid (fig. 1a), whereas f = 1 gives a random network (fig. 1c). in between these extremes, there is a range of f values near 0.01 within which the network exhibits small-world properties (fig. 1b). in this region, most local connections remain intact making the network highly ''clustered'' like the regular grid, with occasional short-cuts that lower the average distance between nodes drastically as in the random network. these properties are illustrated with two quantities, the ''clustering coefficient'' c and the ''average path length'' l (watts, 2003). c denotes the probability that two neighbors of a node are themselves neighbors, and l denotes the average shortest distance between two nodes in the network. the small-world network exhibits the characteristic property of having a high value of c and simultaneously a low value of l (fig. 1d). once the network structure is generated using the algorithm described above, the stochastic sirs dynamics are implemented with the following rules: a susceptible individual gets infected at a rate n_i b, where n_i is the number of infective neighbors and b is the rate of disease transmission across a connected susceptible-infective pair. an infective individual loses infection at a rate g and recovers temporarily. a recovered individual loses its temporary immunity at a rate d and becomes susceptible again. stochasticity arises because the rate of each event specifies a stochastic process with poisson-distributed time intervals between successive occurrences of the event, with a mean interval of (rate)^{-1}. total population size is assumed constant (demography and disease-induced mortality are not considered), and infection propagates from an infective to a susceptible individual only if the two are connected. correlations develop as the result of the local transmission rules and the underlying network structure.
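the construction and simulation rules just described can be sketched in a few lines of python. this is an illustration under stated assumptions rather than the authors' implementation: networkx's connected_watts_strogatz_graph builds a one-dimensional ring analogue of the paper's rewired two-dimensional lattice, the parameter values are arbitrary placeholders, and events are drawn with a simple gillespie-style loop using the rates n_i b, g and d from the text.

```python
# illustrative sketch, not the authors' code; a 1-d watts-strogatz ring with
# 8 neighbours per node stands in for the paper's rewired 2-d lattice.
import random
import networkx as nx

def simulate_sirs(n=2000, k=8, f=0.01, b=0.2, g=0.1, d=0.05,
                  i0_frac=0.005, t_max=500.0, seed=0):
    rng = random.Random(seed)
    graph = nx.connected_watts_strogatz_graph(n, k, f, seed=seed)
    state = {v: "S" for v in graph}
    for v in rng.sample(list(graph), max(1, int(i0_frac * n))):
        state[v] = "I"
    t, series = 0.0, []
    while t < t_max:
        # event rates: infection n_i*b per susceptible, recovery g, loss of immunity d
        rates = {}
        for v in graph:
            if state[v] == "S":
                n_i = sum(state[u] == "I" for u in graph[v])
                if n_i:
                    rates[(v, "infect")] = n_i * b
            elif state[v] == "I":
                rates[(v, "recover")] = g
            else:
                rates[(v, "wane")] = d
        total = sum(rates.values())
        if total == 0:
            break
        t += rng.expovariate(total)            # exponential waiting time, mean 1/total
        x, cum = rng.uniform(0, total), 0.0    # choose the next event proportional to its rate
        for (v, ev), r in rates.items():
            cum += r
            if x <= cum:
                state[v] = {"infect": "I", "recover": "R", "wane": "S"}[ev]
                break
        series.append((t, sum(s == "I" for s in state.values())))
    return series
```

recomputing all rates at every event keeps the sketch short but is inefficient; an event-driven implementation with incremental rate updates would be used for the network sizes reported in the paper.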
therefore, holding b, g and d constant while varying the short-cut parameter f allows us to explore the effects of different network configurations (such as fig. 1a-c) on the epidemic.

analytical considerations. one way to analytically treat the above stochastic system is by using a pair-wise formulation (keeling, 1999), which considers partnerships as the fundamental variables, and incorporates the pair-wise connections into model equations. using the notations of keeling et al. (1997), this formulation gives a set of equations for the dynamics of disease growth in which [s_n], [i_n] and [r_n] denote respectively the number of susceptible, infective and recovered individuals each with exactly n connections, and [s_n i_m] denotes the number of connected susceptible-infective pairs with n and m connections. by writing Σ_n [s_n] = [s] = s, where s is the total number of susceptible individuals, and Σ_n Σ_m [s_n i_m] = Σ_n [s_n i] = [si], where [si] denotes the total number of connected susceptible-infective pairs, we can rewrite the equations for the number of susceptible, infective and recovered individuals as ds/dt = -b [si] + d r, di/dt = b [si] - g i, dr/dt = g i - d r (eq. (1)). even though this set of equations is exact, it is not closed and additional equations are needed to specify the dynamics of the [si] pairs, which in turn depend on the dynamics of triples, etc., in an infinite hierarchy that is usually closed by moment closure approximations. however, a satisfactory closure scheme for a locally clustered network is still lacking (but see keeling et al., 1997; rand, 1999). here we pursue a different avenue to approximate the stochastic system with modified mean-field equations, which consider only the dynamics of mean quantities but replace the standard bilinear term b s i with a nonlinear transmission rate as follows: ds/dt = -b k s^p i^q + d r, di/dt = b k s^p i^q - g i, dr/dt = g i - d r (eq. (2)), where k, p, q are the ''heterogeneity'' parameters (severo, 1969; liu et al., 1986; hethcote and van den driessche, 1991; hochberg, 1991). we call eq. (2) the ''heterogeneous mixing'' (hm) model (maule and filipe, in preparation). we note from eq. (1) that the incidence rate of the epidemic can be estimated by counting the number of connected susceptible-infective pairs [si] in the network. furthermore, [si] is directly related to the correlation c_si that arises between susceptible and infective individuals in the network (keeling, 1999). therefore, comparing eqs. (2) with (1) we see that the hm model implicitly assumes a double power-law relationship between this covariance structure and the abundances of infective and susceptible individuals. for instance, in a homogeneous network (such as a regular grid) with identical number of connections n_0 for all individuals, we have [si] = (n_0/n) c_si s i (eq. (3)), where n = s + i + r is the population size (keeling, 1999). relationships such as eq. (3) provide an important first step towards understanding how the phenomenological parameters k, p and q are related to network structure. for a homogeneous random network in which every individual is connected to exactly n_0 randomly distributed others (see appendix a), the susceptible and infective individuals are uncorrelated and the total number of interacting pairs [si] = (n_0/n) s i. eq. (1) then reduce to ds/dt = -(n_0/n) b s i + d r, di/dt = (n_0/n) b s i - g i, dr/dt = g i - d r (eq. (4)). these equations incorporate the familiar bilinear term s i for the incidence rate, and provide a mean-field approximation for the stochastic system in which each individual randomly mixes with n_0 others. note that the transmission coefficient b is proportionately reduced by a factor n_0/n, which is the fraction of the population in a contact neighborhood of each individual.
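as a concrete illustration of the modified mean-field (hm) equations written out above, the following sketch integrates eq. (2) with scipy. the parameter values are arbitrary placeholders, and the incidence term b*k*s**p*i**q simply follows the nonlinear form discussed in the text; this is not the authors' code.

```python
# illustrative sketch of the hm model (eq. (2)); parameter values are placeholders.
import numpy as np
from scipy.integrate import solve_ivp

def hm_rhs(t, y, b, k, p, q, g, d):
    s, i, r = y
    incidence = b * k * (s ** p) * (i ** q)   # nonlinear transmission term
    return [-incidence + d * r,               # ds/dt
            incidence - g * i,                # di/dt
            g * i - d * r]                    # dr/dt

def run_hm(n=160000, i0_frac=0.005, b=0.2, k=1.0, p=0.8, q=1.0,
           g=0.1, d=0.05, t_max=2000.0):
    i0 = i0_frac * n
    y0 = [n - i0, i0, 0.0]
    return solve_ivp(hm_rhs, (0.0, t_max), y0, args=(b, k, p, q, g, d),
                     dense_output=True, rtol=1e-8)
```

setting k = n_0/n and p = q = 1 recovers the bilinear mean-field model of eq. (4) within the same function.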
in a completely well-mixed population, n_0 = n, and these equations reduce to the standard kermack-mckendrick form (kermack and mckendrick, 1927). eq. (4) exhibit either a disease-free equilibrium, i(t) = 0, or an endemic equilibrium, i(t) = [d n/(g + d)][1 - g/(b n_0)], depending on whether the basic reproductive ratio r_0 = n_0 b/g is less or greater than unity. it is to be noted that while eq. (4) describe a homogeneous random network exactly, it provides only an approximation for the random network with f = 1, in which individuals have a binomially distributed number of connections around a mean n_0 (appendix a). details of the implementation of the stochastic system are described in appendix b. one practical approach to estimate the parameters k, p and q of the hm model, when the individual-level processes are unknown, would be to fit these parameters using time series data (gubbins and gilligan, 1997; bjørnstad et al., 2002; finkenstädt et al., 2002). indeed, with a sufficient number of parameters a satisfactory agreement between the model and the data is almost always possible. a direct fit of time series, however, will not tell us whether the disease transmission rate is well approximated by the functional form k s^p i^q of the model. we instead fit the parameters k, p, q to the transmission rate ''observed'' in the output of the stochastic simulation. specifically, we obtain least-squared estimates of k, p, q by fitting the term k s^p i^q to the computed number of pairs [si] that gives the disease transmission rate of the stochastic system (see eq. (1)). we then incorporate these estimates in eq. (2), and compare the infective time series produced by this hm model to that generated by the original stochastic network simulation. in this way, we can address whether the transmission rate is well captured by the modified functional form, and if that is the case, whether the hm model approximates successfully the aggregated dynamics of the stochastic system. we compare the stochastic simulation with the predictions of three sets of model equations, representing different degrees of approximation of the system. besides the hm model described above, we consider the bilinear mean-field model given by eq. (4), which assumes k = n_0/n and p = q = 1. this comparison demonstrates the inadequacy of the well-mixed assumption built into the bilinear formulation. we also discuss a restricted hm model with an incidence function of the form (n_0/n) s^{p_r} i^{q_r} in eq. (2), with only two heterogeneity parameters p_r and q_r, as originally proposed by severo (1969) and studied by liu et al. (1986), hethcote and van den driessche (1991) and hochberg (1991). the stochastic sirs dynamics are capable of exhibiting a diverse array of dynamical behaviors, determined by both the epidemic parameters b, g, d and the network short-cut parameter f. we choose the following three scenarios: (1) stable equilibrium: infection levels in the population reach a stable equilibrium relatively rapidly after a short transient (fig. 3a); (2) noisy endemic state: infection levels exhibit stochastic fluctuations around an endemic state following a long transient of decaying cycles (fig. 3b); (3) persistent cycles: fluctuations with a well defined period and enhanced amplitude persist in the small-world region near f = 0.01 (fig. 3b). the reason for choosing these different temporal patterns is to test the hm model against a wider range of behaviors of the stochastic system.
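the least-squared estimation of k, p and q described above can be approximated by linear regression after a log transform, since log [si] ≈ log k + p log s + q log i. the sketch below is one way to do this; it assumes the simulation output provides time series of s(t), i(t) and the counted [si](t), and it is not necessarily the exact fitting procedure used by the authors.

```python
# illustrative sketch: fit log[si] = log k + p*log s + q*log i by ordinary least squares.
import numpy as np

def fit_heterogeneity_parameters(s, i, si_pairs, eps=1e-12):
    """s, i, si_pairs are 1-d arrays from the stochastic simulation;
    points where any quantity is zero are dropped before the log transform."""
    s, i, si_pairs = map(np.asarray, (s, i, si_pairs))
    mask = (s > 0) & (i > 0) & (si_pairs > 0)
    x = np.column_stack([np.ones(mask.sum()),
                         np.log(s[mask] + eps),
                         np.log(i[mask] + eps)])
    y = np.log(si_pairs[mask] + eps)
    coef, *_ = np.linalg.lstsq(x, y, rcond=None)
    k, p, q = np.exp(coef[0]), coef[1], coef[2]
    return k, p, q
```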
the oscillatory case has epidemiological significance because of the observed pervasiveness of cyclic disease patterns (cliff and haggett, 1984; grenfell and harwood, 1997; pascual et al., 2000). fig. 3a presents simulation examples of the epidemic time series for three values of the short-cut parameter f, representing the regular grid (f = 0), the small-world network (f = 0.01) and the random network (f = 1). the transient pattern depends strongly on f: a high degree of local clustering in a regular grid slows the initial buildup of the epidemic, whereas in a random network with negligible clustering (fig. 1d) the disease grows relatively fast. the transient for the small-world network lies in between these two extremes. by contrast, the stable equilibrium level of the infection remains insensitive to f, implying that the equilibrium should be well predicted by the bilinear mean-field approximation (eq. (4)) itself. least-squared estimates of the two sets of heterogeneity parameters [k, p, q] and [k_r = n_0/n, p_r, q_r], for the full and restricted versions of the hm model respectively, are obtained for a series of f values corresponding to different network configurations, as described in section 3. the disease parameters b, g and d are kept fixed throughout, making the epidemic processes operate at the same rates across different networks, so that the effects of the network structure on the dynamics can be studied independently. transient patterns, however, present a different picture. the mean-field trajectory deviates the most from the stochastic simulation for the regular grid (f = 0), and the least for the random network (f = 1). the full hm model with its three parameters k, p and q, on the other hand, demonstrates an excellent agreement with the stochastic transients for all values of f. by comparison, the transient patterns of the restricted hm model with only two fitting parameters p_r and q_r differ significantly for low values of f (fig. 4a and b). the poor agreement of the restricted hm and the mean-field transients with the stochastic data for a clustered network (low f) is due to the failure of their respective incidence functions to fit the transmission rate of the stochastic system (fig. 2a). on the other hand, the random network has negligible clustering, and the interaction between susceptible and infective individuals is sufficiently well mixed for the restricted hm model to provide as good an approximation of the stochastic transient as the full hm model (fig. 4c). the estimates [k, p, q] = [0.0001, 0.94, 0.97] and [n_0/n, p_r, q_r] = [0.00005, 0.99, 1] for these two models are also quite similar. the discrepancy for the mean-field transients (fig. 4c) is due to the fact that the mean-field model gives only an approximate description of the random network with f = 1 as noted before. at the other extreme, for a regular grid the estimates of the full and restricted hm models are [k, p, q] = [1.66, 0.3, 0.69] and [n_0/n, p_r, q_r] = [0.00005, 0.84, 1.13], which differ considerably from each other. fig. 5a and b demonstrate how the parameters k, p and q of the full hm model depend on the short-cut parameter f. all three of them approach their respective well-mixed values (k = 0.00005, p = q = 1) as f → 1, and they deviate the most as f → 0, in accord with the earlier discussion.
in particular, k is significantly higher, and likewise p and q are lower, for the regular grid than a well-mixed system, implying a strong nonlinearity of the transmission mechanism in a clustered network. such a large value of k can be understood within the context of the incidence function k s^p i^q, and explains why only two parameters, p_r and q_r in the restricted hm model, cannot absorb the contribution of k. in a homogeneous random network with n_0 connections per individual, the term (n_0/n) s i gives the expected total number of pairs [si] that govern disease transmission in the network, and the exponent values p = q = 1 indicate random mixing (of susceptible and infective individuals). by contrast, local interactions in a clustered network lower the availability of susceptible individuals (infected neighbors of an infective individual act as a barrier to disease spread), resulting in a depressed value of the exponent p significantly below 1. this nonlinear effect, combined with a low initial infective number i_0 (0.5% of the total population randomly infected), requires k in the hm model to be large enough to match the disease transmission in the network. indeed, as table 1 demonstrates, both k and p are quite sensitive to i_0 for a regular grid, unlike the other exponent q that does not depend on initial conditions. increasing i_0 facilitates disease growth by distributing infective and susceptible individuals more evenly, which causes an increase of the value of p and a compensatory reduction of k. an interesting pattern in fig. 5a and b is that the values of the heterogeneity parameters remain fairly constant initially for low f, in particular within the interval 0 ≤ f < 0.01 for the exponents p and q (the range is somewhat shorter for k), and then start approaching respective mean-field values as f increases to 1. this pattern of variation is reminiscent of the plot for the clustering coefficient c shown in fig. 1d, and suggests that the clustering of the network, rather than its average path length l, influences disease transmission strongly. a measure of the accuracy of the approximation can be defined by an error function erf, computed as a mean of the point-by-point deviation of the infective time series i_m(t) predicted from the models, relative to the stochastic simulation data i_s(t), over the length t of the transient (the equilibrium values of the models coincide with the simulation, see fig. 4): erf = (100/t) Σ_t |i_m(t) - i_s(t)|/i_s(t) (eq. (5)); multiplication by 100 expresses erf as a percentage of the simulation time series. fig. 5c shows erf as a function of f for the three models. the total failure of the mean-field approximation to predict the stochastic transients is evident in the large magnitudes of error (it is 25% even for the random networks). by contrast, the excellent agreement of the full hm model for all f results in a low error throughout. on the other hand, the restricted version of the hm model gives over 30% error for low f whereas it is negligible for high f. interestingly, erf for the restricted hm and mean-field models show similar patterns of variation with f as in fig. 5b, staying relatively constant within 0 ≤ f < 0.01 and then decreasing relatively fast. local clustering in a network with low f causes disease transmission to deviate from a well-mixed approximation, and thus influences the pattern of erf for these simpler models.
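a small sketch of the percentage error measure erf as reconstructed above, i.e. the mean relative point-by-point deviation of a model trajectory from the simulated one over the transient; it assumes both series are sampled at the same time points and is only an illustration of the definition.

```python
# illustrative sketch of the transient error measure erf (in percent).
import numpy as np

def erf_percent(i_model, i_sim):
    i_model, i_sim = np.asarray(i_model, float), np.asarray(i_sim, float)
    mask = i_sim > 0                      # avoid division by zero
    rel_dev = np.abs(i_model[mask] - i_sim[mask]) / i_sim[mask]
    return 100.0 * rel_dev.mean()
```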
the second type of dynamical behavior of the stochastic system exhibits a relatively long oscillatory transient that settles onto a noisy endemic state for most values of f, near 0 as well as above (fig. 3b). stochastic fluctuations are stronger for f = 0 than f = 1. however, in a significant exception the cycles tend to persist with a considerably increased amplitude and well defined period for a narrow range of f near 0.01, precisely where the small-world behavior arises in the network. such persistent cycles are not predicted by the homogeneous epidemic dynamics given by eq. (4), and are therefore a consequence of the correlations generated by the contact structure. to our knowledge such a nonmonotonic pattern for the amplitude of the cycles with the network parameter f has not been observed before (see section 5 for a comparison of these results with those of other studies). we estimate two quantities, the ''coefficient of variation'' (cv) and the ''degree of coherence'' (dc), which determine respectively the strength and periodicity of the cycles for different values of f. cv has the usual definition as the ratio of standard deviation to mean (eq. (6)): the numerator denotes the standard deviation of the infective time series of length t_s (in the stationary state excluding transients), and the denominator denotes its mean over the same time length t_s. fig. 6a exhibits a characteristic peak for cv near f = 0.01, demonstrating a maximization of the cycle amplitudes in the small-world region compared to both the high and low values of f. the plot also shows that the fluctuations at the left side tail of the peak are stronger than its right side tail. consistent with this pattern, sustained fluctuations in a stochastic sirs model on a spatial grid (f = 0) were also shown by johansen (1996). by contrast, the low variability in the random network (f → 1) is due to the fact that the corresponding mean-field model (eq. (4)) does not have oscillatory solutions. dc provides a measure of the sharpness of the dominant peak in the fourier power spectrum of the infective time series, and is defined (eq. (7)) in terms of h_max, v_max and dv, which are the peak height, peak frequency and the width at one-tenth maximum, respectively, of a gaussian fit to the dominant peak. the sharp nature of the peak, particularly for the small-world network, makes it unfeasible to use the standard ''width at half-maximum'' (gang et al., 1993; lago-fernández et al., 2000), which is often zero here. the modified implementation in eq. (7) therefore considerably underestimates the sharpness of the dominant peak. (fig. 6: the coefficient of variation, cv (eq. (6)), and the degree of coherence, dc (eq. (7)), are plotted against f in (a) and (b), respectively (see text for definitions); each point in (b) represents estimates using a fourier power spectrum averaged over 10 independent realizations of the stochastic simulations.) even then, fig. 6b depicts a fairly narrow maximum for dc near f = 0.01, indicating that the cycles within the small-world region have a well-defined period. the low value of dc for f = 0 implies that the fluctuations in the regular grid are stochastic in nature. a likely scenario for the origin of these persistent cycles is as follows. stochastic fluctuations are locally maintained in a regular grid by the propagating fronts of spatially distributed infective individuals, but they are out of phase across the network.
the infective individuals are spatially correlated over a length j ∝ d^{-1} in the grid (johansen, 1996), which typically has a far shorter magnitude than the linear extent of the grid used here (increasing d reduces the correlation length j further, which weakens these fluctuations and gives the stable endemic state observed for instance in fig. 3a). the addition of a small number of short-cuts in a small-world network (fig. 1b) couples together a few of these local fronts, thereby effectively increasing the correlation length to the order of the system size and creating a globally coherent periodic response. as more short-cuts are added, the network soon acquires a sufficiently random configuration and the homogeneous dynamics become dominant. another important point to note in fig. 3b is that, in contrast to fig. 3a, the mean infection level i of the cycles is not independent of f: i now increases slowly with f. an immediate implication of this observation is that, unlike the earlier case of a stable equilibrium, the bilinear mean-field model of eq. (4) will no longer be able to accurately predict the mean infection for all f. fig. 7 shows the same examples of the stochastic time series as in fig. 3b, along with the solutions of the three models. as expected, the mean-field time series fails to predict the mean infection level at varying degrees in all three cases, deviating most for the regular grid (f = 0) and least for the random network (f = 1). by comparison, the equilibrium solutions of the full and restricted versions of the hm model both demonstrate good agreement with the mean infection level of the stochastic system. for the transient patterns, the two hm models exhibit similar decaying cycles of roughly the same period, and also of the same transient length, as the stochastic time series but they occur at a different phase. even though the transient cycles of the hm models persist the longest for f = 0.01, they eventually decay onto a stable equilibrium and thus fail to predict the persistent oscillations of the small-world network. the mean-field model, on the other hand, shows damped cycles of much shorter duration and hence is a poor predictor overall. the close agreement of the two hm time series with each other for f = 0.01 (fig. 7b) is due to the fact that the least-squared estimate of k for the full hm model is 0.00005, equal to n_0/n of the restricted hm, and the exponents p, q likewise reduce to p_r, q_r. within the entire range 0 ≤ f ≤ 1, the estimates of p and p_r for the full and restricted hm models lie between [0.68, 1.08] and [0.82, 0.93], respectively, whereas q and q_r both stay close to a value of 1.1. it is interesting to note here that a necessary condition for limit cycles in an sirs dynamics with a nonlinear incidence rate s^p i^q is q > 1 (liu et al., 1986), which both the full and restricted hm models appear to satisfy. one possible reason then for their failure to reproduce the cycles in the small-world region is the overall complexity of the stochastic time series, which results from nontrivial correlation patterns present in the susceptible-infective dynamics. the three-parameter incidence function k s^p i^q of the full hm model may not have sufficient flexibility to adequately fit the cyclic incidence pattern of the stochastic system.
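the two cycle diagnostics defined above can be computed roughly as follows. this is a sketch, not the paper's code; in particular, the combination returned for dc (peak height times the ratio of peak frequency to the width at one-tenth maximum of a gaussian fit) is an assumed reading of the quantities h_max, v_max and dv listed in the text, not necessarily eq. (7) verbatim.

```python
# illustrative sketch of the cycle diagnostics; the combination used for dc is an
# assumption (peak height times quality factor), not necessarily eq. (7) verbatim.
import numpy as np
from scipy.optimize import curve_fit

def coefficient_of_variation(i_series):
    i_series = np.asarray(i_series, float)
    return i_series.std() / i_series.mean()

def degree_of_coherence(i_series, dt=1.0):
    i_series = np.asarray(i_series, float)
    power = np.abs(np.fft.rfft(i_series - i_series.mean())) ** 2
    freqs = np.fft.rfftfreq(i_series.size, d=dt)
    k = np.argmax(power[1:]) + 1                      # dominant non-zero-frequency peak

    def gauss(f, h, f0, sigma):                       # gaussian fit around the dominant peak
        return h * np.exp(-0.5 * ((f - f0) / sigma) ** 2)

    lo, hi = max(1, k - 10), min(freqs.size, k + 10)
    p0 = [power[k], freqs[k], max(freqs[1], freqs[hi - 1] - freqs[lo])]
    (h, f0, sigma), _ = curve_fit(gauss, freqs[lo:hi], power[lo:hi], p0=p0, maxfev=10000)
    width_tenth_max = 2.0 * sigma * np.sqrt(2.0 * np.log(10.0))  # width at one-tenth maximum
    return h * f0 / width_tenth_max
```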
we emphasize here that if the cycles are not generated intrinsically, but are driven by an external variable such as a periodic environmental forcing, the outcome is well predicted when an appropriate forcing term is included in eq. (2) (results not shown). as a final note, all of the above observations for both stable equilibria and cyclic epidemic patterns have been qualitatively validated for multiple sets of values of the disease parameters b, g and d.

stochastic sirs dynamics implemented on a network with varying degrees of local clustering can generate a rich spectrum of behaviors, including stable and noisy endemic equilibria as well as decaying and persistent cycles. persistent cycles arise in our system even though the homogeneous mean-field dynamics do not have oscillatory solutions, thereby revealing an interesting interplay of network structure and disease dynamics (also see rand, 1999). our results demonstrate that a three-variable epidemic model with a nonlinear incidence function k s^p i^q, consisting of three ''heterogeneity'' parameters [k, p, q], is capable of predicting the disease transmission patterns, including the transient and stable equilibrium prevalence, in a clustered network. the relatively simpler (and more standard) form s^{p_r} i^{q_r} with two parameters [p_r, q_r] falls short in this regard. this restrictive model, however, is an adequate predictor of the dynamics in a random network, for which the bilinear mean-field approximation cannot explain the transient pattern. interestingly, even the function k s^p i^q cannot capture the complex dynamics of persistent cycles in a small-world network that has simultaneously high local clustering and long-distance connectivity. it is worth noting, however, that such persistent cycles appear within a small region of the parameter space for f, and therefore the hm model appears to provide a reasonable approximation for most cases of clustered as well as randomized networks. an implication of these findings is that an approximate relationship is established early on in the transients, lasting all the way to equilibrium, between the covariance structure of the [si] pairs and the global (mean) quantities s and i. this relationship is given by a double power law of the number of susceptible and infective individuals. it allows the closure of the equations for mean quantities, making it possible to approximate the stochastic dynamics with a simple model (hm) that mimics the basic formulation of the mean-field equations but with modified functional forms. it reveals an interesting scaling pattern from individual to population dynamics, governed by the underlying contact structure of the network. in lattice models for antagonistic interactions, which bear a strong similarity to our stochastic disease system, a number of power-law scalings have been described for the geometry of the clusters (pascual et al., 2002). it is an open question whether the exponents for the dynamic scaling (i.e., parameters p and q here) can be derived from such geometrical properties. it also needs to be determined under what conditions power-law relationships will hold between local structure and global quantities. the failure of the hm model to generate persistent cycles may result from an inappropriate choice of the incidence function k s^p i^q. it remains to be seen if there exists a different functional form that better fits the incidence rate of the stochastic system and is capable of predicting the variability in the data.
it is also not known whether a moment closure method including the explicit dynamics of the covariance terms themselves (pacala and levin, 1997; keeling, 1999) can provide a good approximation to the mean infection level in a network with high degree of local clustering. of course, heterogeneities in space or in contact structure are not the only factors contributing to the nonlinearity in the transmission function s^p i^q; a number of other biological mechanisms of transmission can lead to such functional forms. by rewriting b k s^p i^q as [b k s^{p-1} i^{q-1}] s i = b*(s, i) s i, where b*(s, i) now represents a density-dependent transmission efficiency in the bilinear (homogeneous) incidence framework, one can relate b* to a variety of density-dependent processes such as those involving vector-borne transmission, or threshold virus loads, etc. (liu et al., 1986). interestingly, it has been suggested that in such cases cyclic dynamics are likely to be stabilized, rather than amplified, by nonlinear transmission (hochberg, 1991). it appears then that network structure can contribute to the cyclic behavior of diseases with relatively simple transmission dynamics. it is interesting to consider the persistent cycles we have discussed here in light of other studies on fluctuations in networks. on one side, cycles have been described for random networks with f = 1 because the corresponding well-mixed dynamics also have oscillatory solutions (lago-fernández et al., 2000; kuperman and abramson, 2001). at the opposite extreme, johansen (1996) reported persistent fluctuations in a stochastic sirs model on a regular grid (f = 0), strictly generated by the local clustering of the grid since the mean-field equations do not permit cycles. recent work by verdasca et al. (2005) extends johansen's observation by showing that fluctuations do occur in clustered networks from a regular grid to the small-world configuration. they describe a percolation-type transition across the small-world region, implying that the fluctuations fall off sharply within this narrow interval. this observation is in significant contrast to our results, where the amplitudes of the cycles are maximized by the small-world configuration, and therefore require both local clustering and some degree of randomization. one difference between the two models is that verdasca et al. (2005) use a discrete time step for the recovery of infected individuals, while in our event-driven model, time is continuous and the recovery time is exponentially distributed. a more systematic study of parameter space for these models is warranted. we should also mention that there are other ways to generate a clustered network than a small-world algorithm. for example, keeling (2005) described a method that starts with a number of randomly placed focal points in a two-dimensional square, and draws a proportion of them towards their nearest focal point to generate local clusters. network building can also be attempted from the available data on selective social mixing (morris, 1995). the advantage of our small-world algorithm is that besides being simple to implement, it is also one of the best studied networks (watts, 2003). this algorithm generates a continuum of configurations from a regular grid to a random network, and many real systems have an underlying regular spatial structure, as in the case of hantavirus of wild rats within the city blocks of baltimore (childs et al., 1988).
moreover, emergent diseases like the recent outbreak of severe acute respiratory syndrome (sars) have been studied by modeling human contact patterns using small-world networks (masuda et al., 2004; verdasca et al., 2005). the network considered here remains static in time. while this assumption is reasonable when disease spreads rapidly relative to changes of the network itself, there are many instances where the contact structure would vary over comparable time scales. examples include group dynamics in wildlife resulting from schooling or spatial aggregation, as well as territorial behavior. dynamic network structure involves processes such as migration among groups that establishes new connections and destroys existing ones, but also demographic processes such as birth and death as well as disease-induced mortality. another topic of current interest is the effect of predation on disease growth, which splices together predator-prey and host-pathogen dynamics in which the prey is an epidemic carrier (ostfeld and holt, 2004). simple dynamics assuming the homogeneous mixing of prey and predators makes interesting predictions about the harmful effect of predator control in aggravating disease prevalence with potential spill-over effects on humans (packer et al., 2003; ostfeld and holt, 2004). it remains to be seen if these conclusions hold under an explicit modeling framework that binds together the social dynamics of both prey and predator. more generally, future work should address whether modified mean-field models provide accurate simplifications for stochastic disease models on dynamic networks. so far the work presented here for static networks provides support for the empirical application of these simpler models to time series data.
references:
statistical mechanics of complex networks
infectious diseases of humans: dynamics and control
the mathematical theory of infectious diseases
dynamics of measles epidemics: estimating scaling of transmission rates using a time series sir model
analytic models for the patchy spread of plant disease
space, persistence and dynamics of measles epidemics
the effects of disease dispersal and host clustering on the epidemic threshold in plants
the ecology and epizootiology of hantaviral infections in small mammal communities of baltimore: a review and synthesis
island epidemics
evolution of networks
modelling dynamic and network heterogeneities in the spread of sexually transmitted disease
a stochastic model for extinction and recurrence of epidemics: estimation and inference for measles outbreaks
metapopulation dynamics as a contact process on a graph
stochastic resonance without external periodic force
(meta) population dynamics of infectious diseases
a test of heterogeneous mixing as a mechanism for ecological persistence in a disturbed environment
some epidemiological models with nonlinear incidence
non-linear transmission rates and the dynamics of infectious disease
a simple model of recurrent epidemics
correlation models for childhood epidemics
the effects of local spatial structure on epidemiological invasions
the implications of network structure for epidemic dynamics
a contribution to the mathematical theory of epidemics
disentangling extrinsic from intrinsic factors in disease dynamics: a nonlinear time series approach with an application to cholera
modeling infection transmission
small world effect in an epidemiological model
fast response and temporal coherent oscillations in small-world networks
influence of nonlinear incidence rates upon the behavior of sirs epidemiological models
dynamical behavior of epidemiological models with non-linear incidence rate
how should pathogen transmission be modelled?
relating heterogeneous mixing models to spatial processes in disease epidemics
transmission of severe acute respiratory syndrome in dynamical small-world networks
data driven network models for the spread of disease
mathematical biology
the spread of epidemic disease on networks
scaling and percolation in the small-world network model
are predators good for your health? evaluating evidence for top-down regulation of zoonotic disease reservoirs
biologically generated spatial pattern and the coexistence of competing species
keeping the herds healthy and alert: impacts of predation upon prey with specialist pathogens
cholera dynamics and el niño-southern oscillation
simple temporal models for ecological systems with complex spatial patterns
epidemic spreading in scale-free networks
correlation equations and pair approximations for spatial ecologies
persistence and dynamics in lattice models of epidemic spread
percolation on heterogeneous networks as a model for epidemics
generalizations of some stochastic epidemic models
the impacts of network topology on disease spread
ecological theory to enhance infectious disease control and public health policy
contact networks and the evolution of virulence
recurrent epidemics in small world networks
small worlds
collective dynamics of small-world networks

we thank juan aparicio for valuable comments about the work, and ben bolker and an anonymous reviewer for useful suggestions on the manuscript. this research was supported by a centennial fellowship of the james s. mcdonnell foundation to m.p.
appendix a. it is important to distinguish among the different types of random networks that are used frequently in the literature. one is the random network with f = 1 that is generated using the small-world algorithm as described in section 2 (fig. 1c), which has a total of n n_0/2 distinct connections, where n_0 is the original neighborhood size (= 8 here) in the regular grid and n is the size of the network. each individual in this random network has a binomially distributed number of contacts around a mean n_0. there is also the homogeneous random network discussed in relation to the mean-field eq. (4), which by definition has fixed n_0 random contacts per individual (keeling, 1999). these two networks are, however, different from the random network of erdős and rényi (albert and barabási, 2002), generated by randomly creating connections with a probability p among all pairs of individuals in a population. the expected number of distinct connections in the population is then p n(n - 1)/2, and each individual has a binomially distributed number of connections with mean p(n - 1). for moderate values of p and large population sizes, the erdős-rényi network is much more densely connected than the first two types. all three of them, however, have negligible clustering c and path length l, since the individuals do not retain any local connections (all connections are short-cuts). appendix b. an appropriate network is constructed with a given f, and the stochastic sirs dynamics are implemented on this network using the rules described in section 2. for the initial conditions, we start with a random distribution of a small number of infective individuals, only 0.5% of the total population (= 0.005 n) unless otherwise stated, in a pool of susceptible individuals. all generated time series used for least-squared fitting of the transmission rate have a length of 20,000 time units. the structure of the network remains fixed during the entire stochastic run. stochastic simulations were carried out with a series of network sizes ranging from n = 10^4 to 10^6. the results presented here are those for n = 160,000 and are representative of other sizes. the values for the epidemic rate parameters b, g and d are chosen so that the disease successfully establishes in the population (a finite fraction of the population remains infected at all times).

key: cord-327651-yzwsqlb2 authors: ray, bisakha; ghedin, elodie; chunara, rumi title: network inference from multimodal data: a review of approaches from infectious disease transmission date: 2016-09-06 journal: j biomed inform doi: 10.1016/j.jbi.2016.09.004 sha: doc_id: 327651 cord_uid: yzwsqlb2 network inference problems are commonly found in multiple biomedical subfields such as genomics, metagenomics, neuroscience, and epidemiology. networks are useful for representing a wide range of complex interactions ranging from those between molecular biomarkers, neurons, and microbial communities, to those found in human or animal populations. recent technological advances have resulted in an increasing amount of healthcare data in multiple modalities, increasing the preponderance of network inference problems. multi-domain data can now be used to improve the robustness and reliability of recovered networks from unimodal data. for infectious diseases in particular, there is a body of knowledge that has been focused on combining multiple pieces of linked information.
combining or analyzing disparate modalities in concert has demonstrated greater insight into disease transmission than could be obtained from any single modality in isolation. this has been particularly helpful in understanding incidence and transmission at early stages of infections that have pandemic potential. novel pieces of linked information in the form of spatial, temporal, and other covariates including high-throughput sequence data, clinical visits, social network information, pharmaceutical prescriptions, and clinical symptoms (reported as free-text data) also encourage further investigation of these methods. the purpose of this review is to provide an in-depth analysis of multimodal infectious disease transmission network inference methods with a specific focus on bayesian inference. we focus on analytical bayesian inference-based methods as this enables recovering multiple parameters simultaneously, for example, not just the disease transmission network, but also parameters of epidemic dynamics. our review studies their assumptions, key inference parameters and limitations, and ultimately provides insights about improving future network inference methods in multiple applications. dynamical systems and their interactions are common across many areas of systems biology, neuroscience, healthcare, and medicine. identifying these interactions is important because they can broaden our understanding of problems ranging from regulatory interactions in biomarkers, to functional connectivity in neurons, to how infectious agents transmit and cause disease in large populations. several methods have been developed to reverse engineer or, identify cause and effect pathways of target variables in these interaction networks from observational data [1] [2] [3] . in genomics, regulatory interactions such as disease phenotype-genotype pairs can be identified by network reverse engineering [1, 4] . molecular biomarkers or key drivers identified can then be used as targets for therapeutic drugs and directly benefit patient outcomes. in microbiome studies, network inference is utilized to uncover associations amongst microbes and between microbes and ecosystems or hosts [2, 5, 6] . this can include insights about taxa associations, phylogeny, and evolution of ecosystems. in neuroscience, there is an effort towards recovering brain-connectivity networks from functional magnetic resonance imaging (fmri) and calcium fluorescence time series data [3, 7] . identifying structural or functional neuronal pairs illuminates understanding of the structure of the brain, can help better understand animal and human intelligence, and inform treatment of neuronal diseases. infectious disease transmission networks are widely studied in public health. understanding disease transmission in large populations is an important modeling challenge because a better understanding of transmission can help predict who will be affected, and where or when they will be. network interactions can be further refined by considering multiple circulating pathogenic strains in a population along with strain-specific interventions, such as during influenza and cold seasons. thus, network interactions can be used to inform interventional measures in the form of antiviral drugs, vaccinations, quarantine, prophylactic drugs, and workplace or school closings to contain infections in affected areas [8] [9] [10] [11] . 
developing robust network inference methods to accurately and coherently map interactions is, therefore, fundamentally important and useful for several biomedical fields. as summarized in fig. 1, many methods have been used to identify pairwise interactions in genomics, neuroscience [12,13] and microbiome research [14], including correlation and information gain-based metrics for association, inverse covariance for conditional independence testing, and granger causality for causation from temporal data. further, multimodal data integration methods such as horizontal integration, model-based integration, kernel-based integration, and non-negative matrix factorization have been used to combine information from multiple modalities of 'omics' data such as gene expression, protein expression, somatic mutations, and dna methylation with demographic, diagnoses, and phenotypical clinical data. bayesian inference has been used to analyze changes in gene expression from microarray data as dna measurements can have several unmeasured confounders and thereby incorporate noise and uncertainty [15]. multi-modal integration can be used for classification tasks, to predict clinical phenotypes such as tumor stage or lymph node status, for clustering of patients into subgroups, and to identify important regulatory modules [16-20]. in neuroscience, not just data integration, but multimodal data fusion has been performed by various methods such as linear regression, structural equation modeling, independent component analysis, principal component analysis, and partial least squares [21]. multiple modalities such as fmri, electroencephalography, and diffusion tensor imaging (dti) have been jointly analyzed to uncover more details than could be captured by a single imaging technique [21]. in metagenomics, network inference from microbial data has been performed using methods such as inverse covariance and correlation [2]. in evolutionary biology, the massive generation of molecular data has enabled bayesian inference of phylogenetic trees using markov chain monte carlo (mcmc) techniques [22,23]. in infectious disease transmission network inference, bayesian inference frameworks have been primarily used to integrate data such as dates of pathogen sample collection and symptom report date, pathogen genome sequences, and locations of patients [24-26]. this problem remains challenging as the data generative processes and scales of heterogeneous modalities may be widely different, transformations applied to separate modalities may not preserve the interactions between modalities, and separately integrated models may not capture interaction effects between modalities [27].
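as one concrete example of the pairwise approaches mentioned above (conditional independence via inverse covariance), the following sketch derives a partial-correlation network from a samples-by-features data matrix. it is an illustration only: the threshold is arbitrary, the pseudo-inverse is used for numerical stability, and regularized estimators such as the graphical lasso would normally be preferred for high-dimensional data.

```python
# illustrative sketch: partial-correlation (inverse covariance) network from a data matrix.
import numpy as np

def partial_correlation_network(x, threshold=0.2):
    """x: (n_samples, n_features) array; returns partial correlations and a boolean adjacency."""
    x = np.asarray(x, float)
    cov = np.cov(x, rowvar=False)
    precision = np.linalg.pinv(cov)                  # pseudo-inverse for stability
    d = np.sqrt(np.outer(np.diag(precision), np.diag(precision)))
    partial_corr = -precision / d                    # rho_ij = -theta_ij / sqrt(theta_ii * theta_jj)
    np.fill_diagonal(partial_corr, 1.0)
    adjacency = np.abs(partial_corr) > threshold
    np.fill_diagonal(adjacency, False)
    return partial_corr, adjacency
```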
in the era of precision medicine, it becomes especially important to combine clinical information with biomarker and environmental information to recover complex genotype-phenotype maps [30] [31] [32] [33] . infectious disease networks are one area where the need to bring together data types has long been recognized, specifically to better understand disease transmission. data sources including high-throughput sequencing technologies have enabled genomic data to become more cost effective, offering support for studying transmission by revealing pathways of pathogen introduction and evolution in a population. yet, genomic data in isolation is insufficient to obtain a comprehensive picture of disease in the population. while these data can provide information about pathogen evolution, genetic diversity, and molecular interaction, they do not capture other environmental, spatial, and clinical factors that can affect transmission. for infectious disease surveillance, this information is usually conveyed through epidemiological data, which can be collected in various ways such as in clinical settings from the medical record, or in more recent efforts through web search logs, or participatory surveillance. participatory surveillance data types typically include age, sex, date of symptom onset, and diagnostic information such as severity of symptoms. in clinical settings, epidemiological data are generally collected from patients reporting illness. this can include, for example, age at diagnosis, sex, race, family history, diagnostic information such as severity of symptoms, and phenotypical information such as presence or absence of disease which may not be standardized. highthroughput sequencing of pathogen genomes, along with linked spatial and temporal information, can advance surveillance by increasing granularity and leading to a better understanding of the spread of an infectious disease [37] . considerable efforts have been made to unify genomic and epidemiologic information from traditional clinical forms into singular statistical frameworks to refine understanding of disease transmission [24] [25] [26] [34] [35] [36] . one approach to design and improve disease transmission models has been to analytically combine multiple, individually weak predictive signals in the form of sparse epidemiological, spatial, pathogen genomic, and temporal data [24, 25, 34, 35, 38] . molecular epidemiology is the evolving field wherein the above data types are considered together; epidemiological models are used in concert with pathogen phylogeny and immunodynamics to uncover disease transmission patterns [39] . pathogen genomic data can capture within-host pathogen diversity (the product of effective population size in a generation and the average pathogen replication time [25, 26] ) and dynamics or provide information critical to understanding disease transmission such as evidence of new transmission pathways that cannot be inferred from epidemiological data alone [40, 41] . in addition, the remaining possibilities can then be examined using any available epidemiological data. as molecular epidemiology and infectious disease transmission are areas in which network inference methods have been developed for bringing together multimodal data we use this review to investigate the foundational work in this specific field. a summary of data types, relevant questions and purpose of such studies is summarized in fig. 2 , and we further articulate the approaches below. 
in molecular epidemiology, several approaches have been used to overlay pathogen genomic information on traditionally collected epidemiologic information to recover transmission networks. additional modeling structure is needed in these problems because infectious disease transmission occurs through contact networks of heterogeneous individuals, which may not be captured by compartmental models such as susceptible-infec tious-recovered (sir) and susceptible-latent-infectious-recov ered (slir) models [42] . as well, for increased utility in epidemiology, there is a necessity to estimate epidemic parameters in addition to the transmission network. unlike other fields wherein recovery of just the topology of the networks is desired, in molecular epidemiology bayesian inference is commonly used to reverse engineer infectious disease transmission networks in addition to estimating epidemic parameters (fig. 2 ). while precise features can be extracted from observed data, there are latent variables not directly measured which must simultaneously be considered to provide a complete picture. thus, bayesian inference methods have been used to simultaneously infer epidemic parameters and structure of the transmission network in a single framework. instead of capturing pairwise interactions, such as correlations or inverse covariance, bayesian inference is capable of considering all nodes and inferring a global network and transmission parameters [7] . moreover, bayesian inference is capable of modeling noisy, partially sampled realistic outbreak data while incorporating prior information. while this review focuses on infectious disease transmission, network inference methods have implications in many areas. modeling network diffusion and influence, identifying important nodes, link prediction, influence probabilities and community topology and parameter detection are key questions in several fields ranging from genomics to social network analysis [43] . analogous frameworks can be developed with different modalities of observational genomics or clinical data to model information propagation and capture the influences of nodes, nodes that are more influential than others, and the temporal dynamics of information diffusion. for modeling information spread in such networks, influence and susceptibility of nodes can serve to be analogous to epidemic transmission parameters. however, these modified methods should also account for differences in the method of information propagation in such networks from infectious disease transmission by incorporating constraints in the form of temporal decay of infection, strengths of ties measured from biological domain knowledge, and multiple pathways of information spread. to identify the studies most relevant for this focused review, we queried pubmed. for practicality and relevance, our search, summarized in fig. 3 , was limited to papers from the last ten years. as our review is focused on infectious disease transmission network inference, we started with the keywords 'transmission' and 'epidemiological'. to ensure that we captured studies that incorporate pathogen genomic data, we added the keywords 'genetic', 'genomic' and 'phylogenetic' giving 5557 articles total. next, to narrow the results to those that are comprised of a study of multi-modal data, we found that the keywords 'combining' or 'integrating' alongside 'bayesian inference' or 'inference' were comprehensive. these filters yielded 73 and 61 articles in total. 
we found that some resulting articles focused on outbreak detection, sexually transmitted diseases, laboratory methods, and phylogenetic analysis. also, the focus of several articles was to either overlay information from different modalities or to sequentially analyze them to eliminate unlikely transmission pathways. after a full-text review to exclude these and focus on methodological approaches, 8 articles resulted which use bayesian inference to recover transmission networks from multimodal data for infectious diseases, and which represent the topic of this review. this included bayesian likelihood-based methods for integrating pathogen genomic information with temporal, spatial, and epidemiological characteristics for infectious diseases such as foot and mouth disease (fmd), and respiratory illnesses, including influenza. as incorporating genomic data simultaneously in analytical multimodal frameworks is a relatively novel idea, the literature on this is limited. recent unified platforms have been made available to the community for analysis of outbreaks and storing of outbreak data [44] . thus, it is essential to review available literature on this novel and burgeoning topic. for validation, we repeated our queries on google scholar. although google scholar generated a much broader range of papers, based on the types of papers indexed, we verified that it also yielded the articles selected from pubmed. we are confident in our choice of articles for review as we have used two separate publications databases. below we summarize the theoretical underpinnings of the likelihood-based framework approaches, inference parameters, and assumptions about each of these studies and articulate the limitations, which can motivate future research. infectious disease transmission study is a rapidly developing field given the recent advent of widely available epidemiological, social contact, social networking and pathogen genomic data. in this section we briefly review multimodal integration methods for combining pathogen genomic data and epidemiological data in a single analysis, for inferring infection transmission trees and epidemic dynamic parameters. advances in genomic technology such as sequences of whole genomes of rna viruses and identifying single nucleotide variations using sensitive mass spectrometry have enabled the tracing of transmission patterns and mutational parameters of the severe acute respiratory syndrome (sars) virus [45] . in this study, phylogenetic trees were inferred based on phylogenetic analysis using parsimony (paup â�� ) using a maximum likelihood criterion [46] . mutation rate was then inferred based on a model which assumes that the number of mutations observed between an isolate and its fig. 3 . study design and inclusion-exclusion criteria. this is a decision tree showing our searches and selection criteria for both pubmed and google scholar. we focused only on genomic epidemiology methods utilizing bayesian inference for infectious disease transmission. ancestor is proportional to the mutation rate and their temporal difference [47] . their estimated mutation rate was similar to existing literature on mutation rates of other viral pathogens. phylogenetic reconstruction revealed three major branches in taiwan, hong kong, and china. gardy et al. [29] analyzed a tuberculosis outbreak in british columbia in 2007 using whole-genome pathogen sequences and contact tracing using social network information. 
epidemiological information collection included completing a social network questionnaire to identify contact patterns, high-risk behaviors such as cocaine and alcohol usage, and possible geographical regions of spread. pathogen genomic data consisted of restriction-fragmentlength polymorphism analysis of tuberculosis isolates. phylogenetic inference of genetic lineage based on single nucleotide polymorphisms from the genomic data was performed. their method demonstrated that transmission information inference such as identifying a possible source patient from contact tracing by epidemiological investigation can be refined by adding ancestral and diversity information from genomic data. in one of the earliest attempts to study genetic sequence data, as well as dates and locations of samples in concert, jombart et al. [38] proposed a maximal spanning tree graph-based approach that went beyond existing phylogenetic methods. this method was utilized to uncover the spatiotemporal dynamics of the influenza a (h1n1) from 2009 and to study its worldwide spread. a total of 433 gene sequences of hemagglutinin (ha) and of neuraminidase (na) were obtained from genbank. classical phylogenetic approaches fail to capture the hierarchical relationship between both ancestors and descendants sampled at the same time. using their algorithm called seqtrack [48] , the authors constructed ancestries in samples based on a maximal-spanning tree. seqtrack [38] utilizes the fact that in the absence of recombination and reverse mutations, strains will have unique ancestors characterized by the fewest possible mutations, no sample can be the ancestor of a sample which temporally preceded it, and the likelihood of ancestry can be estimated from the genomic differentiation between samples. seqtrack was successful in reconstructing the transmission trees in both completely and incompletely sampled outbreaks unlike phylogenetic approaches, which failed to capture ancestral relationships between the tips of trees. however, this method cannot capture the underlying within-host virus genetic parameters. moreover, mutations generated once can be present in different samples and transmission likelihood based on genetic distance may not be reliable. the above methods exploit information from different modalities separately. recent methodological advancements have seen simultaneous integration of multiple modalities of data in singular bayesian inference frameworks. in the following section we discuss state-of-the-art approaches based on bayesian inference, to reconstruct partially-observed transmission trees and multiple origins of pathogen introduction in a host population [25, 34, 35, 49, 50] . we specifically focus on bayesian likelihood-based methods as the methods consider heterogeneous modalities in a single framework and simultaneously infer the transmission network and epidemic parameters such as rate of infection transmission and rate of recovery. infectious disease transmission network inference is one problem area wherein there is a foundational literature of bayesian inference methods; reviewing them together allows understanding and comparison of specific related features across models. methods are summarized in table 1 . in bayesian inference, information recorded before the study is included as a prior in the hypothesis. based on bayes theorem as shown below, this method incorporates prior information and likelihoods from the sample data to compute a posterior probability distribution or, pã°hypothesisjdataã�. 
the denominator is a normalization constant or, the marginal probability density of the sample data computed over all hypotheses [51] . the hypothesis for this problem can be expressed in the form of a transmission network over individuals, locations, or farms, parameters such as rate of infectiousness and recovery, or mutation probability of pathogens. the posterior probability distribution can then be estimated as in the equation below. the posterior probability is then a measure that the inferred transmission tree and parameters are correct. it can be extremely difficult to analytically compute the posterior probability distribution as it involves iterating over all possible combinations of branches of such a transmission tree and parameter values. however, it is possible to approximate the posterior probability distribution using mcmc [52] techniques. in mcmc, a markov chain is constructed which is described by the state space of the parameters of the model and which has the posterior probability distribution as its stationary distribution. for an iteration of the mcmc, a new tree is proposed by stochastically altering the previous tree. the new tree is accepted or rejected based on a probability computed from a metropolis-hastings or gibbs update. the quality of the results from the mcmc approximation can depend on the number of iterations that it is run for, the convergence criterion and the accuracy of the update function [22] . cottam et al. [40] developed one of the earliest methods to address this problem studying foot-and-mouth disease (fmd) in twenty farms in the uk. in this study, fmd virus genomes (the fmd virus has a positive strand rna genome and it is a member of the genus aphthovirus in the family picornaviridae) were collected from clinical samples from the infected farms. the samples were chosen so that they could be used to study variation within the outbreak and the time required for accumulation of genetic change, and to study transmission events. total rna was extracted directly from epithelial suspensions, blood, or esophageal suspensions. sanger sequencing was performed on 42 overlapping amplicons covering the genome [53] . as the rna virus has a high substitution rate, the number of mutations was sufficient to distinguish between different farms. they designed a maximum likelihood-based method incorporating complete genome sequences, date at which infection in a farm was identified, and the date of culling of the animals. the goal was to trace the transmission of fmd in durham county, uk during the 2001 outbreak to infer the date of infection of animals and most likely period of their infectiousness. in their approach, they first generated the phylogenies of the viral genomes [54, 55] . once the tip of the trees were generated, they constructed possible transmission trees by recursively working backwards to identify a most recent common ancestor (mrca) in the form of a farm and assigned each haplotype to a farm. the likelihood of each tree was then estimated using epidemiological data. their study included assumptions of the mean incubation time prior to being infectious to be five days, the distribution of incubation times to follow a discrete gamma distribution, the most likely date of infection to be the date of reporting minus the date of the oldest reported lesion of the farm minus the mean incubation time, and the farms to be a source of infection immediately after being identified as infected up to the day of culling. 
spatial dependence in the transmission events was determined from the transmission tree by studying mean transmission distance. [25] developed a bayesian likelihood-based framework integrating genetic and epidemiological data. this method was tested on an epidemic dataset of 241 poultry farms in an epidemic of avian influenza a (h7n7) in the netherlands in 2003 consisting of geographical, genomic, and date of culling data. consensus sequences of the ha, na and polymerase pb2 genes were derived by pooling sequence data from five infected animals for 185 out of the 241 farms analyzed. the likelihood of one farm infecting another increased if the former was not culled at the time of infection of the latter, if they were in geographical proximity, or if the sampled pathogen genomic sequences were related. their model included several assumptions such as non-correlation of genetic distance, time of infection, and geographical distance between host and target farms. the likelihood function was generated as follows: for the temporal component, a farm could infect another if its infection time was before the infection time of the target farm or if the infection time of the latter was between the infection and culling time of the former. if a farm was already culled, its infectiousness decayed exponentially. for the geographical component, two farms could infect each other with likelihood equal to the inverse of the distance between them. this likelihood varied according to a spatial kernel. for the genomic component, probabilities of transitions and transversions, and the presence or absence of a deletion was considered. if there was no missing data, the likelihood function was just a product of independent geographical, genomic, and temporal components. this method also allowed missing data by assuming that all the links to a specific missing data type are in one subtree. mcmc [52] was performed to sample all possible transmission trees and parameters. marginalizing over a large number of subtrees over all possible values can also prove computationally expensive. mutations were assumed to be fixed in the population before or after an infection, ignoring a molecular clock. in the method by morelli et al. [24] , the authors developed a likelihood-based function that inferred the transmission trees and infection times of the hosts. the authors assumed that a premise or farm can be infected at a certain time followed by a latency period, a time period from infectiousness to detection, and a time of pathogen collection. this method utilized the fmd dataset from the study by cottam et al. in order to simplify the posterior distribution further, latent variables denoting unobserved pathogens were removed and a pseudo-distribution incorporating the genetic distance between the observed and measured consensus sequences was generated. the posterior distribution corresponded to a pseudo-posterior distribution because the pathogens were sampled at observation time and not infection time. the genetic distance was measured by hamming distance between sequences in isolation without considering the entire genetic network. several assumptions including independence of latency time and infectiousness period were made. in determining the interval from the end-of-latency period to detection, the informative prior was centered on lesion age. this made this inference technique sensitive to veterinary estimates of lesion age. 
this study considered a single source of viral introduction in the population, which is feasible if the population size considered is small. this technique did not incorporate unobserved sources of infection and assumed all hosts were sampled. the authors also assumed that each host had the same probability of being infected. teunis et al. [56] developed a bayesian inference framework to infer transmission probability matrices. the authors assumed that likelihood of infection transmission over all observed individuals would be equal to the product of conditional probability distributions between each pair of individuals i and j, and the correspond-ing entry from the transition probability matrix representing any possible transmissions from ancestors to i. the inferred matrices could be utilized to identify network metrics such as number of cases infected by each infected source and transmission patterns could be detected by analyzing pairwise observed cases during an outbreak. the likelihood function could be generated by observed times of onset, genetic distance, and geographical locations. their inferred parameters were the transmission tree and reproductive number. their method was applied to a norovirus outbreak in a university hospital in netherlands. in a method developed by ypma et al. [34] , the statistical framework for inferring the transmission tree simultaneously generated the phylogenetic tree. this method also utilized the fmd dataset from the study by cottam et al. their approach for generating the joint posterior probability of the transmission tree differed from existing methods in including the simultaneous estimation of the phylogenetic tree and within-host dynamics. the posterior probability distribution defined a sampling space consisting of the transmission tree, epidemiological parameters, and withinhost dynamics which were inferred from the measured epidemiological data and the phylogenetic tree and mutation parameters which were inferred from the pathogen genomic data. the posterior probability distribution was estimated using the mcmc technique. the performance of the method was evaluated by measuring the probability assigned to actual transmission events. the assumptions made were that all infected hosts were observed, time of onset was known, sequences were sampled from a subpopulation of the infected hosts, and a single source/host introduced the infection in the population. in going beyond existing methods, the authors did not assume that events in the phylogenetic tree coincide with actual transmission events. a huge sampling fraction would be necessary to capture such microscale genetic diversity. this method works best when all infected hosts are observed and sampled. mollentze et al. [49] have used multimodal data in the form of genomic, spatial and temporal information to address the problem of unobserved cases, an existing disease well established in a population, and multiple introductions of pathogens. their method estimated the effective size of the infected population thus being able to provide insight into number of unobserved cases. the authors modified morelli et al.'s method described above by replacing the spatial kernel with a spatial power transmission kernel to accommodate wider variety of transmission. in addition, the substitution model used by morelli et al. was modified by a kimura three parameter model [57] . this method was applied to a partially-sampled rabies virus dataset from south africa. 
the separate transmission trees from partially-observed data could be grouped into separate clusters with most transmissions in the under-sampled dataset being indirect transmissions. reconstructions were sensitive to choice of priors for incubation and infectious periods. in a more recent approach to study outbreaks and possible transmission routes, jombart et al. [35] , in addition to reconstructing the transmission tree, addressed important issues such as inferring possible infection dates, secondary infections, mutation rates, multiple pathways of pathogen introduction, foreign imports, unobserved cases, proportion of infected hosts sampled, and superspreading in a bayesian framework. jombart tested their algorithm outbreaker on the 2003 sars outbreak in singapore using 13 known cases of primary and secondary infection [35, 45, 58] . in this study, 13 genome sequences of severe acute respiratory syndrome (sars) were downloaded from genbank and analyzed. their method relies on pathogen genetic sequences and collection dates. similar to their previous approach [50] , their method assumed mutations to be parameters of transmission events. epidemiological pseudo-likelihood was based on collection dates. genomic pseudo-likelihood was computed based on genetic distances between isolates. this method would benefit from known transmission pathways and mutation rates and is specifically suitable for densely sampled outbreaks. their method assumed generation time-time from primary to secondary infections-and time from infection to collection were available. their method ignored within-host diversity of pathogens. instead of using a strict molecular clock, this method used a generational clock. didelot et al. [26] developed a framework to examine if wholegenome sequences were enough to capture transmission events. unlike other existing studies, the authors took into account within-host evolution and did not assume that branches in phylogenetic trees correspond to actual transmission events. the generation time corresponds to the time between a host being infected and infecting others. for pathogens with short generation times, genetic diversity may not accrue to a very high degree and one can ignore within-host diversity. however, for diseases with high latency times and ones in which the host remains asymptomatic, there is scope for accumulation of considerable within-host genetic diversity. their method used a timed phylogenetic tree from which a transmission tree is inferred on its own or can be combined with any available epidemiological support. their simulations revealed that considering within-host pathogen generation intervals resulted in more realistic phylogenies between infector and infected. the method was tested on simulated datasets and with a real-world tuberculosis dataset with a known outbreak source with only genomic data and then modified using any available epidemiological data. the latter modified network resembled more the actual transmission activity in having a web-like layout and fewer bidirectional links. their approach would work well for densely sampled outbreaks. some of the most common parameters inferred for infectious disease transmission in these bayesian approaches are the transmission tree between infected individuals or animals, the mutation rates of different pathogens, phylogenetic tree, within-host diversity, latency period, and infection dates [24, 34, 40, 26] . 
additional parameters in recent work are reproductive number [26] , foreign imports, superspreaders, and proportion of infected hosts sampled [35] . several simplifying assumptions have been made in the reviewed bayesian studies, limiting their applicability across different epidemic situations. in cottam's [40] approach, the phylogenetic trees generated from the genomic data are weighed by epidemiological factors to limit analysis to possible transmission trees. however, sequential approaches may not be ideal to reconstruct transmission trees and a method that combines all modalities in a single likelihood function may be necessary. ypma et al. [25] assumed that pathogen mutations emerge in the host population immediately before or following infections. moreover, the approach weighed each data type via their likelihood functions and considers each data type independent of the others, which may not be a realistic assumption. jombart et al. [38] also inferred ancestral relationships to the most closely sampled ancestor as all ancestors may not be sampled. morelli et al. [24] assumed flat priors for all model parameters. however, the method was estimated with the prior for the duration from latency to infection centered on the lesion age making the method sensitive to it and to veterinary assessment of infection age. the method developed by mollentze et al. [49] required knowledge of epidemiology for infection and incubation periods. identifying parents of infected nodes, as proposed by teunis et al., [56] assumes that all infectious cases were observed which may not be true in realistic, partiallyobserved outbreaks. didelot et al. [26] developed a framework based on a timed phylogenetic tree, which infers within-host evolutionary dynamics with a constant population size and denselysampled outbreaks. several of these approaches rely on assumptions of denselysampled outbreaks, a single pathogen introduction in the population, single infected index cases, samples capturing the entire outbreak, that all cases comprising the outbreak are observed, existence of single pathogen strains, and all nodes in the transmission network having constant infectiousness and the same rate of transmission. however, in real situations the nodes will have different infectiousness and rate of spreading from animal to animal, or human to human. moreover, the use of clinical data only is nonrepresentative of how infection transmits to a population as it generally only captures the most severely affected cases. our literature review is summarized in table 1 . as large-scale and detailed genomic data becomes more available, analyses of existing bayesian inference methods described in our review will inform their integration in epidemiological and other biomedical research. as more and more quantities of diverse data becomes available, developing bayesian inference frameworks will be the favored tool to integrate information and draw inference about transmission and epidemic parameters simultaneously. the specific focus in this review on the application of network inference in infectious disease transmission enables us to consider and comment on common parameters, data types and assumptions (summarized in table 1 ). novel data sources have increased the resolution of information as well as enabled a closer monitoring and study of interactions; spatial and genomic resolution of the bayesian network-inference studies reviewed are summarized in fig. 4 to illustrate the scope of current methods. 
further, we have added suggestions for addressing identified challenges in these methods regarding their common assumptions and parameters in table 2 . given the increasing number and types of biomedical data available, we also discuss how models can be augmented to harness added value from these multiple and highergranularity modalities such as minor variant identification from deep sequencing data or community-generated epidemiological data. existing methods are based on pathogen genome sequences which may largely be consensus in nature where the nucleotide or amino acid residue at any given site is the most common residue found at each position of the sequence. other recent approaches have reconstructed epidemic transmission using whole genome sequencing. detailed viral genomic sequence data can help distinguish pathogen variants and thus augment analysis of transmission pathways and host-infectee relationships in the population. highly parallel sequencing technology is now available to study rna and dna genomes at greater depth than was previously possible. using advanced deep sequencing methods, minor variations that describe transmission events can be captured and must also then be represented in models [59, 60] . models can also be encumbered with considerable selection bias by being based on clinical or veterinary data representative of a subsample of only the most severely infected hosts who access clinics. existing multi-modal frameworks are designed based on clinical data such as sequences collected from cases of influenza [35, 38] or veterinary assessment of fmd [24, 53] , which generally represent the most severe cases with access to traditional healthcare institutions and automatically inherit considerable selection bias. models to-date do not consider participatory surveillance data that has become increasingly available via mobile and internet accessibility (e.g. data from web logs, search queries, web survey-based participatory efforts such as goviral with linked symptomatic, immunization, and molecular information [61] and online social networks and social network questionnaires). another approach to improve the granularity of collected data could be community-generated data. these data can be finegrained and can capture information on a wide range of cases from asymptomatic to mildly infectious to severe. this data can be utilized to incorporate additional transmission parameters of a community which can be more representative of disease transmission. as exemplified in fig. 4a , community-generated data can be collected at the fine-grained spatial level of households, schools, workplaces, or zip codes and models must then also accommodate these spatial resolutions. studies to-date have also generally depended on available small sample sizes and some are specifically tailored to a specific disease or pathogen such as sars, avian influenza, or fmd [34, 35, 40] . hiseq platform with m. tuberculosis cdc1551 reference sequence and aligned using burrows-wheeler aligner algorithm. sars dna sequences were obtained from genbank and aligned using muscle. for avian influenza, rna consensus sequences of the haemagglutinin, neuriminidase and polymerase pb2 genes were sequenced. for h1n1 influenza, isolates were typed for hemagglutinin (ha) and neuraminidase (na) genes. methods will have to handle missing data and unobserved and unsampled hosts to be applicable to realistic scenarios. 
in simpler cases, assumptions of single introductions of infection with single strains being passed between hosts may be adequate. however, robust frameworks will have to consider multiple introductions of pathogens in the host population with multiple circulating strains and co-infections in hosts. in order to be truly useful, frameworks have to address questions regarding rapid mutations of certain pathogens, phylogenetic uncertainty, recombination and reassortment, population stochastics, super spreading, exported cases, multiple introductions of pathogens in a population, within and between-host pathogen evolution, and phenotypic information. methods will also need to scale up to advances in nextgeneration sequencing technology capable of producing large amounts of genomic data inexpensively [62, 63] . in the study of infectious diseases, the challenge remains to develop robust statistical frameworks that will take into account the relationship between epidemiological data and phylogeny and utilize that to infer pathogen transmission while taking into account realistic evolutionary times and accumulation of withinhost diversity. moreover, to benefit public health inference methods need to uncover generic transmission patterns, wider range of infections and risks including asymptomatic to mildly infectious cases, clusters and specific environments, and host types. network inference frameworks from the study of infectious diseases can be analogously modified to incorporate diverse forms of multimodal data and model information propagation and interactions in diverse applications such as drug-target pairs and neuronal connectivity or social network analysis. the detailed examination of models, data sources and parameters performed here can inform inference methods in different fields, and bring to light the way that new data sources can augment the approaches. in general, this will enable understanding and interpretation of influence and information propagation by mapping relationships between nodes in other applications. 
review of multimodal integration methods for transmission network inference a comprehensive assessment of methods for de-novo reverse-engineering of genome-scale regulatory networks sparse and compositionally robust inference of microbial ecological networks model-free reconstruction of excitatory neuronal connectivity from calcium imaging signals dialogue on reverse-engineering assessment and methods molecular ecological network analyses marine bacterial, archaeal and protistan association networks reveal ecological linkages network modelling methods for fmri modeling the worldwide spread of pandemic influenza: baseline case and containment interventions a 'smallworld-like' model for comparing interventions aimed at preventing and controlling influenza pandemics reducing the impact of the next influenza pandemic using household-based public health interventions estimating the impact of school closure on influenza transmission from sentinel data model-free reconstruction of excitatory neuronal connectivity from calcium imaging signals network modelling methods for fmri sparse and compositionally robust inference of microbial ecological networks a bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes mvda: a multi-view genomic data integration methodology information content and analysis methods for multi-modal high-throughput biomedical data a novel computational framework for simultaneous integration of multiple types of genomic data to identify microrna-gene regulatory modules a kernel-based integration of genome-wide data for clinical decision support predicting the prognosis of breast cancer by integrating clinical and microarray data with bayesian networks a review of multivariate methods for multimodal fusion of brain imaging data bayesian inference of phylogeny and its impact on evolutionary biology mrbayes 3.2: efficient bayesian phylogenetic inference and model choice across a large model space a bayesian inference framework to reconstruct transmission trees using epidemiological and genetic data unravelling transmission trees of infectious diseases by combining genetic and epidemiological data bayesian inference of infectious disease transmission from whole-genome sequence data methods of integrating data to uncover genotype-phenotype interactions why we need crowdsourced data in infectious disease surveillance wholegenome sequencing and social-network analysis of a tuberculosis outbreak novel clinico-genome network modeling for revolutionizing genotype-phenotype-based personalized cancer care integrative, multimodal analysis of glioblastoma using tcga molecular data, pathology images, and clinical outcomes an informatics research agenda to support precision medicine: seven key areas the foundation of precision medicine: integration of electronic health records with genomics through basic, clinical, and translational research relating phylogenetic trees to transmission trees of infectious disease outbreaks bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data extracting transmission networks from phylogeographic data for epidemic and endemic diseases: ebola virus in sierra leone h1n1 pandemic influenza and polio in nigeria the role of pathogen genomics in assessing disease transmission reconstructing disease outbreaks from genetic data: a graph approach molecular epidemiology: application of contemporary techniques to the typing of microorganisms integrating genetic and 
epidemiological data to determine transmission pathways of foot-and-mouth disease virus the distribution of pairwise genetic distances: a tool for investigating disease transmission the mathematics of infectious diseases dynamics and control of diseases in networks with community structure outbreaktools: a new platform for disease outbreak analysis using the r software mutational dynamics of the sars coronavirus in cell culture and human populations isolated in phylogenetic analysis using parsimony (and other methods). version 4, sinauer associates molecular evolution and phylogenetics adegenet: a r package for the multivariate analysis of genetic markers a bayesian approach for inferring the dynamics of partially observed endemic infectious diseases from space-time-genetic data bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data bayesian inference in ecology an introduction to mcmc for machine learning molecular epidemiology of the foot-and-mouth disease virus outbreak in the united kingdom in tcs: a computer program to estimate gene genealogies a cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping and dna sequence data. iii. cladogram estimation infectious disease transmission as a forensic problem: who infected whom? estimation of evolutionary distances between homologous nucleotide-sequences comparative full-length genome sequence analysis of 14 sars coronavirus isolates and common mutations associated with putative origins of infection extensive geographical mixing of 2009 human h1n1 influenza a virus in a single university community quantifying influenza virus diversity and transmission in humans surveillance of acute respiratory infections using community-submitted symptoms and specimens for molecular diagnostic testing eight challenges in phylodynamic inference sequencing technologies-the next generation the authors declare no conflict of interest. key: cord-319055-r16dd0vj authors: dumitrescu, cătălin; minea, marius; costea, ilona mădălina; cosmin chiva, ionut; semenescu, augustin title: development of an acoustic system for uav detection † date: 2020-08-28 journal: sensors (basel) doi: 10.3390/s20174870 sha: doc_id: 319055 cord_uid: r16dd0vj the purpose of this paper is to investigate the possibility of developing and using an intelligent, flexible, and reliable acoustic system, designed to discover, locate, and transmit the position of unmanned aerial vehicles (uavs). such an application is very useful for monitoring sensitive areas and land territories subject to privacy. the software functional components of the proposed detection and location algorithm were developed employing acoustic signal analysis and concurrent neural networks (conns). an analysis of the detection and tracking performance for remotely piloted aircraft systems (rpass), measured with a dedicated spiral microphone array with mems microphones, was also performed. the detection and tracking algorithms were implemented based on spectrograms decomposition and adaptive filters. in this research, spectrograms with cohen class decomposition, log-mel spectrograms, harmonic-percussive source separation and raw audio waveforms of the audio sample, collected from the spiral microphone array—as an input to the concurrent neural networks were used, in order to determine and classify the number of detected drones in the perimeter of interest. in recent years, the use of small drones has increased dramatically. 
illegal activity with these uavs has also increased, or at least became more evident than before. recently, it has been reported that such vehicles have been employed to transport drugs across borders, to transport smugglers to prisons, to breach the security perimeter of airports and to create aerial images of senzitive facilities. to help protect against these activities, a drone detection product could warn of a security breach in due time to take action. this article tries to answer the following questions: (1) is it possible to build an audio detection, recognition, and classification system able to detect the presence of several drones in the environment, with relatively cheap commercial equipment (cots)? (2) assuming that it can function as a prototype, what challenges could be raised when scaling the prototype for practical use? the questions will be approached in the context of a comparison between the performance of systems using concurrent neural networks and the algorithm proposed by the authors. the proposed solution employs for the acoustic drone detector competing neural networks with spectrogram variants both in frequency and psychoacoustic scales, and increased performance for neural network architectures. two concepts are investigated in this work: (i) the way that a concept of competition in a collection of neural networks can be implemented, and (ii) how different input data can influence the performance of the recognition process in some types of neural networks. the subject of this article is in the form recognition domain, that offers a very broad field of research. recognition of acoustic signatures is a challenging task, grouping a variety of issues, which include the recognition of isolated characteristic frequencies and identification of unmanned aerial vehicles, based on their acoustic signatures. neural networks represent a tool that has proven its effectiveness in solving a wide range of applications, including automated speech recognition. most neural models approach form recognition as a unitary, global problem, without distinguishing between different input intakes. it is a known fact that the performance of neural networks may be improved via modularity and by applying the "divide et impera" principle. in this paper, the identification and classification of uavs is performed by the means of two neural networks: the self-organizing map (som) and the concurrent neural network (conn). the newly introduced conn model combines supervised and unsupervised learning paradigms and provides a solution to the first problem. a process of competition is then employed in a collection of neural networks that are independently trained to solve different sub-problems. this process is accomplished by identifying the neural network which provides the best response. as experimental results demonstrate, a higher accuracy may be obtained when employing this proposed algorithm, compared to those employed in non-competitive cases. several original recognition models have been tested and the theoretical developments and experimental results demonstrate their viability. the obtained databases are diverse, being both standard collections for different types of uavs' soundings and sets made specifically for the experiments in this paper, containing acoustic signatures of proprietary drones. 
based on the tests performed on some models and standard form recognition data sets, it can be illustrated that these may be also used in contexts other than the recognition of acoustic signals generated by drones. in order to reduce the complexity of recognition through a single neural network of the entire collection of isolated acoustic frequency of all drones, a solution of a modular neural network, consisting of neural networks specialized on subproblems of the initial problem has been chosen. the concurrent neural networks classification has been introduced as a collection of low-volume neural networks working in parallel, where the classification is made according to the rule where the winner takes all. the training of competing neural networks starts from the assumption that each module is trained with its own data set. the system is made up of neural networks with various architectures. multi-layered perceptron types, time-lagged and self-mapping neural network types have been used for this particular case, but other variants may also be employed. the recognition scheme consists of a collection of modules trained on a subproblem and a module that selects the best answer. the training and recognition algorithms implement these two techniques that are custom for multilayer perceptron (mlp), time delayed neural networks (tdnn) and self-organizing maps (som). mlp-conn and tdnn-conn use supervised trained modules and training instruction sets contain both positive and negative examples. in contrast, som-conn consists of modules that are trained by an unsupervised algorithm and the data consist only of positive examples. the remaining of this article is organized as follows: section 2 presents a selective study on similar scientific works (related work), section 3, the problem definition and the proposed solution, section 4, employing conns in the uav recognition process, section 5 our experimental results and section 6, the discussion and conclusions. physical threats that may arise from unauthorized flying of uavs over forbidden zones is analyzed by other researchers [7] , along with reviewing various uav detection techniques based on ambient radio frequency signals (emitted from drones), radars, acoustic sensors, and computer vision techniques for detection of malicious uavs. in s similar work [8] , the detection and tracking of multiple uavs flying at low altitude is performed with the help of a heterogeneous sensor network consisting of acoustic antennas, small frequency modulated continuous wave (fmcw) radar systems and optical sensors. the researchers applied acoustics, radar and lidar to monitor a wide azimuthal area (360 • ) and to simultaneously track multiple uavs, and optical sensors for sequential identification with a very narrow field of view. in [9] the team presents an experimental system dedicated for the detection and tracking of small aerial targets such as unmanned aerial vehicles (uavs) in particular small drones (multi-rotors). a system for acoustic detection and tracking of small objects in movement, such as uavs or terrestrial robots, using acoustic cameras is introduced in [10] . in their work, the authors deal with the problem of tracking drones in outdoor scenes, scanned by a lidar sensor placed on the ground level. for detecting uavs the researchers employ a convolutional neural network approach. afterwards, kalman filtering algorithms are used as a cross-correlation filtering, then a 3d model is built for determining the velocity of the tracked object. 
other technologies involved in unauthorized flying of drones over restricted areas include passive bistatic radar (pbr) employing a multichannel system [11] . in what concerns the usage of deep neural networks in this field of activity, aker and kalkan [12] present a solution using an end-to-end object detection model based on convolutional neural networks employed for drone detection. the authors' solution is based on a single shot object detection model, yolov2 [13] , which is the follow-up study of yolo w. for a better selection of uavs from the background, the model is trained to separate these flying objects from birds. in the conclusion section, the authors state that by using this method drones can be detected and distinguished from birds using an object detection model based on a cnn. further on, liu et al. [14] employ an even more complex system for drone detection, composed from a modular camera array system with audio assistance, which consists of several high-definition cameras and multiple microphones, with the purpose to monitor uavs. in the same area of technologies, popovic et al. employ a multi-camera sensor design acquiring near-infrared (nir) spectrum for detecting mini-uavs in a typical rural country environment. they notice that the detection process needs detailed pixel analysis between two consecutive frames [15] . similarly, anwar et al. perform drone detection by extracting the required features from adr sound, mel frequency cepstral coefficients (mfcc), and implementing linear predictive cepstral coefficients (lpcc). classification is performed after the feature extraction, and support vector machines (svm) with various kernels are also used for improving the classification of the received sound waves [16] . supplementary, the authors state that " . . . the experimental results verify that svm cubic kernel with mfcc outperform lpcc method by achieving around 96.7% accuracy for adr detection". moreover, the results verified that the proposed ml scheme has more than 17% detection accuracy, compared with correlation-based drone sound detection scheme that ignores ml prediction. a study on the cheap radiofrequency techniques for detecting drones is presented by nguyen et al. [17] , where they focus on autonomously detection and characterization of unauthorized drones by radio frequency wireless signals, using two combined methods: sending a radiofrequency signal and analyzing its reflection and passive listening of radio signals, process subjected to a second filtration analysis. an even more complex solution for drone detection using radio waves is presented by nuss et al. in [18] , where the authors employ a system setup based on mimo ofdm radar that can be used for detection and tracking of uavs on wider areas. keeping the research in the same field, the authors of [19] present an overview on passive drone detection with a software defined radio (sdr), using two scenarios. the authors state that "operation of a non-los environment can pose a serious challenge for both passive methods". it has been shown that the drone flight altitude may play a significant role in determining the rician factor and los probability, which in turn affects the received snr. several other approaches are presented in similar work [20, 21] . in what concerns the acoustic signature recognition, the scientific literature is comparatively rich. bernadini et al. obtained a resulting accuracy of the drone recognition of 98.3% [22] . yang et al. 
also propose an uav detection system with multiple acoustic nodes using machine learning models, with an empirically optimized configuration of the nodes for deployment. features including mel-frequency cepstral coefficients (mfcc) and short-time fourier transform (stft) were used by these researchers for training. support vector machines (svm) and convolutional neural networks (cnn) were trained with the data collected in person. the purpose was to determine the ability of this setup to track trajectories of flying drones [23] . in noisy environments, sound signature of uavs is more difficult to recognize. moreover, there are different environments with specific background soundings. lin shi et al. deal with this challenge and present an approach to recognize drones via sounds emitted by their propellers. in their paper, the authors declare that experimental results validate the feasibility and effectiveness of their proposed method for uav detection based on sound signature recognition [24] . similar work is described in papers [25] and [26] . finally, it can be concluded that this research field is very active and there are several issues that haven't been yet fully addressed, such as separation of the uav from environment (birds, obstructing trees, background mountains, etc.), issue depending very much on the technology chosen for drone detection. however, one approach proves its reliability-that is the usage of multisensory constructions, where weaknesses of some technologies can be compensated by others. therefore, we consider that employing a multisensory approach has more chances of success than using a single technology. classification of environmental sound events is a sub-field of computational analysis of auditory scenes, which focuses on the development of intelligent detection, recognition, and classification systems. detecting the acoustic fingerprints of drones is a difficult task because the specific acoustic signals are masked by the noises of the detection environment (wind, rain, waves, sound propagation in the open field/urban areas). unlike naturally occurring sounds, drones have distinctive sound characteristics. taking advantage of this aspect, the first part of the article focuses on building an audio detection, recognition, and classification system for the simultaneous detection of several drones in the scene. as presented in the initial part of this work, the main task of the proposed system is to detect unauthorized flying of uavs over restricted areas, by locating these vehicles and tracking them. the difficulty of the process resides in the environmental noise, and the visibility at the moment of detection. different types of microphones and a specific arrangement is used for improving the performance of the acoustic detection component. thus, the system employed for detection, recognition and automatic classification of drones using the acoustic fingerprint is composed of a hardware-software assembly as shown in figures 1 and 2 . the first functional component to be addressed is the sensing block, composed of an area of spiral-type microphones with mems, in the acoustic fields, with a spiral arrangement, shown in figure 2 . the microphone area is composed of 30 spiral-shaped mems digital microphones, so as to achieve adaptive multi-channel type weights with variable pitch. 
the following components have been employed for the microphone array: knowles (knowles electronics, llc, itasca, il, usa) mems microphones with good acoustic response types (typically 20 hz to >20 khz +/− 2 db frequency ratings). the system allows the detection of the presence of the acoustic signal of reduced complexity. for improving the quality of the received signal, adaptive methods to cancel the acoustic reaction, as well as adaptive methods to reduce the acoustic noise were also used. for the protection of the restricted area, the developed acoustic system was configured in a network composed of at least eight microphone array modules, arranged on the perimeter of the protected area. to increase the detection efficiency, the number of microphone array may also be increased, and the network of acoustic sensors can be configured both linearly and in depth, thus forming a safety zone around the protected area. performing acoustic measurements highlights the presence of a tonal component at frequencies of 200-5000 hz (small and medium drones-multicopter) and in the frequency range 200-10,000 hz (medium and large drones-multicopter), which is the typical sound emission of uav in the operation phase of flight. for medium and large multicopter drones the harmonics of the frequencies characteristic are also found over 10 khz (16) (17) (18) (19) (20) (21) (22) (23) (24) . the identification of this frequency is a sign of the presence of a uav in the environment. in figure 3 is presented the spiral microphone array simulation along with the beamforming analysis using multiple signal classification & direction of arrival (music doa). doa denotes the direction from which typically a propagation wave arrives at a point where a set of sensors are placed. the image in the right section shows the energetic detection of the acoustic signal generated by the drone's engines and rotors, detecting the location position (azimuth and elevation), for the two acoustic frequencies characteristic of drones (white color), represented on the frequency spectrum (bottom right). using the application in figure 3 we have tested the beamforming capabilities of the system and also directivity, using the spiral microphone array. in this simulation, the atmospheric conditions (turbulence) that may affect the propagation of sounds were not taken into account. employing a set of multiple microphones with beamforming and a signal processing technique used filtering in order to obtain a better signal reception increased the maximum detection distance in the presented mode. the process that is common for all forms of acoustic signals recognition systems is the extraction of characteristic vectors from uniformly distributed segments of time of the sampled sound signal. prior to extraction of these features, the uav generated signal must undergo the following processes: (a) filtering: the detector's input sound needs filtering to get rid of unwanted frequencies. on the other hand, the filter must not affect the reflection coefficients. in the experiments an iir notch adaptive filter has been used. (b) segmentation: the acoustic signal is non-stationary for a long-time observation, but quasistationary for short time periods, i.e., 10-30 ms, therefore the acoustic signal is divided into fixed-length segments, called frames. for this particular case, the size of a frame is 20 ms, with a generation period of 10 ms, so that a 15 ms overlap occurs from one window to the next one. 
(c) attenuation (windowing): each frame is multiplied by a window function, usually a hamming window, to mitigate the edge effects introduced by segmentation. (d) mel frequency cepstrum coefficient (mfcc) parameters: to recognize an acoustic pattern generated by the uav, it is important to extract specific features from each frame. many such features have been investigated, such as linear prediction coefficients (lpcs), which are derived directly from the speech production process, as well as the perceptual linear prediction (plp) coefficients that are based on the auditory system. however, in the last two decades, spectrum-based characteristics have become popular, especially because they come directly from the fourier transform. the spectrum-based mel frequency cepstrum coefficients are employed in this research, and their success is due to a filter bank that makes use of wavelet transforms for processing the fourier transform, with a perceptual scale similar to the human auditory system. also, these coefficients are robust to noise and flexible, due to the cepstrum processing. with the help of the mfcc coefficients specific to the sounds generated by uavs, recognition dictionaries for the training of neural networks are then shaped. (e) feature extraction for mfcc. the extraction algorithm of the mfcc parameters is shown in figure 4. the calculation steps are the following: • an fft is performed for each frame of the recording and half of it is removed. the spectrum of each frame is warped onto the mel scale, and thus mel spectral coefficients are obtained. • a discrete cosine transform is performed on the mel spectral coefficients of each frame, hence obtaining the mfcc. • the first two coefficients of the obtained mfcc are removed, as they varied significantly between different recordings of the same sound. liftering is done by replacing all mfcc except the first 14 by zero. the first mfcc coefficient of each frame is replaced by the log energy of the corresponding frame. delta and acceleration coefficients are computed from the mfcc to increase the dimension of the feature vector of the frames, thereby increasing the accuracy. • delta cepstral coefficients add dynamic information to the static cepstral features. for a short-time sequence c[n], the delta-cepstral features are typically defined as Δc[n] = c[n + m] − c[n − m], where n is the index of the analysis frame and in practice m is approximately 2 or 3. coefficients describing acceleration are found by replacing the mfcc in the above equation by the delta coefficients. • the feature vector is normalized by subtracting the mean from each element. • thus, each mfcc acoustic frame is transformed into a characteristic vector of size 35 and used to build learning dictionaries for feature training of concurrent neural networks (feature matching).

(f) linear prediction: the role of the adaptive filter is to best approximate the value of a signal at a given moment, based on a finite number of previous values. the linear prediction method allows very good estimates of the signal parameters, as well as relatively high computing speeds. prediction analysis is based on the fact that a sample can be approximated as a linear combination of the previous samples. by minimizing, over a finite interval, the sum of squared differences between the real signal samples and those obtained by linear prediction, a single set of coefficients, called prediction coefficients, can be determined.
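a minimal sketch of the frame-level feature extraction in steps (d)-(e) above, using librosa. the 20 ms frame and 10 ms hop follow the text; the number of cepstral coefficients kept, the final vector assembly (the paper reports 35 dimensions), and the energy-replacement details are approximations of the described pipeline, not the authors' exact settings.

```python
import numpy as np
import librosa

def uav_frame_features(y, sr=44100, n_mfcc=16, m=2):
    """Per-frame MFCC + delta + acceleration features (illustrative sketch)."""
    n_fft = int(0.020 * sr)          # 20 ms analysis window
    hop = int(0.010 * sr)            # 10 ms generation period
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop, window="hamming")
    mfcc = mfcc[2:, :]               # drop the first two coefficients, keeping 14

    # replace the first remaining coefficient by the per-frame log energy
    # (frame counts from stft padding and plain framing may differ slightly)
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
    log_energy = np.log(np.sum(frames.astype(float) ** 2, axis=0) + 1e-12)
    mfcc[0, : log_energy.shape[0]] = log_energy[: mfcc.shape[1]]

    # delta and acceleration coefficients; window width relates to m = 2 or 3
    delta = librosa.feature.delta(mfcc, width=2 * m + 1, order=1)
    accel = librosa.feature.delta(mfcc, width=2 * m + 1, order=2)

    features = np.vstack([mfcc, delta, accel])        # (42, n_frames) with these settings
    features -= features.mean(axis=1, keepdims=True)  # mean normalization
    return features.T                                  # one feature vector per frame

if __name__ == "__main__":
    y = np.random.default_rng(0).normal(size=44100)    # 1 s of noise as a stand-in signal
    print(uav_frame_features(y).shape)
```

the per-frame vectors returned here play the role of the recognition dictionaries used to train the concurrent neural networks described later.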
the estimation of the model parameters according to this principle leads to a set of linear equations, which can be solved efficiently to obtain the prediction coefficients. equations (2) and (3) are considered: h(z) = g / a(z) (2) and a(z) = 1 − Σ_(k=1..p) α_k z^(−k) (3), where h(z) is the z-domain transfer function of the linear model of the acoustic environment feedback and a(z) is the z-domain transfer function modeling the reverberations and multipath reflections of the environment. it is noted that it is possible to establish a connection between the constant gain factor g, the excitation signal, and the prediction error. in the case a_k = α_k (constant), the coefficients of the real predictor and of the model are identical and e(n) = g·s(n). this means that the input signal is proportional to the error signal. practically, it is assumed that the energy of the error signal is equal to that of the input signal. it should be noted, however, that for the uav-specific audio signal, if s(n) = δ(n), it is necessary for the order p of the predictor to be large enough to account for all the effects, including the possible occurrence of transient waves. in the case of sounds without a specific uav source, the signal s(n) is assumed to be white gaussian noise with unit variance and zero mean.

(g) time-frequency analysis. the analysis of the acoustic signals can be performed by one-dimensional or two-dimensional methods. one-dimensional methods perform the analysis only in the time domain or only in the frequency domain and generally have a low degree of complexity. although they have the advantage of offering, in many cases, a way of quickly evaluating and analyzing signals, in many situations, especially when analyzing the transients that appear in the acoustic signals generated by drones, the information obtained regarding their shape and parameters is limited and has a low degree of approximation. the second category of methods, the two-dimensional representations in the time-frequency domain, represents a powerful class of signal analysis tools, and it is therefore advisable to use them, if the situation allows, as a pre-processing of the signals in order to identify transient waves. these representations have the advantage of emphasizing certain "hidden" properties of the signals. from the point of view of the acoustic systems for detecting and analyzing the sound signals generated by drones, it is of interest to analyze the signals at the lowest level compared to the noise of the device. therefore, time-frequency analyses should be performed on signals affected by noise, the signal-to-noise ratio being of particular importance in assessing transient waves. a comparison is shown below in table 1. table 1 compares the properties verified by several time-frequency representations in cohen's class. the cohen class method involves the selection of the kernel function that best corresponds to the fundamental waveform describing the acoustic signatures specific to drones. thus, the shape of the kernel must be chosen based on the peak values (localization) and the amplitude of a "control" function. the frequency resolution corresponding to the spectrum analysis, which varies over time, is equal to the nyquist frequency divided by 2^n (n = 8). the resolution in the time domain is 2^n ms (n = 4), as required by the applied method. the class of time-frequency representations, in its most general form, has been described by cohen as c_x(t, ω) = (1/4π^2) ∫∫∫ x(u + τ/2) x*(u − τ/2) φ(θ, τ) e^(−jθt − jτω + jθu) du dτ dθ, where φ is an arbitrary function called the kernel function.
after the choice of this function, several specific cases are obtained, corresponding to certain distributions c(t, ω). the time-frequency representations in cohen's class must fulfill certain properties, and compliance with these properties is materialized by imposing certain conditions on the kernel function. the first two properties (p1 and p2) relate to temporal and frequency shifts (compatibility with filtering and modulation operations); for these conditions to be met, it may be observed that the kernel function φ must be independent of t and ω. two other properties that must characterize time-frequency representations refer to the conservation of the marginal laws; the restrictions corresponding to these properties are φ(θ, 0) = 1 and φ(0, τ) = 1, which constrain the form that the function φ can take. for the time-frequency representations to be real, the following condition must be met: φ(θ, τ) = φ*(−θ, −τ). the most representative time-frequency distributions in cohen's class are presented in table 2. according to tables 1 and 2, the wigner-ville transform has the highest number of properties, which justifies the special attention given to it hereafter.

the wigner-ville distribution. the wigner-ville cross-distribution of two signals is defined by w_(x,y)(t, ω) = ∫ x(t + τ/2) y*(t − τ/2) e^(−jωτ) dτ, and the wigner-ville self-distribution of a signal is given by w_x(t, ω) = ∫ x(t + τ/2) x*(t − τ/2) e^(−jωτ) dτ. the wigner-ville distribution can be regarded as a short-time fourier transform in which the window continuously adapts to the signal, because this window is nothing but the signal itself, reversed in time. the wigner-ville transform is thus obtained as a result of the following operations: (a) at any moment t, multiply the signal by its conjugate "mirror image" relative to the moment of evaluation, x(t + τ/2) x*(t − τ/2); (b) calculate the fourier transform of the result of this multiplication with respect to the lag variable τ. one of the properties of this time-frequency representation is that it can also be defined starting from the spectral functions, which gives w_x(t, ω) = (1/2π) ∫ X(ω + θ/2) X*(ω − θ/2) e^(jθt) dθ.

using the application presented in figure 5, the spectrograms related to the sounds produced by uavs are obtained, and the results are used for the neural network training files. for training, 30 files with wigner-ville spectrograms were created, each file containing 200 spectrogram images of 128 × 128 pixels. in total, about 6000 training spectrograms were employed for the neural network. the presented quadratic representations, which are part of the broader category described by cohen's class, provide excellent time-frequency analysis properties for acoustic signals. following the experiments carried out, some important aspects can be emphasized regarding the analysis of the acoustic signals generated by drones using the wigner-ville time-frequency distributions of cohen's class: the energy structure of the analyzed signals can be identified and located with good accuracy in the time-frequency plane; when the type, duration, frequency, and temporal arrangement of the signals are not known a priori, they can be estimated using time-frequency distributions; these analysis algorithms can be implemented in systems for analyzing the transient acoustic signals generated by drones; and useful databases can be created to identify the transient acoustic signals of the drones detected in the environment, as their "signature" can be individualized using the wigner-ville time-frequency representations.
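a minimal sketch of the discrete wigner-ville computation described above (multiply the analytic signal by its conjugate mirror image and fourier-transform over the lag). block processing of long recordings, smoothing kernels, and the resizing to 128 × 128 training images are not reproduced here and would be additional, assumed steps.

```python
import numpy as np
from scipy.signal import hilbert

def wigner_ville(x):
    """Discrete Wigner-Ville distribution of a 1-d real signal (sketch).

    Returns an (n, n) array: rows are frequency bins (bin k maps to k*fs/(2n)),
    columns are time samples. The analytic signal is used to reduce aliasing.
    Memory grows as n^2, so long recordings should be cut into short blocks first.
    """
    z = hilbert(np.asarray(x, dtype=float))        # analytic signal
    n = len(z)
    wvd = np.zeros((n, n))
    for t in range(n):
        tau_max = min(t, n - 1 - t)                # lags that stay inside the signal
        tau = np.arange(-tau_max, tau_max + 1)
        acf = z[t + tau] * np.conj(z[t - tau])     # instantaneous autocorrelation
        buf = np.zeros(n, dtype=complex)
        buf[tau % n] = acf                         # circular placement of the lags
        wvd[:, t] = np.real(np.fft.fft(buf))
    return wvd

if __name__ == "__main__":
    fs = 8000
    t = np.arange(0, 0.064, 1.0 / fs)              # 512-sample test tone
    tone = np.sin(2 * np.pi * (500 + 2000 * t) * t)
    print(wigner_ville(tone).shape)                # (512, 512)
```

for the training images, one plausible (assumed) step would be to take the log magnitude of such a block-wise distribution and downsample it to the 128 × 128 resolution mentioned above.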
this algorithm implements the concept of competition at the level of a collection of neural networks and determines the importance of the inputs that influence the recognition performance for the acoustic fingerprint. it is known that modularity and the "divide et impera" principle applied to neural networks can improve their performance [27]. the algorithm employs the model of concurrent neural networks (conn), which combines the paradigms of supervised and unsupervised learning and offers an optimal solution for detecting the acoustic fingerprints specific to uavs. the principle of competition is used in this process within a collection of neural networks that are independently trained to solve different subproblems.

the conn training was performed offline using the system in figure 1, and the training data were collected for 3 available drones, corresponding to different flying distances (0 to 25 m, 25 to 50 m, 50 to 100 m, 100 to 200 m, and 200 to 500 m). three quadcopter models were tested: a dji phantom 2 (mini class), a dji matrice 600 (medium class) and a homemade drone (medium class). the first training data set was gathered in an anechoic chamber, for the drones specified in the article, at different engine speeds and a sampling frequency of 44 khz. the second training data set was recorded in a quiet outdoor place (a real-life environment without the polyphonic soundscape typical of outdoor areas, such as the rooftop of a building in a calm or isolated location), at successive distances of 20, 60 and 120 m, for two types of behavior, hovering and approaching, with a total duration of 64 s. exact labeling of the drone sound was achieved by starting to record after the drone was activated and stopping before deactivation.

recognition is performed by identifying the neural network that provides the best response. the experiments performed demonstrated that, compared to the cases where competition is not used at all, the recognition accuracy was higher when employing the model proposed in the present solution. the system consists of neural networks with various architectures: multilayer perceptron modules were employed, along with time delay neural network and self-organizing map types. the recognition scheme consists of a collection of modules, each trained on a subproblem, and a module that selects the best answer. the training and recognition algorithms of modular/competing neural networks are based on the idea that, in addition to the hierarchical levels of organization of artificial neural networks (synapses, neurons, neuron layers and the network itself), a new level can be created by combining several neural networks. the model proposed in this article, called concurrent neural networks, introduces a neural recognition technique that is based on the idea of competition between multiple modular neural networks that work in parallel using the ni board controller [27, 28]. the number of networks used is equal to the number of classes in which the vectors are grouped, and the training is supervised. each network is designed to correctly recognize vectors in a single class, so that the best answers appear only when vectors from the class with which they were trained are presented. this model is in fact a framework that offers architectural flexibility, because the modules can be represented by different types of neural networks.
starting from the conn model proposed in this work, the concurrent self-organizing maps (csom) model has been introduced, which stands out as a technique with excellent performance when implemented on an fpga and an intel core i7 processor (3.1 ghz). the general scheme used to train competing neural networks is presented in figure 6. in this scheme, n represents the number of neural networks working in parallel, which is also equal to the number of classes in which the training vectors are grouped [29, 30]. the set of vectors x is obtained from the preprocessing of the acquired audio signals for the purpose of network training. from this set, the subsets of vectors x_j, j = 1, 2, ..., n, with which the n neural networks will be trained, are extracted. following the learning procedure, each neural network has to respond positively to a single class of vectors and to give negative responses to all other vectors. the training algorithm for the competing network is as follows: step 1. create the database containing the training vectors obtained from the preprocessing of the acoustic signal. step 2. the sets of vectors specific to each neural network are extracted from the database; if necessary, the desired outputs are set. step 3. apply the training algorithm to each neural network using the vector sets created in step 2.

recognition and classification using conn is performed in parallel, using the principle of competition, according to the diagram in figure 7. it is assumed that the neural networks were trained by the algorithm described above. when the test vector is applied, the networks generate individual responses, and the selection consists of choosing the network that generated the strongest response. the network selected by the winner-takes-all rule is declared the winner, and the index of the winning network is the index of the class in which the test vector is placed. this method of recognizing features therefore implies that the number of classes with which the competing network will work is known a priori and that there are sufficient training vectors for each class. the recognition algorithm is presented in the following steps: step 1. the test vector is created by preprocessing the acoustic signal. step 2. the test vector is transmitted in parallel to all the previously trained neural networks. step 3. the selection block returns the index of the network with the best answer; this is the index of the class in which the vector is classified. the models can be customized by using different architectures for the modules. multilayer perceptrons (mlp), time delay neural networks (tdnn) and kohonen maps (som) were used for this work, thus obtaining three different types of competing neural networks.

this section deals with the experiments performed on the problem of multiple drone detection with the custom collected dataset. the experiments are organized in the following order: (1) concurrent neural networks (conn) with the wigner-ville spectrogram class; (2) concurrent neural networks (conn) with the mfcc dictionary class; (3) concurrent neural networks (conn) with the mif (mean instantaneous frequency) class. to establish the performance values it is necessary to calculate the confusion matrix. the confusion matrix has the true labels on one dimension and the predicted labels on the other, with a row and a column for each class. the diagonal elements of the matrix represent the correctly classified results.
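a minimal sketch of the competition scheme (one network per class, winner-takes-all selection), using scikit-learn multilayer perceptrons as stand-in modules; the module type, layer sizes, feature dimension, and training settings here are assumptions, not the authors' configuration. the confusion matrix and classification report at the end correspond to the kind of performance values discussed next.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, classification_report

def train_conn(x_train, y_train, n_classes, hidden=(64,), seed=0):
    """Train one binary MLP per class: positive = its own class, negative = the rest."""
    modules = []
    for c in range(n_classes):
        net = MLPClassifier(hidden_layer_sizes=hidden, max_iter=500, random_state=seed + c)
        net.fit(x_train, (y_train == c).astype(int))
        modules.append(net)
    return modules

def conn_predict(modules, x):
    """Winner-takes-all: return the class of the module with the strongest response."""
    responses = np.column_stack([m.predict_proba(x)[:, 1] for m in modules])
    return np.argmax(responses, axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # stand-in 35-dimensional frame features for 3 classes (synthetic data, demo only)
    x = rng.normal(size=(600, 35)) + np.repeat(np.arange(3), 200)[:, None]
    y = np.repeat(np.arange(3), 200)
    nets = train_conn(x[::2], y[::2], n_classes=3)
    y_hat = conn_predict(nets, x[1::2])
    print(confusion_matrix(y[1::2], y_hat))
    print(classification_report(y[1::2], y_hat, digits=2))
```

the report printed at the end gives per-class precision, recall, and f1 together with their weighted averages, which is how the values derived from the confusion matrix are summarized below.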
the values calculated from the confusion matrix are the precision, the recall, the f1-score (the harmonic mean of precision and recall), and the overall classification accuracy. precision is the fraction of samples predicted to contain a drone that actually contain one. recall is the fraction of samples containing a drone that are correctly detected. the f-measure is the harmonic mean of precision and recall. f1 scores are calculated for each class and then averaged with weights; the weights are generated from the number of samples corresponding to each class.

the spectrograms extracted in the experiment are transformed into the logarithmic domain, and the transformed features are used as the input to the model. the network is trained for 200 epochs with a batch size of 16. this experiment resulted in a classification accuracy of 91 percent, a recall of 0.96, and a micro-average f1-score of 0.91. the confusion matrix is shown in table 3 and the classification report in table 4. for the mfcc dictionary extracted in the experiment, a mel filter bank with 128 mel filters is applied. the conn is trained for 200 epochs with a batch size of 128 samples. this experiment resulted in a classification accuracy of 87 percent, a recall of 0.95, and a micro-average f1-score of 0.86. the confusion matrix is shown in table 5 and the classification report in table 6. for the mif features, the confusion matrix is shown in table 7 and the classification report in table 8.

after observing that the conn model shows remarkable improvements in the recognition rates of acoustic fingerprints compared to the classic models, this section focuses on the recognition and identification of uav-specific acoustic signals. a training database was created using the wigner-ville spectrogram, mfcc and mif (mean instantaneous frequency) dictionaries corresponding to the acoustic signals of 6 multirotor drones. we tested six multirotor models: (1) a dji matrice 600 (medium), (2)-(4) three homemade drones (medium and large), (5) a dji phantom 4 (mini), and (6) a parrot ar drone 2 (mini). the drones were tracked outdoors on a test field between buildings, with a street with pedestrian and car/tram traffic nearby (urban conditions). the atmospheric conditions for the real-time tests were sunny weather, temperature 30-35 degrees celsius, precipitation 5%, humidity 73%, wind 5 km/h, atmospheric pressure 1013 hpa (101.3 kpa) and the presence of noise under urban conditions (source: national agency for the weather). each of these drones was tested ten times. for each iteration, the training vectors of the recognition system were extracted from the first five data sets, keeping the next five data sets for testing. in this way, 200 training sets were obtained for the preparation of the system and another two hundred for its verification. in addition, a set of experiments was performed using first a single neural network to recognize the model and then the conn. the results obtained in the real-time tests are presented in figures 8-18. for this stage only the kohonen network was tested, given the results obtained in the recognition experiments and the behavior of a single network compared to that of a conn. for the variant that uses a single som, the network was trained with the whole sequence of vectors obtained after preprocessing the selected acoustic signals.
a kohonen network with 10 × 15 nodes was trained in two stages through the self-organizing feature map (sofm) algorithm. the first stage, the organization of clusters, took place over 1000 steps and the neighborhood gradually declined to a single neuron. in the second stage, the training was performed over 10,000 steps and the neighborhood remained fixed at the minimum size. following training and calibration of the neural network with the training vectors, we obtained a set of labeled (annotated) prototypes whose structure is that of table 6. the technique applied for recognition is as follows. the frequencies identified in the test signal are preprocessed by means of a window, from which a vector of the component parts is calculated. the window moves with a step of 50 samples and a collection of vectors is obtained, whose sequence describes the evolution of the acoustic signal specific to the drones. for each vector, the position of the best-matching prototype and the minimum quantization error are kept, i.e., the label that the neural network calculates. experimentally, a maximum threshold for the quantization error was set, to eliminate frequencies that are assumed not to belong to any class. through this process, a sequence of class labels was obtained that shows how the acoustic signals specific to the drones were recognized by the system.

in table 9, the experimental results are presented as percentages of recognition and identification of the drones with som and conn. in table 10, when we refer to the "accuracy" of "conn", we refer to a different top-level architecture that: (1) takes the raw audio data and creates features for each of the three mentioned networks; (2) runs the data through each network and obtains an "answer" (a probability distribution over class predictions); (3) selects the output of the network with the highest response (highest class confidence); this architecture is explained in figure 7. the general classifier based on concurrent neural networks was tested, providing the same test framework for all 30 training files. using the maximum-win strategy, the output label was identified, with a resulting drone recognition precision of 96.3%. the time required to extract the characteristics of a 256 × 250 spectrogram image using conn is 1.26 s, while the time required to extract the characteristics of an mfcc and mif sample from audio samples is 0.5 s. the total training time required by the model for the spectrogram image data set was 18 min, while the model training time for the mfcc and mif audio samples was 2.5 min. the time required to train the combined model data set was 40 min. the trained model classifies objects in 3 s. comparing the method proposed in this article with similar supervised machine learning methods presented in the literature for drone detection using the acoustic signature, the authors of those studies report detection accuracies between 79% and 98.5%, without mentioning the detection distance of the acoustic signals generated by drones [31] [32] [33] [34] [35]. the method proposed by us has an average accuracy of almost 96.3% for detecting the sounds generated by a drone, for distances of up to 150 m for small-class drones and up to 500 m for medium- and large-class drones.
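a minimal sketch of the nearest-prototype decision with a quantization-error rejection threshold described above. the 10 × 15 map size and 35-dimensional features follow the text, while the class count in the demo, the threshold value, and the random codebook (standing in for a trained sofm) are assumptions for illustration.

```python
import numpy as np

def classify_with_prototypes(vectors, codebook, labels, q_threshold):
    """Label each feature vector by its nearest SOM prototype (sketch).

    vectors    : (n_vectors, d) windowed feature vectors (window step of 50 samples)
    codebook   : (n_nodes, d) trained prototype weights, e.g. a flattened 10 x 15 map
    labels     : (n_nodes,) class label assigned to each prototype after calibration
    q_threshold: maximum admissible quantization error; larger errors are rejected
    Returns one label per vector, with -1 marking vectors assigned to no class.
    """
    out = np.full(len(vectors), -1, dtype=int)
    for i, v in enumerate(vectors):
        dist = np.linalg.norm(codebook - v, axis=1)   # distance to every prototype
        best = np.argmin(dist)                        # best-matching unit
        if dist[best] <= q_threshold:                 # reject out-of-class vectors
            out[i] = labels[best]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    codebook = rng.normal(size=(10 * 15, 35))         # 10 x 15 map, 35-d features
    labels = rng.integers(0, 6, size=10 * 15)         # 6 drone classes
    windows = rng.normal(size=(40, 35))
    print(classify_with_prototypes(windows, codebook, labels, q_threshold=7.0))
```

the resulting label sequence over consecutive windows is what the system aggregates into a recognition decision for the tracked drone.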
our tests were performed on a test range with a maximum length of 380 m, but the results shown in figures 8-18 indicate that the detection distance of the acoustic signals from the drones reaches approximately 500 m for different classes of drones. the proposed conn model classifies objects in about 4 s, this time being sufficient for warning because the network of microphone arrays is distributed in width and depth, thus creating a safety zone.

this paper investigates the effectiveness of machine learning techniques in addressing the problem of detecting unauthorized uav flights, in the context of protecting critical areas. for extracting the acoustic fingerprint of a uav, a wigner-ville time-frequency analysis is adopted by the warning system to recognize specific acoustic signals. dictionaries of mfcc and mif (mean instantaneous frequency) coefficients specific to each type of drone have also been added to this process to improve the recognition precision of the conn. the contributions of the proposed solution are the following:
- development of a spiral microphone array, combining microphones in the audible and ultrasonic fields, set in an interchangeable configuration with multichannel adaptive weights.
- introduction of the possibility of detecting the low-intensity acoustic signals specific to multirotor mini drones at a distance of ~120 m. the recognition scheme consists of a collection of models trained on a subproblem and a module that selects the best answer.
- tests have shown that large multirotors (diameter 1.5 m) can be detected at a distance of ~500 m, and medium multirotors (diameter less than 1 m) can be detected at a distance of at least 380 m.
- the possibility of integrating the microphone array into a network structure (scalability), which can be controlled by a single crio system by integrating several acquisition boards. the placement of the acoustic sensors within the network can be done linearly and in depth, so that a safety zone can be created around the perimeter restricted for drone flight.

from the results obtained in the experiments performed, engineered features employed with the conn proved to have better performance. conn architectures resulted in better generalization performance and faster convergence for spectro-temporal data. the wigner-ville spectrograms show improved performance compared with other spectrogram variants (for example, the short-time fourier transform, stft). the results obtained with both datasets lead to the conclusion that multiple drone detection employing audio analysis is possible. in future work, as presented in [36], a video camera for drone detection and recognition will be integrated into the microphone array. the two modules, acoustic and video, will work in parallel and their results will be combined to increase the recognition and classification capacity for drones. a radio-frequency (rf) detection module will also be integrated into the final architecture, and the results will be displayed in a command and control system. part of this research has previously been applied to developing a method for the anonymous collection of traveler flows in a public transport system and resulted in a patent application: ro a/00493, "method and system for anonymous collection of information regarding position and mobility in public transportation, employing bluetooth and artificial intelligence," in 2019.
results of this research culminated in a patent application: ro a/00331, "system and method for detecting active aircraft (drone) vehicle by deep learning analysis of sound and capture images," in 2020.

references:
advances in intelligent systems and computing
investigating cost-effective rf-based detection of drones
drone detection systems. u.s. patent no. us 2017/0092.138a1; application no. us 15/282,216, publication of us 20170092138a1
drone detection and classification methods and apparatus
based small drone detection in augmented datasets for 3d ladar
detection, tracking, and interdiction for amateur drones
multi-sensor field trials for detection and tracking of multiple small unmanned aerial vehicles flying at low altitude
ghz fmcw drone detection system
detection and tracking of drones using advanced acoustic cameras
digital television based passive bistatic radar system for drone detection
using deep networks for drone detection
yolo9000: better, faster, stronger
acoustic detection of low flying aircraft
near-infrared high-resolution real-time omnidirectional imaging platform for drone detection
machine learning inspired sound-based amateur drone detection for public safety applications
micro-uav detection and classification from rf fingerprints using machine learning techniques
mimo ofdm radar system for drone detection
low-complexity portable passive drone surveillance via sdr-based signal processing
drones: detect, identify, intercept and hijack
a new feature vector using selected bi-spectra for signal classification with application in radar target recognition
is&t international symposium on electronic imaging 2017, imaging and multimedia analytics in a web and mobile world. electron. imaging
uav detection system with multiple acoustic nodes using machine learning models
adaptive noise cancellation using labview
empirical study of drone sound detection in real-life environment with deep neural networks
convolutional neural networks for analyzing unmanned aerial vehicles sound
efficient classification for multiclass problems using modular neural networks
an overview of ensemble methods for binary classifiers in multi-class problems: experimental study on one-vs-one and one-vs-all schemes
application of the wavelet transform in machine-learning
malicious uav detection using integrated audio and visual features for public safety applications
hidden markov model-based drone sound recognition using mfcc technique in practical noisy environments
svm-based drone sound recognition using the combination of hla and wpt techniques in practical noisy environment
drone sound detection by correlation
drone detection based on an audio-assisted camera array
uav detection employing sensor data fusion and artificial intelligence

funding: this research received no external funding. the authors declare no conflict of interest.
key: cord-318716-a525bu7w authors: van den oord, steven; vanlaer, niels; marynissen, hugo; brugghemans, bert; van roey, jan; albers, sascha; cambré, bart; kenis, patrick title: network of networks: preliminary lessons from the antwerp port authority on crisis management and network governance to deal with the covid-19 pandemic date: 2020-06-02 journal: public adm rev doi: 10.1111/puar.13256 sha: doc_id: 318716 cord_uid: a525bu7w

in this article we describe and illustrate what we call a network of networks perspective and map the development of a lead network of the antwerp port authority that governs various organizations and networks in the port community before and during the covid-19 pandemic. we find setting a collective focus and selective integration to be crucial in the creation and reproduction of an effective system to adequately deal with a wicked problem like the covid-19 pandemic. we use the findings on crisis management and network governance to engage practitioners and public policy planners to revisit the current design and governance of organizational networks within organizational fields that have been hit by the covid-19 pandemic. in line with the recently introduced exogenous perspective on whole networks, the notion of network of networks is further elaborated in relation to the scope and nature of the problem faced by that organizational field.

by changing the structure and network governance mode from a lead/network administration organization into a lead network, the antwerp port authority was able to install institutions and structures of authority and collaboration to deal with the scale and complexity of the covid-19 pandemic. a network is a system of three or more organizations (incepted either voluntarily or by mandate) that work together to achieve a purpose that none of the participating organizations can achieve independently by themselves (provan, fish, and sydow 2007; provan and kenis 2008). they are distinct entities with unique identities that require examination as a whole (provan, fish, and sydow 2007; provan and kenis 2008; raab and kenis 2009). despite the prominence of networks in practice, their popularity as a research subject, and their relevance for society (raab and kenis 2009), we still tend to study individual organizations to understand the collective behavior of networks. studies on networks from multiple disciplines predominantly focus on organizations and their relations (ego-networks), potentially neglecting, or even misinterpreting, the relationship between the details of a network and the larger view of the whole (bar-yam 2004; provan, fish, and sydow 2007). in addition, we sometimes forget that studying networks from a whole network perspective is necessary, but not sufficient alone, to understand 'such issues as how networks evolve, how they are governed, and, ultimately, how collective outcomes might be generated' (provan, fish, and sydow 2007, p.480). recently, for instance, it has been argued by nowell, hano, and yang (2019, p.2) that an external (outside-in) perspective should accompany our dominant internal focus on networks to explain 'the forces that may shape and constrain action in network settings'. inspired by this so-called network of networks perspective, this article shows how such a perspective allows for a better grasping of (and hence, dealing with) wicked problems (cartwright 1987), a type of problem that an organizational field such as a port or city is frequently encountering.
the covid-19 pandemic can be understood as a wicked problem because there are no quick fixes and simple solutions to the problem, every attempt to solve the issue is a "one shot operation," and the nature of the problem is not understood until after the formulation of a solution (conklin 2005). wicked problems are 'defined [1] by a focus, rather than a boundary' (cartwright 1987, p.93), and successfully managing such problems therefore requires a reassessment of how a group of organizations and networks temporally make sense of and structure a wicked problem. the covid-19 pandemic has therefore directed our attention to a pivotal point in network governance: the connection between complexity and scale (bar-yam 2004). it has led us to acknowledge that an appreciation for the scope and detailed nature of a wicked problem is essential, while simultaneously pairing it with a network solution that matches it in scale and complexity (bar-yam 2004). to understand how to deal with the covid-19 pandemic, then, one needs to comprehend the relation between a larger, complex system and the scope and nature of the problem. we call this larger, complex system an organizational field (kenis and knoke 2002).

in the classics of the public administration literature, the relationship between an organization and its environment has been studied from a variety of perspectives, focusing, for example, on selection or adaptation to institutional pressures and resource dependence (aldrich and pfeffer 1976; oliver 1991). an emphasis on environments is therefore not new. interestingly, however, network scholars in public administration have only recently intensified their efforts to use concepts of the environment as an explanatory factor of the creation, reproduction, or dissolution of networks (raab, mannak, and cambré 2015; lee, rethemeyer, and park 2018; nowell, hano, and yang 2019). building on dimaggio and powell's (1983) understanding of an organizational field, kenis and knoke (2002, p.275) link interorganizational relationships and mechanisms such as tie formation and dissolution to define an "organizational field-net" as 'the configuration of interorganizational relationships among all the organizations that are members of an organizational field.' the key issue here is at which scale and in what detail we should consider examining intersections of organizations and networks embedded in a certain environment, since environmental dynamics are crucial in our understanding of the creation and reproduction of both the systems within as well as the larger system as a whole (cf. mayntz 1993). however, in order to define and examine such larger, complex systems like organizational fields, we need to understand "why" organizations and networks come together, cooperate, and consequently create and reproduce such a larger, complex system (kenis and knoke 2002; provan, fish, and sydow 2007; nowell, hano, and yang 2019). we therefore propose that instead of focusing on an organizational network as the unit of analysis (provan, fish, and sydow 2007; provan and kenis 2008), a shift to a collective of networks that is embedded in an organizational field is instructive (cf. nowell, hano, and yang 2019).
this means our unit of observation shifts from one network as a separate entity with a unique identity to a collective of networks embedded in an organizational field. building upon maier's (1998) system of systems approach and using nowell, hano, and yang's (2019) notion of network of networks, we accordingly define a network of networks as an assemblage of networks, which individually may be regarded as subsystems that are operationally and managerially autonomous, but which are part of a larger, complex organizational field through many types of connections and flows (maier 1998; provan, fish, and sydow 2007; nowell, hano, and yang 2019).

--- figure 1. around here. ---

in this article, we adopt a set-theoretic approach to network of networks, in line with the long-standing recommendation by christopher alexander (2015). in such an approach, a network of networks is 'best understood as clusters of interconnected structures and practices' of various networks being distinct entities and having unique identities (fiss 2007, p.1180; provan and kenis 2008; raab et al. 2013). this means a clean break from the predominant linear paradigm and instead adopting a systemic view in which we assume that 'patterns of attributes will exhibit different features and lead to different outcomes depending on how they are arranged' (fiss 2007, p.1181; provan et al. 2007; provan and kenis 2008). moreover, we note that assumptions on the structure and governance of networks are often used that are suspect at best for dealing with the complexity that networks bring (rethemeyer and hatmaker 2008; raab, lemaire, and provan 2013). further, most network studies only employ an endogenous perspective on networks, which in some cases is bound to the performance of an individual organization, a network cluster, or a certain organizational domain (i.e., health or social care), despite the fact that networks by nature are multilevel, multidisciplinary and interdependent (provan and milward 2001; provan, fish, and sydow 2007; raab, mannak, and cambré 2015).
in particular, scholars often tend to ignore the specific nature of the problems that networks face in their environments (raab and milward 2003; mcchrystal, collins, silverman, and fussell 2015). this is an issue, because not fully understanding the interdependence of a collection of smaller systems, nor understanding what the larger, complex system is up against, makes dealing with a wicked problem like the covid-19 pandemic very difficult. as part of a larger applied research project, a collaboration between the fire and emergency services in the port of antwerp (antwerp fire service, antwerp port authority, police, and municipality, among others) and antwerp management school (van den oord, vriesacker, brugghemans, albers, and marynissen 2019), we focus in this article on how the port authority of the port of antwerp (belgium) dealt with the covid-19 pandemic. in particular, we examine the network structure and the embeddedness of individual actors of both the crisis management team and the leadership team of the antwerp port authority (apa) to describe how this network managed the crisis and governed the port community, composed of various organizations and networks, before and during the covid-19 pandemic. by providing descriptive evidence concerning the development of the overall network structure and the embeddedness of individual actors before and during the covid-19 pandemic, we aim to ground the notion of network of networks and hope to engage practitioners and public administration scholars to rethink the current design and governance of organizational networks within their respective organizational fields that have been hit by the covid-19 pandemic.

for this article, we narrowed the scope to two levels of analysis to describe the interdependence between crisis management and network governance on the operational and policy level of the antwerp port authority [2]. as alter and hage (1993) suggested, we need to make a minimum distinction between the policy level and the administrative level, because coordination of joint efforts tends to transcend organizational hierarchical levels as well as involve multiple different functional units like divisions or departments. the data allowed us to differentiate between these two levels, having access to the crisis management team (operations, or in alter and hage's terms, administration) and the leadership team (policy). in order to understand how apa attempted to manage the crisis throughout the covid-19 pandemic, we conduct a network analysis based on three sources of data. for our primary data source we draw upon the records and minutes of three types of meetings: the crisis management team meetings (cmt), the nautical partners meetings (np), and the leadership team meetings (lt). the data covers a period of 12 weeks (20/01-12/04), including 53 meetings mentioning 73 unique actors involved in the port community. data records are based on 26 crisis management team meetings with a total estimated duration of 20 hrs (66 logbook pages), 16 leadership team meetings with a total estimated duration of 10 hrs (19 logbook pages), and 11 nautical partners meetings with a total estimated duration of 12 hrs. in addition, we consult data from sciensano, the belgian institute for health responsible for the epidemiological follow-up of the covid-19 epidemic in collaboration with its partners and other healthcare actors. these data provide insight into the dynamics of the pandemic. the third source of data were our co-authors two, four and five, who managed the pandemic in the port of antwerp. the second author attended all the crisis management team meetings and participated in various leadership team meetings, while the fourth author was present in some of the task force meetings (not examined in this article). by collaborating with these practitioners, we were able to go back and forth to the data during these periods, allowing for the interpretation of relationships between apa and actors in the port community as well as the building of rich narratives of the issues discussed in these meetings. in table 1, we have portrayed descriptive measures of the data on the meetings. for data analysis, meetings were grouped into four phases of the covid-19 crisis, for each of which a network structure was created: (1) pre-crisis network (20/01-01/03, 6 weeks), (2) pre-lockdown network (02/03-15/03, 2 weeks), (3) lockdown network (16/03-29/03, 2 weeks), and (4) crisis network (30/03-12/04, 2 weeks).

--- table 1. around here. ---
the reason why we opted for six rather than two weeks in the first period is to illustrate what we observed as a "slow start of the covid-19 pandemic that increased exponentially" (sciensano 2020). this aligns with the meta-data of the statistical reports of sciensano, which started issuing data only from the 1st of march, providing a daily report from march 14th onwards [3]. a total of four network plots (and one overview plot) are presented to provide descriptive evidence concerning the overall network structures and the embeddedness of individual actors in the four phases. for each phase, we present a one-mode matrix based on the actor list, in which we weight ties between two actors based on the frequency of their mentioning in the records of the logbook and/or minutes of the meetings. three rounds of coding were executed in an iterative manner in which we went back and forth between the data and the codes of the various issues and actors involved in each crisis management and leadership team meeting reported in the data. in appendix c we have provided an excerpt of the data cleaning and coding process. we aimed to minimize bias by having the first and second author agree on codes and accordingly discuss the application of codes with the third author to agree on the content of issues and the involvement of actors reported in the various meetings. simultaneously with the coding process, an actor list of apa departments (differentiating between operations and policy, n=18) as well as actors of the port community (n=55) was developed, indexed, and pseudonymized. the coding process was performed in microsoft excel. to calculate the centralization and density scores reported in table 1 we used ucinet 6 (borgatti, everett, and freeman 2002). to develop the network plots, we used the node- and centrality layout based on degree centrality analysis in the network visualization tool visone 2.7.3 (http://visone.info/; brandes and wagner 2004).

the remainder of the article is organized in three sections. in the first results section we present the findings on the structure and governance of the network of networks. we display an overall overview in figure 2, as well as more detailed views for each period (figures 3-6), to describe the development of the network structure and the embeddedness of individual actors of both the crisis management team and the leadership team of the antwerp port authority (apa) during the covid-19 pandemic. in the second results section, we elaborate on the findings of how this network of networks managed the crisis before and during the covid-19 pandemic. we close with a discussion and conclusion section in which we present recommendations for future research and practice. due to space limitations, readers can find more detail on the broader research project online. in appendix a, we have provided a background on the port of antwerp, a description of the antwerp port authority, and a more detailed account of the two levels of analysis.
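a minimal sketch of how the weighted one-mode co-mention network and the reported density and degree measures could be computed. networkx is used here only as an open-source stand-in for ucinet and visone, and the actor names and meeting records below are made up for illustration; they are not taken from the paper's pseudonymized actor list.

```python
import itertools
import networkx as nx

# toy stand-in for the coded meeting records: each meeting lists the actors mentioned
meetings = [
    {"apa_ops_1", "apa_policy_2", "pilots"},
    {"apa_ops_1", "tugboat_company", "pilots"},
    {"apa_policy_2", "municipality", "apa_ops_1"},
]

# weighted one-mode network: tie weight = number of meetings in which two actors co-occur
g = nx.Graph()
for actors in meetings:
    for a, b in itertools.combinations(sorted(actors), 2):
        if g.has_edge(a, b):
            g[a][b]["weight"] += 1
        else:
            g.add_edge(a, b, weight=1)

density = nx.density(g)                      # density per phase, as in table 1
deg_cent = nx.degree_centrality(g)           # basis for the centrality layout of the plots

# freeman degree centralization of the whole network (using normalized degree centrality)
n = g.number_of_nodes()
centralization = sum(max(deg_cent.values()) - c for c in deg_cent.values()) / (n - 2)

print(f"density={density:.2f}, centralization={centralization:.2f}")
```

building one such graph per phase (pre-crisis, pre-lockdown, lockdown, crisis) reproduces the kind of phase-by-phase comparison of structures reported in the results.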
however, in the course of merely four weeks (march 1 -29), the number of actors doubled, the number of initial ties multiplied more than five times, and together with its partners apa had to deal with 195 issues in just two weeks at its peak. in the last phase the situation stabilized (march 30 -april 4). in those periods each network structure shows various links between apa actors (purple and red nodes) and external actors (yellow nodes) from which apa derives its legitimacy (human and provan 2000) . the goal of apa was to avoid a legitimacy crisis in which they could potentially lose its formal authority of the port of antwerp (human and provan 2000) . as such, they aimed to prevent at all cost to close the port due to the covid-19 pandemic. the apa followed what human and provan (2000) term a dual legitimacy-building strategy in which personnel were provided resources and support to arrange institutions and structures of authority and collaboration such as a crisis management team on the level of operations, and a task force and nautical partners meeting on the policy level. the leadership team compared to the crisis management team was more externally focused on building outside-in legitimacy, through solidifying relationships with important stakeholders from the port community. the antwerp port authority can be considered as successful in managing the pandemic in the sense that the port kept fully operational throughout the four phases on all terminals. in this article is protected by copyright. all rights reserved. when comparing the four network structures to each other, we find that after the first six weeks a core group of actors from apa assembled. together these seven actors-purple and red nodes displayed in the center of the figures, formed what we coin a lead network (cf. provan and kenis 2008) . in analogy to provan and kenis' lead organization-a mode of governance involving a single organization acting as a highly centralized network broker to govern all major network-level activities and key decisions, a lead network from a network of networks point of view, represents a single network composed of multiple functional units from various organizations and networks that differed in (lateral) position, categories of relevant resources, knowledge, or experience, and in proportion of socially valued tangible and intangible assets or resources within the organizational field of the port antwerp (cf. harrison and klein 2007) . the comparison of structures with the issues being dealt before and during the covid-19 pandemic (table 1 ), shows that in the pre-crisis network ( figure 2 ) the departments and divisions of apa acted under business as usual. before the pandemic started, the mode of governance of apa was best described as a brokered governance which both governs the sustainment of the port community and its activities as well as participates as a broker in major port activities and key decisions (see appendix a). this corresponds with elements from a lead organization (level of operations) as well as a network administrative organization (policy level) as governance modes (cf. provan and kenis 2008). when we examine the development of apa evolving from a -pre-lock-down network‖, to a -lock-down network‖, to a -crisis network‖, we observe that the governance of the network developed into a lead network that in its core was composed of apa actors (cf. nowell and steelman, velez, and yang 2018). 
over the course of the pandemic, its structure evolved from a state of loosely coupled links that were targeted and appropriate (figures 3-4) , towards a state of this article is protected by copyright. all rights reserved. tightly coupled links that were stronger and more intense based on the frequency of interactions . note that although, the -crisis network‖ structure in figure 6 is highly similar in number of actors and amount of ties compared with the -lock down network‖ in figure 5 , the number of issues to be resolved by apa in collaboration with others dropped significantly from 195 to 97 in two weeks' time. once the port of antwerp entered the phase of lock-down and subsequently crisis, the functional units of apa took the lead, enacted by multiple brokerage roles allowing them pooling information and resources and working together with port community actors to guarantee the operability of the port as well as safeguard that the main sea gateway remained open for shipping. we found incidence for five types of brokerage (gould and roberto 1989 ; see appendix b) with the lead apa network selectively integrating various overlapping subgroups both within, as well as in a later stage, between functional units of various organizations and networks. the strategic orientation in brokering followed by the lead network was to collaborate in order to achieve the field-level goal: keeping the port open (cf. soda, tortoriello, and iorio 2018). during the pandemic this was exemplified by three distinct brokering behaviors: separating, mediating, and joining (stratification of) organizations and networks (grosser, obstfeld, labianca, and borgatti 2019; gulati, puranam, and tushman 2012) . the network analysis also showed that another network was created by the lead apa network to safeguard, monitor, and control the sea gate to the port during -the lock-down‖ period ( figure 5 ). in figure 5 , this network is difficult to isolate due to the density and multiplexity of the network structure (table 1) , however, in figure 6 the network is more evident in the left top corner. the inception of the network can be derived from the nautical partners meeting initiated this article is protected by copyright. all rights reserved. by apa on march 12 th . the inception of this network as an institution and structure of authority and collaboration is interesting because of several reasons. the network was highly selective in its member base, representing a limited number of actors responsible for the nautical operations in relation to the flemish and dutch ports of the scheldt estuary, including the port authorities, tugboat companies, and pilots. in line with the shared-participant network governance mode (provan and kenis, 2008) , this network of a small group of actors aligned together around one common purpose: keeping the sea gate to the ports in the scheldt river open at all costs, despite the fact that these actors are historically in competition with each other. the priority of keeping each port in the scheldt estuary open during the pandemic likely explains why they were willing to redistribute operational resources among each other as long as this safeguarded the attainment of not closing down their port. although the network was originally incepted as a temporary information diffusion network, its function altered over the course from sharing information, to problem solving, to building (inter)national capacity to address future community needs which might arise. 
the presidency of the meeting was handed over to the transnational nautical authority over the scheldt estuary as of april 1st, in order to be consistent with its formal authority towards external parties (in this case, shipping) and to further enhance consensus and power symmetry between the actors. by stepping down as chair, the lead apa network safeguarded that competitors remained working together.

in table 2, we provide an overview of what apa did in terms of crisis management, differentiating between the level of operations and policy. when we look at the issues addressed in the apa meetings, we discovered a shift in attention from covid-19 as a public health issue towards the effects of this pandemic on the economy and society. further, the type of issues addressed in the various meetings over the four analyzed periods suggests that covid-19 as a wicked problem was mostly perceived as a problem of "information provision," "decision-making," and to a lesser degree "sense-making" of the current situation apa was in (see table 1 for a summary and table 2 for details). when we retrospectively examined how apa managed the crisis in the port of antwerp throughout the covid-19 pandemic, we found several interesting matters that highlighted the idiosyncrasies of this information problem. both on the level of operations as well as policy, apa acquired, distributed, interpreted, and integrated information (flores, zheng, rau, and thomas 2012). this suggests that covid-19 was mostly perceived as an information problem because both a lack and an abundance of information led to not fully understanding the nature of the problem, which made covid-19 wicked. information was transferred through various means of communication. at some point, apa even organized webinars to assure national and international partners of the operational readiness and continuity of the port. however, in most cases feedback and updates were exchanged within apa and with actors from the port community. apa made sure that information was available to those who needed to execute particular tasks or who coordinated crisis management and communication (puranam, alexy, and reitzig 2014).
based on being informed quickly and accurately, apa was able to take the lead and act proactively. as a response to numerous inquiries on dealing with inland navigation barges within the covid-19 context, a procedure was drafted by apa, shared for consent with the other ports in the scheldt estuary and with the authorities responsible for inland navigation on march 21st, and released on april 8th after final verification with the inland navigation representative bodies. the extensive, but delaying, consent seeking led to a unified approach towards a highly scattered subgroup, such as the inland navigation industry, which fully embraced it. another example was how they prepared for and dealt with the lock down. belgium went into lock down from the 18th of march, but on the 16th apa was already defining what essential functions of the port needed to remain operational for traveling and transportation. apa's high performance can be (at least) partially attributed to them being principal-driven rather than rule-driven. this likely explains why we found that from the 23rd of march the crisis was contained and consequently from the 2nd of april apa decided to reduce the frequency of meetings. interestingly, however, when we took into account what actual solutions apa had devised to address the pandemic, we found they made a public poster, initiated a digital campaign, launched two websites, and arranged a call center providing a hotline to help personnel. our findings inform further research on networks of networks in public administration in three ways. although this article only examined the notion of network of networks from an egocentric point of view (apa within the port of antwerp community), we gained a first glimpse of the scale and complexity that was involved with the covid-19 pandemic. future research could in particular build on and extend this exogenous network of networks perspective, focusing on a collection of multiple networks that in some way are interdependent within an organizational field, to explain why and how they might come together to potentially create a larger, more complex system like a port or city. based on the preliminary evidence presented here and building forth on the work of others, we propose two governance mechanisms that can be crucial in these explanations: first is how a network of networks provides and motivates a collective focus by an organizational field on the problem being faced (cf. kenis). third, the findings shed some preliminary evidence for addressing anticipatory and mitigative actions among a network of high reliability organizations, i.e. fire and emergency services, police, and municipality (weick and sutcliffe 2007), and networks, i.e., the lead apa network and the network of nautical partners (berthod, grothe-hammer, müller-seitz, raab, and sydow 2016). the concept of high reliability organizations (hro) gives directions for anticipating and containing incidents within a single organization, and focuses on maintaining a high degree of operational safety that can be achieved by 'a free flow of information at all times' (rochlin, 1999, p. 1554). this research helps us to understand the response to crisis in a very specific case and context.
nevertheless, several preliminary findings may be generalizable to other organizational fields such as (air)ports, cities, safety regions, health- and social care systems or innovation regions like the brainport region. for instance, one important aspect we found to be crucial was the consistency of communication and the selective integration of organizations and networks with adequate monitoring and control, while avoiding the imposition of strong constraints that limit cooperation or minimize the independence of various subsystems. however, in some contexts like safety regions this may be at odds with common practices in crisis management among public organizations that are dominated by a strong command and control approach (groenendaal and helsloot, 2016; moynihan 2009). moreover, as we are increasingly not only dealing with one specific organization, but with multiple organizational networks that are involved in "taming" a wicked problem, the findings suggest that network managers (brokers) and public policy planners (designers) need to think together about how a network of a collection of organizational networks can create, selectively integrate, and reproduce an effective complex, larger system that offers more adequate functionality and performance to match the scope and detailed nature of a problem that faces an organizational field. future research needs to determine which configuration of structure and governance of networks of networks consistently achieves what field-level outcomes given the context of an organizational field. when limited diversity is present among various organizational fields, we can then start with revisiting the preliminary theorems introduced by keith provan and colleagues. this calls for further investigating various wicked problems as co-evolutionary patterns of interaction between networks and organizations as separate from, yet part of, an environment external to these networks and organizations themselves (alter 1990). to stimulate fresh thinking in practice and spur on empirical research on networks of networks, our viewpoint is that the key to dealing with a wicked problem is to structure a system in such a way that it provides appropriate incentives for collective focus and selective integration with adequate monitoring and control, while avoiding the imposition of strong constraints that might limit cooperation or minimize the operational and managerial independence of the various subsystems that make up this larger, complex system. in this article, we reported on how the antwerp port authority (apa) dealt with the covid-19 pandemic by examining the network structure and the embeddedness of individual actors of both the crisis management team and the leadership team. we drew upon the records and minutes of three types of meetings: the crisis management team meetings (cmt), the nautical partners meetings (nm), and the leadership team meetings (lt). the data covered a period of 12 weeks (20/01-12/04) including 53 meetings mentioning 73 unique actors involved. the network analysis revealed how the structure of the lead apa network developed during various phases of the crisis. we found various indications of interdependence and emergence between apa as a lead network in a network of networks within the port community. in addition, the results show how the lead apa network governed organizations and networks in the port community.
practitioners and scholars should be tentative in generalizing the preliminary findings presented here, because the data only allowed us to employ an egocentric network perspective based on the apa lead network. by providing descriptive evidence concerning the development of the structure, governance, and crisis management by the apa lead network before and during the covid-19 pandemic, we hope to engage practitioners and network scholars to rethink the current design and governance of organizational networks within organizational fields that have been hit by the covid-19 pandemic. it would be very promising for policy and practice to be able in the nearby future to identify which factors of a wicked problem facing an organizational field determine what combination of structure and governance arrangement we need to employ when, and why.
[1] with defining a focus, we mean a temporally unfolding and contextualized process of input regulation and interaction articulation in which a "wicked problem" is scanned, given meaning to, and structured by decomposing it into a set of tasks that can be divided and allocated (faraj and xiao 2006; daft and weick 1984; puranam, alexy, and reitzig 2014). note that this process of defining or making sense is temporal and subjective in nature; a key challenge is organizing solutions towards problems whose scope is not fully known and whose detailed nature is not fully understood, which requires that we make sense of how we understand problems (weick 1995). [2] in future research, we aim to broaden this scope by expanding the periods as well as by triangulating the egocentric perspective from the port authority with other perspectives from actors of the port community. [3] sciensano is the belgian institute for health responsible for the epidemiological follow-up of the covid-19 epidemic in collaboration with its partners and other healthcare actors. data can be accessed here: https://www.sciensano.be/en [4] port of houston, incident report, march 19th 2020; retrieved from: https://porthouston.com/wpcontent/uploads/covid19_2020_03_18_bct_bpt_incident_report.pdf
tables. table 1. types of data and descriptive measures of four networks (full table not reproduced here). by comparing the semilattice structure (a) to a tree structure (b), it becomes clear that 'one is wholly contained in the other or else they are wholly disjoint' (alexander 2015, p. 6-7).
figure legend: centrality layout based on node value (degree std., link strength by frequency value); tie colour scales with frequency (light grey 1-3, dark grey 4-6, black 7-9, larger black 10-12 and 13+); red nodes refer to leadership of apa, purple nodes to operations of apa, yellow nodes to partners of poa.
figure legend: centrality layout based on node value (degree std., link strength uniform); 28 actors with 89 ties in total are displayed; ties with a frequency of 1-3 are displayed; red nodes refer to leadership of apa, purple nodes to operations of apa, yellow nodes to partners of poa.
figure legend: centrality layout based on node value (degree std., link strength uniform); 32 actors with 146 ties in total are displayed; ties with a frequency of 1-3 are displayed in light grey and 4-6 in dark grey; red nodes refer to leadership of paln, purple nodes to operations of paln, yellow nodes to partners of poa.
figure legend: centrality layout based on node value (degree std., link strength uniform); 57 actors with 482 ties in total are displayed; tie colour scales with frequency (light grey 1-3, dark grey 4-6, black 7-9, larger black 10-12 and 13+); red nodes refer to leadership of apa, purple nodes to operations of apa, yellow nodes to partners of poa.
figure 5. crisis network. centrality layout based on node value (degree std., link strength uniform); 53 actors with 416 ties in total are displayed; ties with a frequency of 1-3 are displayed in light grey, 4-6 in dark grey and 7-9 in black; red nodes refer to leadership of apa, purple nodes to operations of apa, yellow nodes to partners of poa.
since 2018, the antwerp fire service (brandweer zone antwerpen or bza) and antwerp management school (ams) have been joining forces in an applied research project to develop a future vision on the organization of emergency services in antwerp (hence the involvement of the fourth and fifth author, being the fire chief and company commander). in this project, we examine how to organize larger, complex systems for collective, field-level behavior; how organizations and networks in an organizational field can be selectively integrated, so that those networks that need to work together do so, while others that do not need to work together do not; and what institutions and structures of authority and collaboration we need to provide network managers (system designers) with the means to create, collectively define, integrate, and dissolve a network solution to organizational field challenges such as public safety and health. with 235 million tons of maritime freight volume (in 2019), the port of antwerp is the second largest port in europe. stretching over a surface of 120 km² and centrally positioned some 80 km inland, it has 60% of europe's production and purchasing power within a 500 km radius. under the authority of the federal and flemish government, the port area stretches over two provinces with the scheldt river in between. apart from the port of antwerp, the scheldt estuary houses another three seaports, of different size and government structure, and their ancillary services. nearly 62,000 persons work in the port community, which also contains the second largest petrochemical cluster in the world. this positions the port of antwerp as one of the main gateways to europe. not only is the port critical to the logistics required to support governments in their attempts to reduce the effects of the covid-19 pandemic, it is also a vital infrastructure that allows for continued economic activity in europe in general and belgium in particular.
the antwerp port authority (apa) is a public limited company under public law, fully owned by the municipality of antwerp. apa plays a key role in the port's day-to-day operations. on the level of operations, we found three actors that played a central role in the network of networks. these actors were the department of harbor, safety and security (op/hss), the vessel traffic department (op/vt), and the safety and health department (ad/sh). these three departments acted as the executives of the port by controlling and monitoring the port community. based on participant observations by the second author, we also had access to data on the operations director (op), who has been heavily involved in managing these departments. on the policy level, four actors were found to be involved with the port governance. the director of operations (op) oversees the nautical operations department, which is responsible for the fleet and the above-mentioned shipping traffic management and harbor safety and security. asset management and port projects respectively deal with the development and management of the dry and maritime structure and with technical projects that have an impact on port infrastructure. sustainable and balanced market for goods and services. this agency was more centrally involved with apa as it is in charge of the quality control and distribution of medical equipment and personal protective equipment (ppe). the second category involves the port stakeholders. in general, we identified five types of stakeholders in the port: industry, shipping, services, road transportation, and rail transportation. based on table 1 and figure 1, we found that the industry stakeholders and the shipping stakeholders were mentioned more often in the covid-19 crisis meetings. note that next to those two stakeholders, there are also the inland navigation owners, operators and representatives, who are more scattered and are often smaller businesses. in belgium, this represents merely 1,150 vessels with a total capacity of 1.8 million tons and around 1,800 persons, the majority self-employed. apa alone already handles about 99.3 million tons of goods for over 52,000 inland navigation vessels annually (2019 figures), which emphasizes the international context of this stakeholder segment. the industry stakeholders comprise all companies that are based in the port (approximately 900 companies). these include terminal operators (containers, liquid, dry bulk, etc.) and chemical production companies, often subsidiaries of multinational companies within the port area. they have a commercial relationship with apa, being concessionaires with apa as the landlord. the shipping stakeholders, on the other hand, are those that own, manage, operate or represent the shipping lines. this includes shipping companies but also agencies and representative bodies. their commercial relationship with apa takes the form of port dues. the final category represents the nautical service providers that act as the pilots for the different sections of the river scheldt, the dock pilots, helmsmen, boatmen and other supportive services that ensure safe navigation from sea to port and vice versa. these service providers closely collaborate with the operations department of apa.
we provide for each brokerage type (structural position) of gould and fernandez (1989) a short qualitative account to illustrate the veracity of this dynamic process of brokerage conducted by the apa lead network. the strategic orientation in brokering followed by the lead network was to collaborate in order to achieve the field-level goal: keeping the port open (cf. soda et al., 2018). during the pandemic this was exemplified by three distinct brokering behaviors: separating, mediating, and joining organizations and networks (cf. grosser et al., 2019). (2) itinerant broker between subgroups within the port community: mediation. in this role, functional units of the lead apa network acted as a mediator between two subgroups of the community. in one example they mediated a concern regarding a parking lot dedicated to trucks. whereas the parking was closed in agreement with the port police force on march 20th, it was reopened with additional enforcement measures on march 26th after extensive dialogue between representative bodies of the road transportation industry and the police force. this could not have been the case without mediation of the lead apa network between conflicting subgroups of the port community. (3) gatekeeper of the port community: mediating. in this brokerage role the lead apa network determined, in close collaboration with the federal agency saniport (responsible for sanitary control of international traffic), which ships were allowed to enter. they acted as a go-between controlling access from the sea to the land. apa preventively assigned a lay-by berth as quarantine area for suspected ships, and the apa harbour master office played a key role in authorizing suspected vessels to berth and under what conditions, informing the right parties and providing conditions to leave port after being cleared following an infection. in its gatekeeper role the lead apa network on several occasions needed to switch to mediation when disputes between the different actors (ships, service providers, shore industry, etc.) over preventive measures needed to be reconciled, always with the aim of safeguarding port operations whilst protecting the health of those involved. (4) representative of the port community: mediating. although during the covid-19 pandemic formally no additional authority was assigned to apa, its primary legitimacy base derived from its central position as a core provider of services to the industry and as a safeguard of the shipping interests reaching far beyond the local and regional economy of antwerp. this also means that apa as a broker represents the port community, for instance illustrated by a press release on the task force on april 2nd: "at the moment port of antwerp is not experiencing any fall in the volume of freight. in fact there is a noticeable increase in the volume of pharmaceuticals and e-commerce. the supply of foodstuffs is also going smoothly. on the other hand there has been a fall in imports and exports of cars and other industrial components due to various industries closing down." by limiting the number of network participants the lead apa network was able to create a narrow orientation for the network purpose: safeguarding access to the seaport for shipping.
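to make the brokerage typology above concrete, the following minimal sketch (not the authors' code) counts gould-fernandez brokerage roles in a directed network with subgroup labels; the actor names and group labels in the toy example are hypothetical placeholders.

```python
# counts the five gould-fernandez brokerage roles for every node acting as broker b in i -> b -> k;
# a sketch under the assumption that brokerage requires the direct tie i -> k to be absent.
from itertools import permutations

import networkx as nx

def gould_fernandez_roles(g: nx.DiGraph, group: dict) -> dict:
    roles = {n: {"coordinator": 0, "itinerant": 0, "gatekeeper": 0,
                 "representative": 0, "liaison": 0} for n in g}
    for i, b, k in permutations(g.nodes, 3):
        if g.has_edge(i, b) and g.has_edge(b, k) and not g.has_edge(i, k):
            gi, gb, gk = group[i], group[b], group[k]
            if gi == gb == gk:
                roles[b]["coordinator"] += 1          # all three actors share one subgroup
            elif gi == gk != gb:
                roles[b]["itinerant"] += 1            # broker from outside mediates within a subgroup
            elif gi != gb and gb == gk:
                roles[b]["gatekeeper"] += 1           # broker filters what enters its own subgroup
            elif gi == gb and gb != gk:
                roles[b]["representative"] += 1       # broker speaks for its subgroup towards outsiders
            else:
                roles[b]["liaison"] += 1              # all three subgroups differ
    return roles

# hypothetical toy example: a few actors and subgroups, not the paper's data
g = nx.DiGraph([("OP/HSS", "APA_lead"), ("APA_lead", "Pilots"),
                ("Shipping", "APA_lead"), ("APA_lead", "OP/VT")])
group = {"OP/HSS": "apa", "OP/VT": "apa", "APA_lead": "apa",
         "Pilots": "nautical", "Shipping": "shipping"}
print(gould_fernandez_roles(g, group)["APA_lead"])
```

the triad loop mirrors the definition of brokerage used in the typology: b only brokers between i and k when no direct tie i to k exists, and the role follows from how the three group memberships combine.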
in addition, to increase effectiveness the lead apa network mediated by handing over the presidency of the meeting to the transnational nautical authority over the scheldt estuary as from april 1st, in order to be consistent with its formal authority towards external parties (i.c. shipping) and to further enhance consensus and power symmetry between the actors.
references:
strategic alliance structures: an organization design perspective
environments of organizations
a city is not a tree. 50th anniversary edition
an exploratory study of conflict and coordination in interorganizational service delivery system
making things work: solving complex problems in a complex world
from high-reliability organizations to high-reliability networks: the dynamics of network governance in the face of emergency
ucinet for windows: software for social network analysis
analysis and visualization of social networks
the lost art of planning
a search for beauty: a struggle with complexity: christopher alexander
structures of mediation: a formal approach to brokerage in transaction networks
a preliminary examination of command and control by incident commanders of dutch fire services during real incidents
measuring mediation and separation brokerage orientations: a further step toward studying the social network brokerage process
meta-organization design: rethinking design in interorganizational and community contexts
what's the difference? diversity constructs as separation, variety, or disparity in organizations
wicked problems in public policy
legitimacy building in the evolution of small-firm multilateral networks: a comparative study of success and demise
how organizational field networks shape interorganizational tie-formation rates
appendix c. note: due to the sensitivity of the data we do not show raw data and cleaned data.
key: cord-322815-r82iphem authors: zhang, weiping; zhuang, xintian; wang, jian; lu, yang title: connectedness and systemic risk spillovers analysis of chinese sectors based on tail risk network date: 2020-07-04 journal: nan doi: 10.1016/j.najef.2020.101248 sha: doc_id: 322815 cord_uid: r82iphem
abstract: this paper investigates the systemic risk spillovers and connectedness in the sectoral tail risk network of the chinese stock market, and explores the transmission mechanism of systemic risk spillovers by block models. based on the conditional value at risk (covar) and single index model (sim) quantile regression technique, we analyse the tail risk connectedness and find that during market crashes the stock market is exposed to more systemic risk and more connectedness. further, the orthogonal pulse function shows that the herfindahl-hirschman index (hhi) of edges has a significant positive effect on systemic risk, but the impact shows a certain lagging feature. besides, the directional connectedness of sectors shows that systemic risk receivers and transmitters vary across time, and we adopt the pagerank index to identify systemically important sectors, represented by the utilities and financial sectors. finally, by the block model we find that the tail risk network of chinese sectors can be divided into four different spillover function blocks. the roles of blocks and the spatial spillover transmission paths between risk blocks are time-varying. our results provide useful and positive implications for market participants and policy makers dealing with investment diversification and tracing the paths of risk shock transmission.
in recent years, financial markets have become extremely volatile, especially during the global financial crisis in 2008 and the continued plunge in global stock markets caused by covid-19 in 2020. this has drawn much attention from academia to measuring systemic risk and grasping how systemic risk spreads across sectors or markets. there is some evidence that financial systemic risk threatens "the function of a financial system" (european central bank (ecb), 2010) and impairs "the public confidence or the financial system stability" (billio et al., 2012). it is widely observed that systemic risk spillovers follow a significant "production-contagion-re-contagion" pattern. given the interconnectedness within a market, once one sector encounters a risk shock, the risk will affect other sectors through strong linkages and contagion mechanisms, and may even spread to the entire financial market. in this context, investigation into the connectedness among financial markets and the contagion mechanism of systemic risk spillovers across sectors or markets becomes important and necessary, helping regulators identify sources of risk and formulate intervention strategies, and helping investors make smarter portfolio strategies. the complex relationship between financial markets and their internal elements is the carrier of systemic risk transmission, and their connectedness patterns or structures play an important role in the formation and contagion process of systemic risk. moreover, the concept of systemically important financial institutions (sifis) can be extended to broader markets or sectors. some scholars find that sectors respond differently to shocks due to their own market and sectoral heterogeneity and risk features (ewing, 2002; ranjeeni, 2014; yang et al., 2016; wu et al., 2019). for stock market participants, sectoral indexes can be used as a significant indicator to assess portfolio performance. identifying which sector is the most influential and how systemic risks spill over among sectors is essential for effective risk management and optimal portfolios. therefore, in addition to cross-institution risk spillovers, cross-sector and cross-market risk transfer has become increasingly prominent. it not only greatly increases the probability of cross-contagion of financial risks, but may also trigger broader systemic risks. drawing on the knowledge of sifis, this paper measures the systemic risk of each sector in the chinese stock market, analyzes the spatial connectedness of various sectors to determine which sectors play a leading role in risk spillover or market co-movements, and explores the risk spillover transmission paths. the results of this study can help manage systemic risk and preserve financial stability, which in turn contributes to the smooth functioning of the real economy. despite its great importance, the existing cross-sector literature on this topic is relatively scarce. in this study we first develop and apply the tail risk network based on the single-index generalized quantile regression model of härdle et al. (2016), which takes non-linear relationships and variable selection into consideration. further, we investigate the tail risk network topology and its dynamic changes to analyze the spatial connectedness of 24 chinese sectors during 2007-2018. in order to understand the impact of network connectivity on systemic risk, we draw on the orthogonal pulse function and find that the hhi of edges has a significant positive effect on systemic risk.
second, we adopt the pagerank index to identify systemically important sectors. it is observed that although the systemically important industries are time-varying, the utilities and financial sectors (including banks, insurance and diversified finance) should still receive more attention. finally, we innovatively use block models to assess the roles of different spillover blocks, and excavate the transmission paths of risk spillover in different blocks. the remainder of the paper is organized as follows. in section 2, the literature related to our study is outlined. section 3 shows the data and methodology. section 4 is the empirical results. section 5 presents our conclusion. systemic risk threatens the stability and functioning of financial markets when the stock market is confronted with a sharp downtrend and with reduced market confidence and willingness to take risk. it is considered to be the risk that a large number of market participants suffer serious losses at the same time and that these losses quickly spread through the system (benoit et al., 2017). a number of researchers have discussed the measurement of the financial system's systemic risk and macroprudential risk management approaches (laeven et al., 2016; acharya et al., 2012). the relevant literature in this field can be roughly divided into four categories: the first is conditional value-at-risk (covar). adrian and brunnermeier (2007) put forward the covar, defined as the var of the financial system when a single market or sector encounters some specific events. they then proposed a measure of systemic risk, δcovar (adrian and brunnermeier, 2016), which is defined as the difference between covars when a sector is and is not under turmoil. girardi and ergun (2013) used the covar approach to measure the systemic risk contribution of four financial sectors composed of many institutions, and investigated the relationship between institutional characteristics and systemic risk contributions. the second category emphasizes the default probability of financial institutions through the interrelationships between financial assets, for example principal component-based analysis, e.g. kritzman et al. (2011), bisias et al. (2012), rodriguez-moreno et al. (2013) and others, and cross-correlation coefficient-based analysis, e.g. huang et al. (2009) and patro et al. (2013). the third category uses the copula function to calculate systemic risk from the tail of the stock market distribution. krause et al. (2012) adopted copula functions to calculate the nonlinear correlation of time series and established an interbank lending network; the results show that externally failed banks can trigger a potential banking crisis, and the framework is used to analyze the spread of risks within the banking system. the last category looks at an institution's expected equity loss when the financial system is suffering losses. acharya et al. (2017) proposed the marginal expected shortfall (mes) and systemic expected shortfall (ses), two systemic risk measures. further approaches take into account information on market capitalization and liabilities, such as the srisk (brownlees et al., 2017) and the component expected shortfall (ces) (banulescu et al., 2015). both the srisk and ces methods especially focus on the interdependence between a financial institution and the financial system, and ignore the interconnectedness among financial agents from a whole-system perspective.
however, as pointed out by bluhm et al. (2014), macro-prudential monitoring is still at a very early stage, and quantifying the magnitude of systemic risk and identifying the transmission paths need more scientific analysis. to do so we apply network methodology to quantify the interconnectedness among sectors in the financial system. network theory has long been a leading tool for analyzing intricate connectedness relationships because it can conquer the "dimension barrier" of multivariate econometric models and simplify complex financial systems (acemoglu et al., 2015; battiston et al., 2016; huang et al., 2018). in a financial network, financial entities (e.g. institutions, sectors and markets) are abstracted as nodes, and correlations among agents are abstracted as edges. the early literature on classic network construction methods concerns correlation-based networks, such as the minimal spanning tree (mst) (mantegna, 1999), the planar maximally filtered graph (pmfg) (tumminello et al., 2005) and the partial correlation-based network. the main disadvantage of correlation-based networks is that the economic or statistical meanings of their topological constraints are unclear (onnela et al., 2003; kenett et al., 2015; zhang et al., 2019). more recently, several econometric-based networks have been constructed to uncover information spillover paths and contagion sources (výrost et al., 2015; lyócsa et al., 2017; belke et al., 2018). the econometric-based networks are mainly classified into three groups: (i) the mean-spillover network (also called the granger-causality network), proposed by billio et al. (2012); (ii) the volatility spillover network, e.g., the variance decomposition framework-based network of diebold and yilmaz (2014), and the garch model-based network of liu et al. (2017); (iii) the risk spillover network, which mainly includes the tail-risk driven networks of hautsch et al. (2015) and härdle et al. (2016), and the extreme risk network of wang et al. (2017). of course, many studies have discussed the application of spillover networks. this study is distinguished from the existing information spillover literature by focusing on systemic risk spillover, especially tail risk spillover. the last strand is associated with the tail risk spillover network and its applications. hautsch et al. (2015) used the least-absolute shrinkage and selection operator (lasso) method to build a tail risk network for the financial system, and evaluated the systemic importance of financial firms. the tenet framework of härdle et al. (2016) is built on a semiparametric quantile regression framework that considers non-linearity and variable selection; they uncovered the asymmetric and non-linear dependency structure between financial institutions and identified systemically important institutions. wang et al. (2017) applied the caviar tool and granger causality tests to measure systemic risk spillovers, and then proposed an extreme risk spillover network for studying the interconnection among financial institutions. our work contributes to the literature in three major aspects. first, we analyze the characteristics of spatial connectedness and systemic risk spillovers of the tail risk network using sectoral data for the chinese stock market. we extend the literature on interconnectedness and systemic risk using sector-level data, while the extant literature generally focuses on financial institution data. second, we innovatively adopt the orthogonal pulse function to explore the impact of network connectivity on the systemic risk of the financial system.
besides, we employ the pagerank index to identify systemically important sectors that spread systemic risk spillovers to the entire system. third, we apply the block model in our study to assess the roles of different spillover blocks of the 24 sectors in the risk contagion process, and excavate the tail risk transmission paths and contagion mechanisms. the existing literature focuses more on the network topology and the identification of important financial institution nodes in the financial institution network, but lacks analysis of the risk propagation mechanism. importantly, it is necessary to clarify how systemic risk transfers across sectors. in order to analyze the systemic risk spillovers and interconnectedness across chinese sectors, we select the weekly closing prices of 24 sectors in china's stock market (name abbreviations of the 24 industries are given in appendix table a1). the sample data range from january 4, 2007 to december 31, 2018 (a total of 613 trading weeks), and the industry classification data are available from the wind database. our analysis centres on the weekly return of each sector, defined as $r_{i,t} = \ln(P_{i,t}/P_{i,t-1})$, where $P_{i,t}$ is the closing price of sector $i$ in week $t$. table 1 presents the descriptive statistics for the weekly returns of the 24 sectors during the sample period. note that the maximum of each return series, except for pbls and df, is less than the absolute value of the minimum, implying extreme risk in the left tail of the return distribution. besides, the jb statistic for each sector is significant at the 1% level, rejecting the null hypothesis of a gaussian distribution for the series. thereby, we use single-index model (sim) quantile regressions to estimate the covar. apart from the closing price data, and motivated by prior studies, we also collect five macro state variables and four internal variables. the macro state variables contain the weekly market returns, the market volatility, the real estate sector returns, the credit spread and the liquidity spread, which depict the economic situation. the internal variables contain the size, the turnover rate, the p/e ratio and p/bv, which reflect the influence of each industry's fundamental characteristics. the detailed definitions of these variables are given in table 2. notes to table 1: *** denotes significance at 1%.
table 2. variable definitions.
market returns: weekly market returns calculated from the shanghai securities composite index.
market volatility: defined as the conditional variances of the shanghai composite index returns estimated by garch(1,1).
real estate sector returns: the weekly logarithmic yield of the real estate index.
credit spread: difference between the 10-year treasury bond yield and the aaa-rated corporate bond yield.
liquidity spread: difference between the three-month shanghai interbank offered rate and the three-month treasury bond yield.
size: defined as the turnover, which is equal to volume multiplied by the average price.
turnover rate: the weekly turnover rate, available from wind info.
p/e ratio: the weekly price-earnings ratio.
p/bv: calculated as price/book value.
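as a minimal illustration of how the return series and the table 1 diagnostics described above can be computed, the sketch below derives weekly log returns from sector closing prices and runs the jarque-bera test; the simulated prices, column names and date grid are assumptions standing in for the wind data, not the authors' pipeline.

```python
import numpy as np
import pandas as pd
from scipy import stats

# toy stand-in for the weekly closing prices of the 24 sectors (the real data come from wind)
rng = np.random.default_rng(0)
weeks = pd.date_range("2007-01-05", periods=613, freq="W-FRI")
prices = pd.DataFrame(np.exp(np.cumsum(rng.normal(0, 0.03, size=(613, 24)), axis=0)),
                      index=weeks, columns=[f"sector_{k:02d}" for k in range(1, 25)])

returns = np.log(prices / prices.shift(1)).dropna()        # r_{i,t} = ln(P_{i,t} / P_{i,t-1})

summary = pd.DataFrame({
    "mean": returns.mean(),
    "std": returns.std(),
    "skew": returns.skew(),
    "kurtosis": returns.kurtosis(),                         # excess kurtosis
    "jb": [stats.jarque_bera(returns[c])[0] for c in returns],
    "jb_pvalue": [stats.jarque_bera(returns[c])[1] for c in returns],
})
print(summary.round(4).head())
```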
in this paper, we adopt the novel tenet framework proposed by härdle et al. (2016) to measure the tail risk interconnectedness among various industries and build a dynamically evolving tail risk network for china's stock market. as is well known, adrian and brunnermeier (2016) only model the linear interaction between two financial institutions; however, chao et al. (2015) find that any two interacting financial assets show non-linear dependency, especially in uncertain economic periods. therefore, accounting for non-linear dependency, härdle et al. (2016) extend the bivariate model to a high-dimensional setting and solve the variable selection problem by single-index quantile regressions. accordingly, we also carry out three estimation steps to complete the construction of the tail risk network.

first step, the var for each industry $i$ at quantile level $\tau \in (0,1)$ is estimated by linear quantile regression (illustrated in the code sketch below). given the return $X_{i,t}$ of industry $i$ at time $t$:
$$X_{i,t} = \alpha_i + \gamma_i M_{t-1} + \varepsilon_{i,t}, \qquad \widehat{VaR}^{\tau}_{i,t} = \hat{\alpha}_i + \hat{\gamma}_i M_{t-1},$$
where $M_{t-1}$ represents the macro state variables and $\alpha_i$, $\gamma_i$ are the estimated parameters.

second step, the covar is the basic element of the network, and it can reflect the systemic (tail) risk interconnections of sectors. the tail risk interconnectedness from one industry to another in the tail risk network stands for systemic risk contagion and network spillovers; thus the covar is estimated next. we perform a risk connectedness analysis by accounting for non-linear dependency in high-dimensional variables, and adopt the sim quantile regression technique to obtain the systemic risk contribution due to a change in the relevant industry. it is obtained via
$$X_{j,t} = g\big(\beta_{j}^{\top} R_{j,t}\big) + \varepsilon_{j,t}, \qquad R_{j,t} = \{X_{-j,t},\, M_{t-1},\, B_{j,t-1}\}, \qquad X_{-j,t} = \{X_{1,t}, X_{2,t}, \ldots, X_{n,t}\} \setminus \{X_{j,t}\},$$
where $X_{-j,t}$ collects the independent variables including the returns of all industries apart from industry $j$; $n$ denotes the number of sectors; $B_{j,t-1}$ is the set of internal features of every industry, i.e., size, turnover rate, p/e ratio and p/bv; and $\beta_j$ is the parameter vector. the covar is then
$$\widehat{CoVaR}^{\tau}_{j|R_j,t} = \hat{g}\big(\hat{\beta}_{j}^{\top} \tilde{R}_{j,t}\big),$$
where $\tilde{R}_{j,t}$ replaces the returns of the other industries by their estimated $\widehat{VaR}^{\tau}_{i,t}$. the covar represents network risk triggered by a tail event, which includes the influence of all other relevant industries on industry $j$ and the non-linearity that is reflected by the link function $g(\cdot)$. quantifying the marginal effect of the covariates, the componentwise gradient
$$\hat{D}_{j,t} = \left.\frac{\partial \hat{g}\big(\hat{\beta}_{j}^{\top} R_{j,t}\big)}{\partial R_{j,t}}\right|_{R_{j,t} = \tilde{R}_{j,t}}$$
reflects the risk spillover effects among the sample industries. note that we only focus on the partial derivatives of industry $j$ with respect to the other industries ($\hat{D}_{j|i,t}$) in the given network.

last step, the directed tail risk network is constructed. it is denoted as a graph $g(V, E)$ with a set of nodes $V = \{v_1, v_2, \ldots, v_n\}$ and a set of edges $E$. the adjacency matrix with all linkages at window $s$ is
$$A^{s} = \big[\,|\hat{D}^{s}_{j|i}|\,\big]_{i,j=1,\ldots,n}, \qquad (7)$$
where $v_i$ denotes industry $i$. the absolute value $|\hat{D}^{s}_{j|i}|$ is the element of the weighted matrix, and it measures the risk connectedness from industry $i$ to industry $j$.
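the sketch below illustrates the first estimation step described above, i.e., a linear quantile regression of one sector's returns on the lagged macro state variables at tau = 0.05, using statsmodels; the simulated series and column names are hypothetical, and the single-index quantile regression of the second step would require a dedicated semiparametric estimator, so it is not reproduced here.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

tau = 0.05
rng = np.random.default_rng(0)

# toy stand-ins: one sector's weekly returns and the lagged macro state variables of table 2
macro_lagged = pd.DataFrame(rng.normal(size=(612, 5)),
                            columns=["mkt_ret", "mkt_vol", "re_ret", "credit_spread", "liq_spread"])
sector_returns = 0.5 * macro_lagged["mkt_ret"] + rng.normal(0, 0.02, 612)

X = sm.add_constant(macro_lagged)
fit = sm.QuantReg(sector_returns, X).fit(q=tau)        # X_{i,t} = alpha_i + gamma_i' M_{t-1} + eps_{i,t}
var_hat = pd.Series(fit.predict(X), name="VaR_5pct")   # fitted 5% conditional quantile, i.e., the VaR series
print(fit.params.round(4))
```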
(1) network concentration. concentration is an important indicator of network structure and represents the density of the linkages. following fang et al. (2019), we apply the herfindahl-hirschman index (hhi), which is generally used to measure the extent of concentration in an industry. the hhi equals the sum of the squared market shares of each financial institution and can be used to measure the degree of monopoly; analogously, it can reflect the degree of risk network concentration, which is consistent with our definition. it is calculated by
$$HHI^{s} = \sum_{i=1}^{n} \big(p_{i}^{s}\big)^{2}, \qquad p_{i}^{s} = \frac{k_{i}^{s}}{K^{s}},$$
where $k_{i}^{s}$ is the number of edges connected to node $i$ at window $s$, $K^{s}$ is the total number of network edges, and $p_{i}^{s}$ denotes the proportion of connected edges of node $i$ relative to $K^{s}$ in window $s$, which stands for the degree of node $i$'s relative linkages.

(2) node strength. the node strength considers not only the number of directly connected edges but also the weights of the edges. it can be seen from the adjacency matrix of formula (7) that risk spillover is directional. therefore, this article pays more attention to the sectors that spread or absorb risks; that is, the out-strength (in-strength) is used to measure the risk contagion (absorption) ability of each sector. we introduce two directional measures of sector strength, i.e., the out-strength and the in-strength, which measure each sector's outgoing and incoming connectedness, respectively. the out-strength (os) of sector $i$ is the sum of the weights of outgoing edges from sector $i$ to the other sectors:
$$OS_{i}^{s} = \sum_{j \neq i} \big|\hat{D}^{s}_{j|i}\big|.$$
the in-strength (is) of sector $i$ is the sum of the weights of incoming edges from the other sectors to sector $i$:
$$IS_{i}^{s} = \sum_{j \neq i} \big|\hat{D}^{s}_{i|j}\big|.$$

(3) pagerank. assume that node $i$ has a direct link to node $j$; the more important node $i$ is, the higher the contribution value of node $j$ is. thereby, the pagerank reflects the connectedness between one industry and another while considering the influence ability of its neighbors. the pagerank algorithm is a variant of the eigenvector centrality of the adjacency matrix. as in wang et al. (2019), we compute the pagerank (pr) indicator through the iterative method, which introduces a dynamic process of information spread. first, we compute the centrality value of sector $i$ based on the risk network matrix (eq. (7)), normalizing the effect weights as
$$w_{ij}^{s} = \frac{\big|\hat{D}^{s}_{j|i}\big|}{\sum_{j \neq i}\big|\hat{D}^{s}_{j|i}\big|},$$
where $w_{ij}^{s}$ denotes the effect weight of sector $i$ on sector $j$ at window $s$. second, we adopt the pagerank algorithm proposed by page et al. (1999) to obtain
$$PR_{i}^{s} = \frac{1-d}{n} + d \sum_{j \neq i} w_{ji}^{s}\, PR_{j}^{s},$$
where $d$ is a damping factor (generally set to 0.85) and $PR_{i}^{s}$ is the pagerank of sector $i$; its value is always positive. a higher value means that sector $i$ makes a greater contribution to the systemic risk of the network.
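a minimal sketch of the network measures defined above (edge-concentration hhi, out-/in-strength and the iterative pagerank) applied to one window's spillover matrix follows; the matrix D is a toy stand-in for the |D| matrix of eq. (7), and the zero threshold used to decide whether an edge counts as "connected", the treatment of edge direction in the hhi, and the (1-d)/n normalisation are assumptions rather than the paper's exact conventions.

```python
import numpy as np

def edge_concentration_hhi(D: np.ndarray, threshold: float = 0.0) -> float:
    """hhi of edges: sum of squared shares of each node's connected edges in the total edge count."""
    links = (np.abs(D) > threshold).astype(int)
    np.fill_diagonal(links, 0)
    k_i = links.sum(axis=0) + links.sum(axis=1)       # edges connected to node i (incoming + outgoing); assumed convention
    K = links.sum()                                   # total number of directed edges in the window
    return float(((k_i / K) ** 2).sum())

def out_in_strength(D: np.ndarray):
    """out-strength: row sums of |D|; in-strength: column sums of |D| (D[i, j] = spillover i -> j)."""
    A = np.abs(D)
    return A.sum(axis=1), A.sum(axis=0)

def pagerank(D: np.ndarray, d: float = 0.85, tol: float = 1e-10, max_iter: int = 1000) -> np.ndarray:
    """iterative pagerank on row-normalised effect weights w_ij = |D_ij| / sum_j |D_ij|."""
    A = np.abs(D)
    row_sums = A.sum(axis=1, keepdims=True)
    W = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)
    n = A.shape[0]
    pr = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_pr = (1 - d) / n + d * (W.T @ pr)         # each sector inherits influence from its in-neighbours
        if np.abs(new_pr - pr).sum() < tol:
            return new_pr
        pr = new_pr
    return pr

D = np.random.default_rng(0).random((24, 24)) * (1 - np.eye(24))    # toy 24-sector spillover matrix
print(edge_concentration_hhi(D), out_in_strength(D)[0][:3], pagerank(D)[:3])
```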
the block model is the main method for spatial cluster analysis of complex financial networks. it was first proposed by white et al. (1976) as a method of studying network location modules, viewing social life as an interconnected system of roles. later, scholars conducted in-depth research on and promotion of this concept from many aspects. in addition, many scholars also use the block model to study specific issues, such as the scientific community (breiger, 1976), the world economy (snyder et al., 1979), organizational issues and regional contagion effects (shen et al., 2019). in short, the concept and method of the block model have been widely used. the block model can identify the aggregation characteristics among individuals in order to divide the network into location blocks. this method not only determines the members included in each block, but also analyzes the role played by each block in the risk propagation process and explores the spatial risk propagation path (li et al., 2014; zhang et al., 2020). there are four role blocks: (i) main benefit, whose members receive links not only from external members but also from their own members; the proportion of internal relations is large, while the proportion of external relations is small. in the extreme case it is called an isolated block, that is, the block has no connection with the outside. (ii) main spillover, whose members send more links to other blocks, send fewer links to inside members, and receive fewer links from outside. (iii) bilateral spillover, whose members send many links to their own members and to other blocks' members, but receive very few external links from other blocks. (iv) brokers, whose members both send and receive external relationships, while there are fewer connections between their internal members. motivated by wasserman et al. (1994), we analyze the relationship of each member of a block by the evaluation indicators shown in table 3.

suppose there are $g_k$ nodes in block $B_k$; then the number of possible relationships inside the block is $g_k(g_k - 1)$. the entire network contains $n$ nodes, so all possible relationships of the members of $B_k$ number $g_k(n - 1)$. in this way, the expected internal relation ratio of the block is $g_k(g_k - 1)/[g_k(n - 1)] = (g_k - 1)/(n - 1)$.

table 3. four types of blocks.
internal linkages ratio ≥ (g_k - 1)/(n - 1) and received linkages ratio ≈ 0: bilateral spillover.
internal linkages ratio ≥ (g_k - 1)/(n - 1) and received linkages ratio > 0: main benefit.
internal linkages ratio < (g_k - 1)/(n - 1) and received linkages ratio ≈ 0: main spillover.
internal linkages ratio < (g_k - 1)/(n - 1) and received linkages ratio > 0: brokers.

in this part, we apply sliding windows to estimate the time-varying var and covar and construct the dynamically evolving tail risk network. we use linear or non-linear quantile regression models to estimate var and covar at the quantile level τ = 0.05, and the sliding window size is set to w = 50 trading weeks (corresponding to one year's weekly data). in this way, we obtain 563 windows in total. to obtain a preliminary view of the whole sample dataset, we present the log returns and covar of the 24 sectors, and the dynamic evolution of the total connectedness and average lambda of the tail risk network from 2008-01-04 to 2018-12-31 (window size w = 50, 563 windows). fig. 1 (bottom) shows the dynamic evolution of the total connectedness of the tail risk network. we observe that the average λ value exhibits two obvious peaks, corresponding to the us subprime mortgage crisis in 2008 and the domestic stock market turmoil in 2015. however, the total connectedness of the tail risk network has at least five peaks, corresponding to the us subprime mortgage crisis in 2008, the european debt crisis in 2011, the money shortage in 2013, the stock market turmoil in 2015 and the trade friction between the us and china in 2018. this phenomenon reflects that the total connectedness of the tail risk network is more sensitive to shocks to the chinese stock market, and may serve as an alarm before market turmoil. in this section we first measure the network edge concentration to reflect the overall connectivity of the chinese sectoral tail risk network, and investigate the impact of network edge concentration on systemic risk at the global level. fig. 2 shows the dynamic evolution trend of the sectoral tail risk spillover network edge concentration. from fig. 2 we can see that the sectoral tail risk network edge concentration (hhi) has apparent periodic variation characteristics. this finding is basically consistent with the periodic evolution of systemic risk in the time dimension. further, the first and last peaks are most notable, corresponding to the 2008 financial crisis and the 2015 domestic stock market turbulence. we take the period of the last peak (2015/1/30-2016/12/30) as an example. in this period, the most significant change of the hhi value is a rapid climb from 0.187 (may 2015) to 0.228 (july 2015). as the potential risks continued to accumulate, the concentration of edges reached a maximum of 0.232 (january 2016). in the early stage, the chinese government issued a series of economic reform measures, which stimulated investors' blindly optimistic expectations. besides, large-scale funds of financial institutions entered the stock market through "highly leveraged" off-market allocation, and the excessive risk-taking behavior of different types of firms in the stock market led to an increase in indirect correlation.
gradually, due to the downward pressure of china's macro economy and the strict investigation of off-market allocation by the china securities regulatory commission, a large-scale withdrawal of credit funds and an avalanche-like chain reaction led to the 2015-2016 stock market crash. this shows that as market turbulence intensifies, the concentration of the risk network increases, and the edges of the entire network are mostly concentrated in a few highly centralized sectors. at this time, the stability of the network structure is very poor. if these nodes encounter a risk shock or infection, the systemic risk will quickly spread throughout the network, and the risk spillover effect between sectors will be significantly strengthened. conversely, as the risk is released and the market gradually stabilizes or rises, the hhi indicator value becomes smaller. this phenomenon indicates that the tail risk network then exhibits the characteristics of multiple centers rather than a single central node. the multi-center network structure facilitates the dispersal of risk information through multiple channels and is conducive to maintaining the stability of the stock market network. here, we adopt the orthogonal pulse function, with the hhi of edges as the source of shock, to test the short-term dynamic relationship between network edge concentration and systemic risk. this method is widely used to analyze the relationships between variables (pradhan, 2015; berument et al., 2009). the pulse function not only presents the direction of the influence, but also reflects the significance level and time lags. fig. 3 depicts the response of systemic risk to network edge concentration. in fig. 3, the vertical axis denotes the systemic risk, while the horizontal axis denotes the time lag (in months) after the shock. it is observed that the hhi of edges initially has no significant positive effect on systemic risk. with the accumulation of risks, hhi begins to show a positive effect on systemic risk from the second month, and reaches the maximum in the fourth month. gradually, the effect of hhi becomes insignificant after nine months. the result shows that hhi has a significant positive impact on systemic risk, but the impact shows a certain lagging feature. a reasonable explanation for this phenomenon is that as the hhi value increases, the connected edges in the network are increasingly concentrated on a few central nodes, so the systemic risks of the network are cumulatively amplified. however, the "slow accumulation and rapid release" characteristic of systemic risk and the shortcomings of the chinese financial market under severe macro-regulation are important reasons for the lagging effect. of course, this provides strong evidence supporting the results in fig. 1, which suggest that the concentration of the risk network is more sensitive to cumulative systemic risks. in addition to analyzing the overall connectedness of the tail risk network, we also analyze the weighted and directed edges of individual industry nodes. fig. 4 and fig. 5 reflect the dynamic evolution of the risk absorption and risk propagation of each sector during the entire period, respectively. first of all, we can see that both the risk propagation and the risk absorption ability of each sector change over time. many in-strength values are less than one, and only a few sectors have larger values (see fig. 4), suggesting that these few sectors are seriously infected by external shocks and receive the highest tail risk.
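referring to the orthogonal impulse response analysis above, the sketch below shows one way to obtain the response of systemic risk to a shock in edge concentration from a bivariate var using statsmodels; the toy series, the lag order of 2 and the 12-period horizon are assumptions, not the paper's specification.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(1)
# toy stand-ins for the monthly edge-concentration (hhi) and aggregate systemic-risk series
data = pd.DataFrame(rng.normal(size=(120, 2)), columns=["hhi", "sysrisk"])

fit = VAR(data).fit(2)                                     # assumed lag order
irf = fit.irf(12)                                          # impulse responses up to 12 periods ahead
irf.plot(orth=True, impulse="hhi", response="sysrisk")     # orthogonalised response of systemic risk to an hhi shock
```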
in the first shock event period (2008/1/25-2009/12/31), four sectors, i.e., business and professional services (bps), media (med), home and personal items (hpi) and healthcare equipment and services (hes), have the largest in-strength values and are the top receivers of tail risk. the results show that the systemic risk from the us subprime mortgage crisis seriously shocked china's real-economy sectors, and these industries accumulated more tail risk. in the second extreme event period (2010/1/29-2012/12/28), which covers the european sovereign debt crisis, healthcare equipment and services (hes), the software and services index (ss) and semiconductor and semiconductor production equipment (sspe) receive the largest incoming links. during the "chinese stock market turbulence" period (2015/1/30-2016/12/30), the strongest incoming links are found for business and professional services (bps), media (med), the software and services index (ss) and semiconductor and semiconductor production equipment (sspe). in the "trade friction between the us and china" period (2017/1/26-2018/12/28), five sectors, i.e., the software and services index (ss), semiconductor and semiconductor production equipment (sspe), insurance (ins), utilities (ut) and business and professional services (bps), have strong incoming links, showing that these sectors are the most affected by tail risk. this finding supports the evidence that in 2018 the us imposed trade sanctions on commodities of various chinese industries, including communications, electronics, machinery and equipment, automobiles, furniture and so on, which correspond to the above-mentioned industry classifications. hence, the greater the in-strength value, the deeper a sector is impacted by other sectors, and the more serious the damage.
fig. 5. out-strength of each sector for the dynamic tail risk network. notes: the horizontal axis (x) denotes time windows, and the vertical axis (y) denotes the abbreviation code of sectors (the corresponding full name of each code is presented in appendix table a1).
as can be seen from fig. 5, the distribution of the out-strength differs from that of the in-strength and is relatively even. many sectors have lower out-strength values, and only a few out-strength values are larger, indicating that these few sectors emit the highest systemic risk. in the first event period (2008/1/25-2009/12/31), the strongly connected sectors with outgoing links are energy (ene), diversified finance (df), insurance (ins) and utilities (ut). this indicates that, affected by the us subprime mortgage crisis, these industries are the main senders of tail risk. in the third event period (2013/1/25-2014/12/31), which covers the money shortage in 2013, home and personal items (hpi), media (med) and diversified finance (df) send the largest outgoing links to others. one of the reasons for the home and personal items sector having a high level of outgoing links is that the reduction of currency circulation in the market directly reduces the daily consumption level of consumers. in the fourth event period (2015/1/30-2016/12/30), which covers the "2015-2016 china stock market turbulence", two financial sectors, bank (bank) and diversified finance (df), together with media (med), have strong outgoing links and are involved in most risk spillovers. this phenomenon supports the view that financial institutions (especially the securities sector) triggered that bear market.
overall, the greater the value of the out-strength, the stronger the ability of one sector to spread tail risk to other sectors, and the greater the impact on others. connectedness alone cannot stand for the systemic importance of an individual sector. we thus calculate the pagerank index, since it considers both the interconnectedness and the influence ability of neighboring nodes. to achieve comprehensive knowledge of the systemic importance of each sector, we draw the heatmap of pagerank values shown in fig. 6. obviously, the influence of different industries in different periods varies greatly. from fig. 6 we observe that the pagerank value of most industries is less than 0.05, while only a few sectors have high pagerank values, showing that these sectors can act as influential sectors in the chinese stock market. for example, in the first risk event period, the top three sectors are utilities (ut), diversified finance (df) and media (med), which are thus systemically important, and utilities (ut) and insurance (ins) are the systemically important sectors in the second risk event period. the most important reason why ut becomes a systemically important industry is that the utility industry provides infrastructure protection for the development of other industries. furthermore, in the third risk event period, home and personal items (hpi) and diversified finance (df) have larger pagerank values and are thus the largest tail risk contributors during that period. in the fourth risk event period, only diversified finance (df) consistently presents higher pagerank values. one of the major reasons for diversified finance being the influential sector is that large-scale abnormal securities margin transactions caused a surge in systemic risk under unregulated conditions, which in turn affected many associated industries through asset-liability relationships or high leverage. at the end of 2017, utilities (ut) and energy (ene) are systemically important sectors. as mentioned above, the utilities and financial sectors should receive more attention over the whole period from both regulators and investors, as they become systemically important sectors in many risk event periods. therefore, in the near future, the dependence on the utilities industry will not decrease significantly, which may reinforce utilities stocks. besides, for the financial sectors, the development of the whole industry depends much on balancing the financial structure, strengthening financial regulation and improving financial innovation. the most fundamental reason is that the financial sector is an important sector in the national economy: it has the characteristics of high industry linkage and strong driving ability, and it provides financial support for the development of enterprises in many sectors. once the financial industry is in a downturn, it affects the development of the entire industry chain.
fig. 6. heat map of the pagerank value of each sector for the dynamic tail risk network. notes: the horizontal axis (x) denotes time windows, and the vertical axis (y) denotes the abbreviation code of sectors (the corresponding full name of each code is presented in appendix table a1).
this section divides the 24 sectors into different blocks through the block model, finds out which sectors are likely to cluster in the same community, and then further examines the relative roles of each block in the sectoral tail risk network.
this method reflects more simply and clearly the function of each industry and the risk propagation paths in the risk spillover process, and it helps the regulatory authorities and investors to grasp the risk transmission mechanism, formulate risk prevention measures and optimize asset allocation strategies. here, we conduct a segmented sample study covering five sub-samples: period 1 is the us subprime crisis (2008-2009); period 2 is the european debt crisis (2010-2012); period 3 is the money shortage period (2013-2014); period 4 is the 2015-2016 chinese stock market turbulence; and period 5 is the trade friction between the us and china (2017-2018). following existing research practice (chen et al., 2019; zhang et al., 2020), we use the ucinet software to partition the tail risk network adjacency matrix into block positions, setting the maximum separation depth to 2 and the convergence criterion to 0.2. this yields four risk spillover blocks in each of the five sub-samples. table 4 presents the spatial connectedness and role analysis of the risk blocks of sectors in the five sub-samples. from table 4, there are significant differences in the roles played by the four major blocks, and the features of the blocks vary across time. we take periods 1 and 5 as examples to analyze the spatial risk linkages among the 24 sectors. specifically, in periods 1 and 5 the internal linkages within the four blocks number 69 and 28, respectively, while the cross-block linkages number 67 and 88, respectively, indicating that the spatial spillovers between the four blocks are very pronounced. in period 1, the first block sends 13 relations, of which 5 remain within the block, and receives 19 relations from other blocks; the expected internal relation ratio is 13.04% and the actual relation ratio is 38.46%, so it is called a "main benefit block". members of the first block are ene, ut, bank and re, indicating that the tail risk spillovers among these sectors are closely linked and that they are easily affected by external risk shocks. furthermore, the second block sends 22 relations, of which 9 are within the block, and receives 12 relations from other blocks; the expected internal relation ratio is 13.04% and the actual relation ratio is 40.91%, so it is called a "bilateral spillover block". the third and fourth blocks are likewise "bilateral spillover blocks". members of the second block are tsp, cs, df and ins; fluctuations generated by these sectors lead to large subsequent fluctuations in other sectors. for example, the transportation industry (tsp) is an upstream industry for many others, and when risks occur they are transferred to other industries through sector linkages. overall, the internal link ratios of the first and second blocks are low, while those of the third and fourth blocks are high, and the third and fourth blocks exchange more links with each other. in period 5, the first block sends 46 relations, of which 16 are within the block, and receives 17 relations from other blocks, including 10 links from the fourth block; the expected internal relation ratio is 26.09% and the actual relation ratio is 34.78%, so it is called a "bilateral spillover block".
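the per-block bookkeeping used above (sending, internal and received relations, and the expected versus actual internal relation ratio) can be reproduced with a few lines of numpy. the sketch below is our reading of that bookkeeping rather than code released with the paper; it assumes a binary adjacency matrix in which row i sends tail risk to column j, and a dict `blocks` mapping block names to member indices.

```python
# a minimal sketch: per-block relation counts and relation ratios.
import numpy as np

def block_summary(adj, blocks):
    n = adj.shape[0]
    summary = {}
    for name, members in blocks.items():
        mask = np.zeros(n, dtype=bool)
        mask[members] = True
        sending = int(adj[mask, :].sum())               # all relations sent by members
        internal = int(adj[np.ix_(mask, mask)].sum())   # relations staying inside the block
        received = int(adj[~mask, :][:, mask].sum())    # relations received from other blocks
        summary[name] = {
            "sending": sending,
            "internal": internal,
            "received": received,
            # share of internal relations expected if links were spread evenly
            "expected_ratio": (len(members) - 1) / (n - 1),
            "actual_ratio": internal / sending if sending else 0.0,
        }
    return summary
```

with the counts quoted above for the period-1 first block (4 members out of 24 sectors, 13 sending relations, 5 of them internal), this reproduces the 13.04% expected and 38.46% actual ratios.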
the second block sends 4 relations, none of which are internal, and only 1 link is sent to the fourth block; the actual internal relation ratio of the second block is therefore 0% and it is called a "net benefit block". members of the second block are re, bank, tsp, mat and the, indicating that these industries are more sensitive to external risk shocks and are the largest systemic risk contributors during the us-china trade friction period. for example, the most likely reason the real estate industry (re) acts as a risk transmitter in the risk network is its high degree of industry connectivity: it drives the development of materials, manufacturing, banking, home and personal items and other industries, so once the real estate industry is sluggish it causes turmoil along the entire industry chain. the third block sends 20 relations, of which 3 are within the block, and it mainly receives relations from the fourth block; the expected internal relation ratio is 26.09% and the actual relation ratio is 15%, so it is called a "broker block", playing the role of a "bridge" in systemic risk transmission. importantly, strong spillover transmission between blocks may depend on the functioning of the "broker block", probably because of the mutual linkages and the bidirectional economic or financial effects between its members and those of other blocks. the fourth block sends 45 relations, of which 9 are internal, and it mainly sends relations to the third block; the expected internal relation ratio is 21.74% and the actual relation ratio is 20%, so it is called a "main spillover block". members of the fourth block are bps, dcgc, df, ins, ss and sspe, on which the us levied high tariffs, and they therefore become the spillover engine. overall, the internal link ratio of the first block is high, while the ratios of the second and third blocks are low. the detailed analysis of periods 2-4 is omitted due to space limitations; the detailed results are shown in table 4.

in order to reveal more clearly the spillover distribution and relative roles of the tail risk relationships among the 24 sectors, we calculate the density matrix and image matrix of each block (shown in table 5). the overall density values of the tail risk network in the five periods are 0.246, 0.257, 0.219, 0.230 and 0.210, respectively, and the overall network density is selected as the critical value: if a block's density is greater than the overall network density, the corresponding position in the image matrix is assigned 1; otherwise, the value is 0. for example, in period 1 the density of the first block is 0.417, which is greater than the overall network density (0.246), showing that the block's density exceeds the average of the whole network and that the risk spillover linkages within the block have a significant tendency to concentrate.
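the density matrix and its dichotomised image matrix follow directly from the same inputs; the sketch below applies the thresholding rule stated above (block density compared with the overall network density) and is illustrative rather than the authors' implementation.

```python
# a minimal sketch: block-to-block density matrix and 0/1 image matrix.
import numpy as np

def density_and_image(adj, blocks):
    n = adj.shape[0]
    overall_density = adj.sum() / (n * (n - 1))       # critical value
    names = list(blocks)
    dens = np.zeros((len(names), len(names)))
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            rows, cols = blocks[a], blocks[b]
            # possible ties exclude self-loops on the diagonal blocks
            possible = len(rows) * len(cols) - (len(rows) if a == b else 0)
            dens[i, j] = adj[np.ix_(rows, cols)].sum() / possible
    image = (dens > overall_density).astype(int)
    return names, dens, image, overall_density
```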
taking period 1 as an example, the image matrix in table 5 shows that: (i) the diagonal elements of all four blocks are 1, indicating that internal risk spillovers within each block are closely related and that there is a clear "rich-club" effect; (ii) the first block receives risk spillover connections from the second and fourth blocks; (iii) the second block receives risk connections mainly from the third block, plays the role of a "bridge", and interconnects the spatial risk spillovers of the first and third blocks; and (iv) the fourth block achieves correlation and interaction with the other blocks through its risk spillover association with the first block. the results indicate that interconnections between the different blocks largely do not occur directly but mainly through the intermediation of the first and second blocks. in the following, we continue to analyze the tail risk spillovers across blocks. fig. 7 displays the dynamic evolution of the risk spillover transmission mechanism between the four blocks. the spatial connectedness between the risk spillover blocks is time-varying, since the membership of the blocks is also time-varying, and the blocks' features in the risk transmission process therefore differ across periods. from fig. 7, it is easy to see that the risk transmission paths across the blocks are more complicated during the first, second and fifth periods, and simpler in the third and fourth periods. the most likely reason is that the sources of risk differ: the first, second and fifth periods are driven by turmoil in foreign markets, which triggers changes in the relevant industries in china, whereas the third and fourth periods are driven by domestic macroeconomic regulation or by particular sectors with higher levels of accumulated systemic risk. thus, risk shocks originating in a particular sector spread to the sectors of other blocks in a more or less homogeneous way, even though some blocks are not directly related to each other. for example, in period 3 the source of infection among the risk spillover blocks is the second block, which spreads systemic risk shocks to the first, third and fourth blocks simultaneously; however, there is no significant risk transmission channel between the first and third blocks. members of the second block are cs, fbt, hes, pbls and bank, which deserve closer attention and supervision from the regulatory authorities, and investors should avoid investing in these industries. in addition, in period 5 the fourth block acts as the risk spillover engine and directly transmits risk shocks to the first, second and third blocks. the members of the fourth block include bps, dcgc, df, ins, ss and sspe, of which dcgc, ss and sspe are subject to the high tariffs of us trade policy toward china; the export of related products in these sectors is seriously affected, which in turn can easily trigger systemic risk. simultaneously, the first and third blocks both transmit the tail risk received from the fourth block on to the second block, which acts as a distinct bridge and hub. the second block is therefore the most sensitive block, since it absorbs risk spillovers from all other blocks. due to space limitations, the analysis of risk transmission paths in the other periods is not repeated.
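the block-level transmission channels discussed above can be read off the image matrix by treating each off-diagonal 1 as a directed spillover channel between blocks; the short sketch below illustrates this reading (it reuses the output of the density/image helper sketched earlier, which is an assumption of the illustration).

```python
# a minimal sketch: block-level spillover channels implied by the image matrix.
import networkx as nx

def block_channels(names, image):
    g = nx.DiGraph()
    g.add_nodes_from(names)
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            if i != j and image[i][j] == 1:
                g.add_edge(a, b)          # block a sends tail risk to block b
    return g

# e.g. list(block_channels(names, image).edges()) gives the cross-block arrows
```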
this paper applies a single-index model in a generalized quantile regression framework to capture non-linear relationships and perform variable selection, and on this basis constructs a dynamic tail risk network for 24 chinese sectors over 2007-2018. at the global level, we first analyze the connectedness of systemic risk spillovers in the tail risk network and investigate the impact of network concentration on systemic risk. at the individual sector level, we calculate the risk contagion or absorption intensity of each sector and adopt the pagerank method to identify systemically important sectors. finally, we use a block model to study the spillover distribution and the relative roles of the tail risk relationships among the 24 sectors and to understand how financial risk is transmitted across sectors. we report the following findings. first, there is a tail risk network that connects all sectors in the chinese stock market, and it is exposed to more systemic risk and total connectedness during market distress. further, the edge concentration of the risk network (hhi) is used to measure network interconnectedness and concentration, and it exhibits clear cyclical features: during tail event (market downside) periods the hhi index increases significantly, the risk network approaches a single-central-node structure, and network stability is poor. the results suggest that a multi-centered financial network, rather than one with a single pivotal center, can better maintain financial market stability. second, the directional connectedness of sectors shows that systemic risk receivers and transmitters vary over time, providing evidence of "too linked to fail". in addition, we identify the utilities and financial sectors as influential sectors that deserve more attention from both regulators and investors over the whole sample period. finally, we find that the sectoral tail risk network can be divided into four spillover function blocks by the block model, which reflects more clearly the risk spillover distribution and the roles of the relevant industries in the process of systemic risk transmission; the roles of the blocks and the spatial spillover transmission paths between them are time-varying. this study has important policy implications for cross-sector linkages and systemic risk spillovers in the chinese stock market. first, it is necessary for the government to issue favorable policies, such as sectoral development policies or macro-control policies, in a timely manner; these will promote the influence of the relevant industries in the stock market and thus foster a multi-centered network structure that maintains financial market stability. second, investors should pay more attention to the systemically important sectors and design reversal strategies around these sectors when configuring their assets and portfolios to minimize risk. supervisory departments may consider the features of the four blocks and their spillover paths when formulating differentiated financial regulatory policies that improve the macro-prudential framework during periods of stock market recession and instability. a thorough analysis of sectoral tail risk spillovers and their spatial connectedness can help monitor systemic risk and preserve financial system stability, which in turn contributes to the smooth functioning of the real economy.
systemic risk and stability in financial networks
capital shortfall: a new approach to ranking and regulating systemic risk
measuring systemic risk
covar. federal reserve bank of new york staff report
which are the sifis? a component expected shortfall approach to systemic risk
complexity theory and financial regulation
international spillover in global asset markets
where the risks lie: a survey on systemic risk
monetary policy and u.s. long-term interest rates: how close are the linkages?
a survey of systemic risk analytics
econometric measures of connectedness and systemic risk in the finance and insurance sectors
systemic risk in an interconnected banking system with endogenous asset markets
career attributes and network structure: a block model study of a biomedical research specialty
srisk: a conditional capital shortfall measure of systemic risk
quantile regression in risk calibration
cross-border linkages of chinese banks and dynamic network structure of the international banking industry
on the network topology of variance decompositions: measuring the connectedness of financial firms
the transmission of shocks among s&p indexes
interconnectedness and systemic risk: a comparative study based on systemically important regions
systemic risk measurement: multivariate garch estimation of covar
tenet: tail-event driven network risk
financial network systemic risk contributions
financial network linkages to predict economic output
a framework for assessing the systemic risk of major financial institutions
co-movement of coherence between oil prices and the stock market from the joint time-frequency perspective
partial correlation analysis: applications for financial markets
interbank lending and the spread of bank failures: a network model of systemic risk
principal component as a measure of systemic risk
bank size, capital, and systemic risk: some international evidence
study on the spatial correlation and explanation of regional economic growth in china
features of spillover networks in international financial markets: evidence from g20 countries
return spillovers around the globe: a network approach. economic modelling
hierarchical structure in financial markets
dynamics of market correlations: taxonomy and portfolio analysis
the pagerank citation ranking: bringing order to the web
a simple indicator of systemic risk
orthogonal pulse based wideband communication for high speed data transfer in sensor applications
sectoral and industrial performance during a stock market crisis
systemic risk measures: the simpler the better?
china's regional financial risk spatial correlation network and regional contagion effect
structural position in the world system and economic growth, 1955~1970: a multiple network analysis of transnational interactions
a tool for filtering information in complex systems
analysing the systemic risk of indian banks
granger causality stock market networks: temporal proximity and preferential attachment
extreme risk spillover network: application to financial institutions
correlation structure and evolution of world stock markets: evidence from pearson and partial correlation-based networks
interconnectedness and systemic risk of china's financial
identifying influential energy stocks based on spillover network
social network analysis: methods and application
social structure from multiple networks i: block models of roles and positions
connectedness and risk spillover in china's stock market: a sectoral analysis
study on the contagion among american industries
spatial spillover effects and risk contagion around g20 stock market based on volatility network
spatial connectedness of volatility spillovers in g20 stock markets: based on block models analysis

the authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

key: cord-317435-4yuw7jo3 authors: zhou, yadi; hou, yuan; shen, jiayu; huang, yin; martin, william; cheng, feixiong title: network-based drug repurposing for novel coronavirus 2019-ncov/sars-cov-2 date: 2020-03-16 journal: cell discov doi: 10.1038/s41421-020-0153-3 sha: doc_id: 317435 cord_uid: 4yuw7jo3

human coronaviruses (hcovs), including severe acute respiratory syndrome coronavirus (sars-cov) and the 2019 novel coronavirus (2019-ncov, also known as sars-cov-2), have led to global epidemics with high morbidity and mortality. however, there are currently no effective drugs targeting 2019-ncov/sars-cov-2. drug repurposing, an effective drug discovery strategy that starts from existing drugs, could shorten the time and reduce the cost compared to de novo drug discovery. in this study, we present an integrative, antiviral drug repurposing methodology implementing a systems pharmacology-based network medicine platform, quantifying the interplay between the hcov–host interactome and drug targets in the human protein–protein interaction network. phylogenetic analyses of 15 hcov whole genomes reveal that 2019-ncov/sars-cov-2 shares the highest nucleotide sequence identity with sars-cov (79.7%). specifically, the envelope and nucleocapsid proteins of 2019-ncov/sars-cov-2 are two evolutionarily conserved regions, with sequence identities of 96% and 89.6%, respectively, compared to sars-cov. using network proximity analyses of drug targets and hcov–host interactions in the human interactome, we prioritize 16 potential anti-hcov repurposable drugs (e.g., melatonin, mercaptopurine, and sirolimus) that are further validated by enrichment analyses of drug-gene signatures and hcov-induced transcriptomics data in human cell lines. we further identify three potential drug combinations (e.g., sirolimus plus dactinomycin, mercaptopurine plus melatonin, and toremifene plus emodin) captured by the "complementary exposure" pattern: the targets of the drugs both hit the hcov–host subnetwork, but target separate neighborhoods in the human interactome network.
in summary, this study offers powerful network-based methodologies for rapid identification of candidate repurposable drugs and potential drug combinations targeting 2019-ncov/sars-cov-2. coronaviruses (covs) typically affect the respiratory tract of mammals, including humans, and lead to mild to severe respiratory tract infections 1. in the past two decades, two highly pathogenic human covs (hcovs), including severe acute respiratory syndrome coronavirus (sars-cov) and middle east respiratory syndrome coronavirus (mers-cov), emerging from animal reservoirs, have led to global epidemics with high morbidity and mortality 2. for example, 8098 individuals were infected and 774 died in the sars-cov pandemic, which cost the global economy an estimated $30 to $100 billion 3, 4. according to the world health organization (who), as of november 2019, mers-cov has had a total of 2494 diagnosed cases causing 858 deaths, the majority in saudi arabia 2. in december 2019, the third pathogenic hcov, named 2019 novel coronavirus (2019-ncov/sars-cov-2), the cause of coronavirus disease 2019 (abbreviated as covid-19) 5, was found in wuhan, china. as of 24 february 2020, there have been over 79,000 cases with over 2600 deaths in the 2019-ncov/sars-cov-2 outbreak worldwide; furthermore, human-to-human transmission has occurred among close contacts 6. however, there are currently no effective medications against 2019-ncov/sars-cov-2. several national and international research groups are working on the development of vaccines to prevent and treat 2019-ncov/sars-cov-2, but effective vaccines are not available yet. there is an urgent need for the development of effective prevention and treatment strategies for the 2019-ncov/sars-cov-2 outbreak. although investment in biomedical and pharmaceutical research and development has increased significantly over the past two decades, the annual number of new treatments approved by the u.s. food and drug administration (fda) has remained relatively constant and limited 7. a recent study estimated that pharmaceutical companies spent $2.6 billion in 2015, up from $802 million in 2003, on the development of an fda-approved new chemical entity drug 8. drug repurposing, an effective drug discovery strategy that starts from existing drugs, could significantly shorten the time and reduce the cost compared to de novo drug discovery and randomized clinical trials [9] [10] [11]. however, experimental approaches to drug repurposing are costly and time-consuming 12. computational approaches offer novel testable hypotheses for systematic drug repositioning [9] [10] [11] 13, 14. however, traditional structure-based methods are limited when three-dimensional (3d) structures of proteins are unavailable, which, unfortunately, is the case for the majority of human and viral targets. in addition, targeting single virus proteins often carries a high risk of drug resistance owing to the rapid evolution of virus genomes 1. viruses (including hcovs) require host cellular factors for successful replication during infection 1. systematic identification of virus-host protein-protein interactions (ppis) offers an effective way toward elucidating the mechanisms of viral infection 15, 16. subsequently, targeting cellular antiviral targets, such as the virus-host interactome, may offer a novel strategy for the development of effective treatments for viral infections 1, including sars-cov 17, mers-cov 17, ebola virus 18, and zika virus 14, [19] [20] [21].
we recently presented an integrated antiviral drug discovery pipeline that incorporated gene-trap insertional mutagenesis, a known functional drug-gene network, and bioinformatics analyses 14. this methodology allowed us to identify several candidate repurposable drugs for ebola virus 11, 14. our work over the last decade has demonstrated how network strategies can, for example, be used to identify effective repurposable drugs 13, [22] [23] [24] [25] [26] [27] and drug combinations 28 for multiple human diseases. for example, network-based drug-disease proximity sheds light on the relationship between drugs (e.g., drug targets) and disease modules (molecular determinants in disease pathobiology modules within the ppis), and can serve as a useful tool for efficient screening of potentially new indications for approved drugs, as well as drug combinations, as demonstrated in our recent studies 13, 23, 27, 28. in this study, we present an integrative antiviral drug repurposing methodology, which combines a systems pharmacology-based network medicine platform that quantifies the interplay between the virus-host interactome and drug targets in the human ppi network. the basis for these experiments rests on the notions that (i) the proteins that functionally associate with viral infection (including hcov) are localized in the corresponding subnetwork within the comprehensive human ppi network, and (ii) proteins that serve as drug targets for a specific disease may also be suitable drug targets for potential antiviral treatment owing to common ppis and functional pathways elucidated by the human interactome (fig. 1). we follow this analysis with bioinformatics validation of drug-induced gene signatures and hcov-induced transcriptomics in human cell lines to inspect the postulated mechanism-of-action in a specific hcov for which we propose repurposing (fig. 1). to date, seven pathogenic hcovs (fig. 2a, b) have been found 1, 29: (i) 2019-ncov/sars-cov-2, sars-cov, mers-cov, hcov-oc43, and hcov-hku1 belong to the β genus, and (ii) hcov-nl63 and hcov-229e belong to the α genus. we performed phylogenetic analyses using the whole-genome sequence data from 15 hcovs to inspect the evolutionary relationship of 2019-ncov/sars-cov-2 with other hcovs. we found that the whole genomes of 2019-ncov/sars-cov-2 had ~99.99% nucleotide sequence identity across three diagnosed patients (supplementary table s1). 2019-ncov/sars-cov-2 shares the highest nucleotide sequence identity (79.7%) with sars-cov among the six other known pathogenic hcovs, revealing a conserved evolutionary relationship between 2019-ncov/sars-cov-2 and sars-cov (fig. 2a). hcovs have five major protein regions for virus structure assembly and viral replication 29, including the replicase complex (orf1ab), spike (s), envelope (e), membrane (m), and nucleocapsid (n) proteins (fig. 2b). the orf1ab gene encodes the non-structural proteins (nsps) of the viral rna synthesis complex through proteolytic processing 30. nsp12 is a viral rna-dependent rna polymerase that, together with the co-factors nsp7 and nsp8, possesses high polymerase activity. the 3d protein structure of sars-cov nsp12 shows that it contains a large n-terminal extension (which binds to nsp7 and nsp8) and a polymerase domain (fig. 2c). the spike is a transmembrane glycoprotein that plays a pivotal role in mediating viral infection through binding the host receptor 31, 32.
figure 2d shows the 3d structure of the spike protein bound with the host receptor angiotensin-converting enzyme 2 (ace2) in sars-cov (pdb id: 6ack). a recent study showed that 2019-ncov/sars-cov-2 is able to utilize ace2 as an entry receptor in ace2-expressing cells 33, suggesting potential drug targets for therapeutic development. furthermore, the cryo-em structure of the spike and biophysical assays reveal that the 2019-ncov/sars-cov-2 spike binds ace2 with higher affinity than that of sars-cov 34. in addition, the nucleocapsid is an important subunit for packaging the viral genome through protein oligomerization 35; the single nucleocapsid structure is shown in fig. 2e. protein sequence alignment analyses indicated that 2019-ncov/sars-cov-2 was most evolutionarily conserved with sars-cov (supplementary table s2). specifically, the envelope and nucleocapsid proteins of 2019-ncov/sars-cov-2 are two evolutionarily conserved regions, with sequence identities of 96% and 89.6%, respectively, compared to sars-cov (supplementary table s2). however, the spike protein exhibited the lowest sequence conservation (sequence identity of 77%) between 2019-ncov/sars-cov-2 and sars-cov. meanwhile, the spike protein of 2019-ncov/sars-cov-2 has only 31.9% sequence identity compared to mers-cov.

fig. 1 overall workflow of this study. our network-based methodology combines a systems pharmacology-based network medicine platform that quantifies the interplay between the virus-host interactome and drug targets in the human ppi network. a human coronavirus (hcov)-associated host proteins were collected from the literature and pooled to generate a pan-hcov protein subnetwork. b network proximity between drug targets and hcov-associated proteins was calculated to screen for candidate repurposable drugs for hcovs under the human protein interactome model. c, d gene set enrichment analysis was utilized to validate the network-based predictions. e top candidates were further prioritized for drug combinations using a network-based method captured by the "complementary exposure" pattern: the targets of the drugs both hit the hcov-host subnetwork, but target separate neighborhoods in the human interactome network. f overall hypothesis of the network-based methodology: (i) the proteins that functionally associate with hcovs are localized in the corresponding subnetwork within the comprehensive human interactome network; and (ii) proteins that serve as drug targets for a specific disease may also be suitable drug targets for potential antiviral treatment owing to common protein-protein interactions elucidated by the human interactome.

to depict the hcov-host interactome network, we assembled the cov-associated host proteins from four known hcovs (sars-cov, mers-cov, hcov-229e, and hcov-nl63), one mouse mhv, and one avian ibv (n protein) (supplementary table s3). in total, we obtained 119 host proteins associated with covs with various types of experimental evidence. specifically, these host proteins are either the direct targets of hcov proteins or are involved in crucial pathways of hcov infection. the hcov-host interactome network is shown in fig. 3a. we identified several hub proteins, including jun, xpo1, npm1, and hnrnpa1, with the highest number of connections within the 119 proteins. kegg pathway enrichment analysis revealed multiple significant biological pathways (adjusted p value < 0.05), including measles, rna transport, nf-kappa b signaling, epstein-barr virus infection, and influenza (fig. 3b).
gene ontology (go) biological process enrichment analysis further confirmed multiple viral infection-related processes (adjusted p value < 0.001), including viral life cycle, modulation by virus of host morphology or physiology, viral process, positive regulation of viral life cycle, transport of virus, and virion attachment to host cell (fig. 3c). we then mapped the known drug-target network (see materials and methods) onto the hcov-host interactome to search for druggable cellular targets. we found that 47 human proteins (39%, blue nodes in fig. 3a) can be targeted by at least one approved drug or experimental drug under clinical trials. for example, gsk3b, dpp4, smad3, parp1, and ikbkb are the most targetable proteins. the high druggability of the hcov-host interactome motivated us to develop a drug repurposing strategy that specifically targets cellular proteins associated with hcovs for potential treatment of 2019-ncov/sars-cov-2. the basis for the proposed network-based drug repurposing methodology rests on the notion that the proteins that associate with and functionally govern viral infection are localized in the corresponding subnetwork (fig. 1a) within the comprehensive human interactome network. for a drug with multiple targets to be effective against an hcov, its target proteins should be within or in the immediate vicinity of the corresponding subnetwork in the human protein-protein interactome (fig. 1), as we have demonstrated in multiple diseases 13, 22, 23, 28 using this network-based strategy. we used a state-of-the-art network proximity measure to quantify the relationship between the hcov-specific subnetwork (fig. 3a) and drug targets in the human interactome. we constructed a drug-target network by assembling target information for more than 2000 fda-approved or experimental drugs (see materials and methods). to improve the quality and completeness of the human protein interactome network, we integrated ppis with five types of experimental data: (1) binary ppis from 3d protein structures; (2) binary ppis from unbiased high-throughput yeast two-hybrid assays; (3) experimentally identified kinase-substrate interactions; (4) signaling networks derived from experimental data; and (5) literature-derived ppis with various types of experimental evidence (see materials and methods). we used a z-score (z) measure and permutation test to reduce the study bias in the network proximity analyses (including hub nodes in the human interactome network introduced by literature-derived ppi data bias), as described in our recent studies 13, 28. in total, we computationally identified 135 drugs that were associated (z < −1.5 and p < 0.05, permutation test) with the hcov-host interactome (fig. 4a, supplementary tables s4 and s5). to assess the bias of pooling cellular proteins from six covs, we further calculated the network proximities of all the drugs for the four covs with a large number of known host proteins, including sars-cov, mers-cov, ibv, and mhv, separately. we found that the z-scores were consistent between the pooled 119 hcov-associated proteins and the four individual covs (fig. 4b). the pearson correlation coefficients between the proximities of all the drugs for the pooled hcov set and those for sars-cov, mers-cov, ibv, and mhv are 0.926 (p < 0.001, t distribution), 0.503 (p < 0.001), 0.694 (p < 0.001), and 0.829 (p < 0.001), respectively. these network proximity analyses offer putative repurposable candidates for potential prevention and treatment of hcovs.
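a minimal sketch of the consistency check just reported is given below; it assumes the per-drug proximity z-scores have been collected into a pandas dataframe with one column per virus set ("pooled", "sars-cov", "mers-cov", "ibv", "mhv"), an illustrative data layout rather than the authors' own.

```python
# a minimal sketch: pearson correlation of drug proximity z-scores between the
# pooled hcov protein set and each individual cov.
import pandas as pd
from scipy.stats import pearsonr

def proximity_consistency(z: pd.DataFrame, ref: str = "pooled") -> pd.DataFrame:
    records = []
    for col in z.columns:
        if col == ref:
            continue
        r, p = pearsonr(z[ref], z[col])
        records.append({"virus": col, "pearson_r": r, "p_value": p})
    return pd.DataFrame(records)
```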
to further validate the 135 repurposable drugs against hcovs, we first performed gene set enrichment analysis (gsea) using transcriptome data of mers-cov- and sars-cov-infected host cells (see methods). these transcriptome data were used as gene signatures for hcovs. additionally, we downloaded the gene expression data of drug-treated human cell lines from the connectivity map (cmap) database 36 to obtain drug-gene signatures. we calculated a gsea score (see methods) for each drug and used this score as an indication of bioinformatics validation of the 135 drugs. specifically, an enrichment score (es) was calculated for each hcov data set, and es > 0 and p < 0.05 (permutation test) was used as the cut-off for a significant association of gene signatures between a drug and a specific hcov data set. the gsea score, ranging from 0 to 3, is the number of data sets that met these criteria for a specific drug.

fig. 4 a discovered drug-hcov network. a a subnetwork highlighting network-predicted drug-hcov associations connecting 135 drugs and hcovs. from the 2938 drugs evaluated, 135 achieved significant proximities between drug targets and the hcov-associated proteins in the human interactome network. drugs are colored by the first level of their anatomical therapeutic chemical (atc) classification system code. b a heatmap highlighting network proximity values for sars-cov, mers-cov, ibv, and mhv, respectively. the color key denotes network proximity (z-score) between drug targets and the hcov-associated proteins in the human interactome network. p values were computed by permutation test.

mesalazine (an approved drug for inflammatory bowel disease), sirolimus (an approved immunosuppressive drug), and equilin (an approved agonist of the estrogen receptor for menopausal symptoms) achieved the highest gsea scores of 3, followed by paroxetine and melatonin with gsea scores of 2. we next selected 16 high-confidence repurposable drugs (fig. 5a and table 1) against hcovs using subject matter expertise based on a combination of factors: (i) strength of the network-predicted associations (a smaller network proximity score in supplementary table s4); (ii) validation by gsea analyses; (iii) literature-reported antiviral evidence; and (iv) fewer clinically reported side effects. below we showcase several selected repurposable drugs with literature-reported antiviral evidence. overexpression of the estrogen receptor has been shown to play a crucial role in inhibiting viral replication 37. selective estrogen receptor modulators (serms) have been reported to play a broader role in inhibiting viral replication through non-classical pathways associated with the estrogen receptor 37. serms interfere at the post-viral-entry step and affect the triggering of fusion, as the serms' antiviral activity can still be observed in the absence of detectable estrogen receptor expression 18. toremifene (z = −3.23, fig. 5a), a first-generation nonsteroidal serm, exhibits potential effects in blocking various viral infections, including mers-cov, sars-cov, and ebola virus, in established cell lines 17, 38. compared to the classical esr1-related antiviral pathway, toremifene prevents fusion between the viral and endosomal membranes by interacting with and destabilizing the virus membrane glycoprotein, eventually inhibiting viral replication 39.
as shown in fig. 5b, toremifene potentially affects several key host proteins associated with hcov, such as rpl19, hnrnpa1, npm1, eif3i, eif3f, and eif3e 40, 41. equilin (z = −2.52 and gsea score = 3), an estrogenic steroid produced by horses, has also been shown to have moderate activity in inhibiting the entry of zaire ebola virus glycoprotein and human immunodeficiency virus (zebov-gp/hiv) 18. altogether, network-predicted serms (such as toremifene and equilin) offer candidate repurposable drugs for 2019-ncov/sars-cov-2. angiotensin receptor blockers (arbs) have been reported to be associated with viral infection, including hcovs [42] [43] [44]. irbesartan (z = −5.98), a typical arb, was approved by the fda for the treatment of hypertension and diabetic nephropathy. here, network proximity analysis shows a significant association between irbesartan's targets and hcov-associated host proteins in the human interactome. as shown in fig. 5c, irbesartan targets slc10a1, encoding the sodium/bile acid cotransporter (ntcp) protein, which has been identified as a functional pres1-specific receptor for the hepatitis b virus (hbv) and the hepatitis delta virus (hdv). irbesartan can inhibit ntcp, thus inhibiting viral entry 45, 46. slc10a1 interacts with c11orf74, a potential transcriptional repressor that interacts with nsp-10 of sars-cov 47. there are several other arbs (such as eletriptan, frovatriptan, and zolmitriptan) whose targets are potentially associated with hcov-associated host proteins in the human interactome. previous studies have confirmed mammalian target of rapamycin complex 1 (mtorc1) as a key factor in regulating the replication of various viruses, including andes orthohantavirus and coronavirus 48, 49. sirolimus (z = −2.35 and gsea score = 3), an inhibitor of mammalian target of rapamycin (mtor), was reported to block viral protein expression and virion release effectively 50. indeed, a recent study revealed a clinical application: sirolimus reduced mers-cov infection by over 60% 51. moreover, sirolimus usage in managing patients with severe h1n1 pneumonia and acute respiratory failure can significantly improve those patients' prognosis 50. mercaptopurine (z = −2.44 and gsea score = 1), an antineoplastic agent with immunosuppressant properties, has been used to treat cancer since the 1950s and has expanded its application to several autoimmune diseases, including rheumatoid arthritis, systemic lupus erythematosus, and crohn's disease 52.

fig. 5 a discovered drug-protein-hcov network for 16 candidate repurposable drugs. a network-predicted evidence and gene set enrichment analysis (gsea) scores for 16 potential repurposable drugs for hcovs. the overall connectivity of the top drug candidates to the hcov-associated proteins was examined. most of these drugs indirectly target hcov-associated proteins via the human protein-protein interaction networks. all the drug-target-hcov-associated protein connections were examined, and those proteins with at least five connections are shown. the box heights for the proteins indicate the number of connections. gsea scores for eight drugs were not available (na) due to the lack of transcriptome profiles for the drugs. b-e inferred mechanism-of-action networks for four selected drugs: b toremifene (a first-generation nonsteroidal selective estrogen receptor modulator), c irbesartan (an angiotensin receptor blocker), d mercaptopurine (an antimetabolite antineoplastic agent with immunosuppressant properties), and e melatonin (a biogenic amine for treating circadian rhythm sleep disorders).
recent in vitro and in vivo studies have also identified mercaptopurine as a selective inhibitor of sars-cov and mers-cov through targeting of the papain-like protease 53, 54. mechanistically, mercaptopurine potentially targets several host proteins in hcovs, such as jun, pabpc1, npm1, and ncl 40, 55 (fig. 5d). inflammatory pathways play essential roles in viral infections 56, 57. as a biogenic amine, melatonin (n-acetyl-5-methoxytryptamine) (z = −1.72 and gsea score = 2) plays a key role in various biological processes and offers a potential strategy in the management of viral infections 58, 59. viral infections are often associated with immune-inflammatory injury, in which the level of oxidative stress increases significantly and negatively affects the function of multiple organs 60. the antioxidant effect of melatonin makes it a putative candidate drug to relieve patients' clinical symptoms in antiviral treatment, even though melatonin cannot eradicate or even curb viral replication or transcription 61, 62. in addition, the application of melatonin may prolong patients' survival time, which may give patients' immune systems a chance to recover and eventually eradicate the virus. as shown in fig. 5e, melatonin indirectly targets several hcov cellular targets, including ace2, bcl2l1, jun, and ikbkb. eplerenone (z = −1.59), an aldosterone receptor antagonist, is reported to have an anti-inflammatory effect similar to that of melatonin. by inhibiting mast-cell-derived proteinases and suppressing fibrosis, eplerenone can improve the survival of mice infected with encephalomyocarditis virus 63. in summary, our network proximity analyses offer multiple candidate repurposable drugs that target diverse cellular pathways for potential prevention and treatment of 2019-ncov/sars-cov-2. however, further preclinical experiments 64 and clinical trials are required to verify the clinical benefits of these network-predicted candidates before clinical use. drug combinations, offering increased therapeutic efficacy and reduced toxicity, play an important role in treating various viral infections 65. however, our ability to identify and validate effective combinations is limited by a combinatorial explosion, driven by both the large number of drug pairs and of dosage combinations. in our recent study, we proposed a novel network-based methodology to identify clinically efficacious drug combinations 28. relying on approved drug combinations for hypertension and cancer, we found that a drug combination was therapeutically effective only if it was captured by the "complementary exposure" pattern: the targets of the drugs both hit the disease module, but target separate neighborhoods (fig. 6a). here we sought to identify drug combinations that may provide a synergistic effect in potentially treating 2019-ncov/sars-cov-2, with well-defined mechanisms-of-action, by network analysis. for the 16 potential repurposable drugs (fig. 5a, table 1), we showcase three network-predicted candidate drug combinations for 2019-ncov/sars-cov-2; all predicted possible combinations can be found in supplementary table s6. sirolimus, an inhibitor of mtor with both antifungal and antineoplastic properties, has been demonstrated to improve outcomes in patients with severe h1n1 pneumonia and acute respiratory failure 50.
mtor signaling plays an essential role in mers-cov infection 66. dactinomycin, also known as actinomycin d, is an approved rna synthesis inhibitor for the treatment of various cancer types. an early study showed that dactinomycin (1 μg/ml) inhibited the growth of feline enteric cov 67. as shown in fig. 6b, our network analysis shows that sirolimus and dactinomycin synergistically target the hcov-associated host protein subnetwork through the "complementary exposure" pattern, offering potential combination regimens for treatment of hcov. specifically, sirolimus and dactinomycin may inhibit both mtor signaling and the rna synthesis pathway (including dna topoisomerase 2-alpha (top2a) and dna topoisomerase 2-beta (top2b)) in hcov-infected cells (fig. 6b). toremifene is among the approved first-generation nonsteroidal serms for the treatment of metastatic breast cancer 68. serms (including toremifene) inhibited ebola virus infection 18 by interacting with and destabilizing the ebola virus glycoprotein 39. in vitro assays have demonstrated that toremifene inhibited the growth of mers-cov 17, 69 and sars-cov 38 (table 1). emodin, an anthraquinone derivative extracted from the roots of rheum tanguticum, has been reported to have various antiviral effects. specifically, emodin inhibited the sars-cov-associated 3a protein 70 and blocked the interaction between the sars-cov spike protein and ace2 (ref. 71). altogether, network analyses and published experimental data suggest that combining toremifene and emodin offers a potential therapeutic approach for 2019-ncov/sars-cov-2 (fig. 6c). as shown in fig. 5a, the targets of both mercaptopurine and melatonin show strong network proximity to hcov-associated host proteins in the human interactome network. recent in vitro and in vivo studies identified mercaptopurine as a selective inhibitor of both sars-cov and mers-cov by targeting the papain-like protease 53, 54. melatonin has been reported as a potential antiviral agent via its anti-inflammatory and antioxidant effects [58] [59] [60] [61] [62]. melatonin indirectly regulates the expression of ace2, a key entry receptor involved in viral infection by hcovs, including 2019-ncov/sars-cov-2 (ref. 33). specifically, melatonin was reported to inhibit calmodulin, and calmodulin interacts with ace2 by inhibiting shedding of its ectodomain, a key infectious process of sars-cov 72, 73. jun, also known as c-jun, is a key host protein involved in hcov infectious bronchitis virus infection 74. as shown in fig. 6d, mercaptopurine and melatonin may synergistically block c-jun signaling by acting on multiple cellular targets. in summary, the combination of mercaptopurine and melatonin may offer a potential combination therapy for 2019-ncov/sars-cov-2 by synergistically targeting the papain-like protease, ace2, c-jun signaling, and anti-inflammatory pathways (fig. 6d). however, further experimental observation of the effects of melatonin on ace2 pathways in 2019-ncov/sars-cov-2 is highly warranted. in this study, we presented a network-based methodology for systematic identification of putative repurposable drugs and drug combinations for potential treatment of 2019-ncov/sars-cov-2. integration of drug-target networks, hcov-host interactions, hcov-induced transcriptomes in human cell lines, and the human protein-protein interactome network is essential for such identification. based on comprehensive evaluation, we prioritized 16 candidate repurposable drugs (fig. 5) and 3 potential drug combinations (fig. 6) for targeting 2019-ncov/sars-cov-2.
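a minimal sketch of the "complementary exposure" screen used to nominate such drug pairs is given below; it assumes that the proximity z-scores of each drug's targets to the hcov module (z_ca, z_cb) and the network separation of the two target sets (s_ab, defined in materials and methods) have already been computed, and the simple thresholds z < 0 and s_ab > 0 are our reading of the pattern in fig. 6a rather than the full selection pipeline.

```python
# a minimal sketch: keep drug pairs whose targets both hit the hcov module
# (negative proximity z-scores) but occupy separate neighborhoods (s_ab > 0).
def complementary_exposure(pairs):
    """pairs: iterable of dicts with keys drug_a, drug_b, z_ca, z_cb, s_ab."""
    return [p for p in pairs
            if p["z_ca"] < 0 and p["z_cb"] < 0 and p["s_ab"] > 0]

# example with hypothetical values for z_cb and s_ab:
# complementary_exposure([{"drug_a": "sirolimus", "drug_b": "dactinomycin",
#                          "z_ca": -2.35, "z_cb": -1.8, "s_ab": 0.3}])
```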
however, although the majority of predictions have been validated by various literature data (table 1), all network-predicted repurposable drugs and drug combinations must be validated in various 2019-ncov/sars-cov-2 experimental assays 64 and randomized clinical trials before being used in patients. we acknowledge several limitations in the current study. although 2019-ncov/sars-cov-2 shares high nucleotide sequence identity with other hcovs (fig. 2), our predictions are not 2019-ncov/sars-cov-2-specific, owing to the lack of known host proteins for 2019-ncov/sars-cov-2. we used a low binding affinity value of 10 μm as the threshold to define a physical drug-target interaction. however, a stronger binding affinity threshold (e.g., 1 μm) may be a more suitable cut-off in drug discovery, although it will generate a smaller drug-target network. although sizeable efforts were made to assemble large-scale, experimentally reported drug-target networks from publicly available databases, the network data may be incomplete and some drug-target interactions may be functional associations instead of physical bindings. for example, silvestrol, a natural product from the flavaglines, was found to have antiviral activity against ebola 75 and coronaviruses 76. after adding its target, the rna helicase enzyme eif4a 76, silvestrol was predicted to be significantly associated with hcovs (z = −1.24, p = 0.041) by network proximity analysis. to increase the coverage of drug-target networks, we may use computational approaches to systematically predict further drug-target interactions 25, 26. in addition, the collected virus-host interactions are far from complete, and their quality can be influenced by multiple factors, including different experimental assays and human cell line models. we may computationally predict a new virus-host interactome for 2019-ncov/sars-cov-2 using sequence-based and structure-based approaches 77. drug targets representing nodes within cellular networks are often intrinsically coupled with both therapeutic and adverse profiles 78, as drugs can inhibit or activate protein functions (including antagonists vs. agonists). the current systems pharmacology model cannot separate therapeutic (antiviral) effects from adverse effects in these predictions, owing to the lack of detailed pharmacological effects of drug targets and the unknown functional consequences of virus-host interactions. comprehensive identification of the virus-host interactome for 2019-ncov/sars-cov-2, with specific biological effects determined using functional genomics assays 79, 80, will further improve the accuracy of the proposed network-based methodologies. owing to the lack of complete drug-target information (such as the molecular "promiscuity" of drugs), the dose-response and dose-toxicity effects for both repurposable drugs and drug combinations cannot be identified in the current network models.

fig. 6 network-based rational design of drug combinations for 2019-ncov/sars-cov-2. a the possible exposure modes of the hcov-associated protein module to the pairwise drug combinations. an effective drug combination will be captured by the "complementary exposure" pattern: the targets of the drugs both hit the hcov-host subnetwork, but target separate neighborhoods in the human interactome network. z_ca and z_cb denote the network proximity (z-score) between the targets of drugs a and b, respectively, and a specific hcov. s_ab denotes the separation score (see materials and methods) between the targets of drug a and drug b. b-d inferred mechanism-of-action networks for three selected pairwise drug combinations: b sirolimus (a potent immunosuppressant with both antifungal and antineoplastic properties) plus dactinomycin (an rna synthesis inhibitor for treatment of various tumors), c toremifene (a first-generation nonsteroidal selective estrogen receptor modulator) plus emodin (an experimental drug for the treatment of polycystic kidney disease), and d melatonin (a biogenic amine for treating circadian rhythm sleep disorders) plus mercaptopurine (an antimetabolite antineoplastic agent with immunosuppressant properties).
for example, mesalazine, an approved drug for inflammatory bowel disease, is a top network-predicted repurposable drug associated with hcovs (fig. 5a). yet, several clinical studies have shown potential pulmonary toxicities (including pneumonia) associated with mesalazine usage 81, 82. integration of lung-specific gene expression 23 of 2019-ncov/sars-cov-2 host proteins and physiologically based pharmacokinetic modeling 83 may reduce the side effects of repurposable drugs or drug combinations. preclinical studies are warranted to evaluate in vivo efficacy and side effects before clinical trials. furthermore, we were limited to predicting pairwise drug combinations based on our previous network-based framework 28. however, we expect our methodology to remain a useful network-based tool for predicting combinations of multiple drugs by exploring the network relationships of multiple drugs' targets with the hcov-host subnetwork in the human interactome. finally, we aimed to systematically identify repurposable drugs by specifically targeting ncov host proteins only. thus, our current network models cannot predict repurposable drugs among existing antiviral drugs that target virus proteins only. combining existing antiviral drugs (such as remdesivir 64) with the network-predicted repurposable drugs (fig. 5) or drug combinations (fig. 6) may improve the coverage of the current network-based methodologies by utilizing a multi-layer network framework 16. in conclusion, this study offers a powerful, integrative network-based systems pharmacology methodology for rapid identification of repurposable drugs and drug combinations for the potential treatment of 2019-ncov/sars-cov-2. our approach can minimize the translational gap between preclinical testing results and clinical outcomes, which is a significant problem in the rapid development of efficient treatment strategies for the emerging 2019-ncov/sars-cov-2 outbreak. from a translational perspective, if broadly applied, the network tools developed here could help develop effective treatment strategies for other emerging viral infections and other human complex diseases as well. in total, we collected dna sequences and protein sequences for 15 hcovs, including three of the most recent 2019-ncov/sars-cov-2 genomes, from the ncbi genbank database (28 january 2020, supplementary table s1). whole-genome alignment and protein sequence identity calculations were performed by multiple sequence alignment using the embl-ebi database (https://www.ebi.ac.uk/) with default parameters. the neighbor-joining (nj) tree was computed from the pairwise phylogenetic distance matrix using mega x 84 with 1000 bootstrap replicates. the protein alignments and phylogenetic trees of hcovs were constructed with mega x 84.
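the percent sequence identities quoted in the results can be derived from any pairwise or multiple alignment; the sketch below is an illustrative calculation on two pre-aligned sequences and is not the embl-ebi/mega x pipeline actually used in the paper.

```python
# a minimal sketch: percent identity between two aligned sequences,
# ignoring gap columns.
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    assert len(aligned_a) == len(aligned_b), "inputs must come from one alignment"
    matches = compared = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue                      # skip alignment gaps
        compared += 1
        matches += (x == y)
    return 100.0 * matches / compared if compared else 0.0

# e.g. percent_identity("atg-ca", "atgtca") -> 100.0
```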
we collected hcov-host protein interactions from the literature through sizeable curation efforts. the hcov-associated host proteins of several hcovs, including sars-cov, mers-cov, ibv, mhv, hcov-229e, and hcov-nl63, were pooled. these proteins were either the direct targets of hcov proteins or were involved in critical pathways of hcov infection identified by multiple experimental sources, including high-throughput yeast two-hybrid (y2h) systems, viral protein pull-down assays, in vitro co-immunoprecipitation and rna knockdown experiments. in total, the virus-host interaction network included 6 hcovs with 119 host proteins (supplementary table s3). next, we performed kyoto encyclopedia of genes and genomes (kegg) and gene ontology (go) enrichment analyses to evaluate the biological relevance and functional pathways of the hcov-associated proteins. all functional analyses were performed using enrichr 85. we collected drug-target interaction information from the drugbank database (v4.3) 86, the therapeutic target database (ttd) 87, the pharmgkb database, chembl (v20) 88, bindingdb 89, and the iuphar/bps guide to pharmacology 90. the chemical structure of each drug in smiles format was extracted from drugbank 86. drug-target interactions meeting the following three criteria were used: (i) binding affinities, including k_i, k_d, ic_50, or ec_50, each ≤10 μm; (ii) the target was marked as "reviewed" in the uniprot database 91; and (iii) the human target was represented by a unique uniprot accession number. the details for building the experimentally validated drug-target network are provided in our recent studies 13, 23, 28. to build a comprehensive list of human ppis, we assembled data from a total of 18 bioinformatics and systems biology databases with five types of experimental evidence: (i) binary ppis tested by high-throughput yeast two-hybrid (y2h) systems; (ii) binary, physical ppis from protein 3d structures; (iii) kinase-substrate interactions from literature-derived low-throughput or high-throughput experiments; (iv) signaling networks from literature-derived low-throughput experiments; and (v) literature-curated ppis identified by affinity purification followed by mass spectrometry (ap-ms), y2h, or literature-derived low-throughput experiments. all inferred data, including evolutionary analysis, gene expression data, and metabolic associations, were excluded. the genes were mapped to their entrez ids based on the ncbi database 92 as well as to their official gene symbols based on genecards (https://www.genecards.org/). in total, the resulting human protein-protein interactome used in this study includes 351,444 unique ppis (edges or links) connecting 17,706 proteins (nodes), representing a 50% increase in the number of ppis we have used previously. detailed descriptions of how the human protein-protein interactome was built are provided in our previous studies 13, 23, 28, 93. we posit that the human ppis provide an unbiased, rational roadmap for repurposing drugs for potential treatment of hcovs, an indication for which they were not originally approved. given C, the set of host genes associated with a specific hcov, and T, the set of drug targets, we computed the network proximity of C with the target set T of each drug using the "closest" method:

d(C, T) = (1/‖T‖) Σ_{t∈T} min_{c∈C} d(c, t),

where d(c, t) is the shortest path distance between gene c and gene t in the human protein interactome.
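a minimal sketch of this proximity measure, together with the permutation z-score described in the next paragraph, is shown below; for brevity it samples random gene sets uniformly, whereas the paper matches the degree distributions of the original sets, and it assumes all genes lie in one connected component of the interactome.

```python
# a minimal sketch: "closest" network proximity and its permutation z-score.
import random
import numpy as np
import networkx as nx

def closest_distance(g, hcov_genes, targets):
    # mean over drug targets of the shortest path to the nearest hcov gene
    return float(np.mean([
        min(nx.shortest_path_length(g, t, c) for c in hcov_genes)
        for t in targets
    ]))

def proximity_z(g, hcov_genes, targets, n_perm=1000, seed=0):
    rng = random.Random(seed)
    nodes = list(g.nodes())
    observed = closest_distance(g, hcov_genes, targets)
    null = [closest_distance(g,
                             rng.sample(nodes, len(hcov_genes)),
                             rng.sample(nodes, len(targets)))
            for _ in range(n_perm)]
    return (observed - np.mean(null)) / np.std(null)
```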
the network proximity d(c, t) was converted to a z-score based on permutation tests, z = (d(c, t) − d_r)/σ_r, where d_r and σ_r were the mean and standard deviation of the proximities obtained from the permutation test repeated 1000 times, each time with two randomly selected gene lists with degree distributions similar to those of c and t. the corresponding p value was calculated based on the permutation test results. a z-score < −1.5 and p < 0.05 were considered to indicate significantly proximal drug-hcov associations. all networks were visualized using gephi 0.9.2 (https://gephi.org/). for this network-based approach to drug combinations to be effective, we needed to establish whether the topological relationship between two drug-target modules reflects their biological and pharmacological relationships, while also quantifying the network-based relationship between the drug targets and the hcov-associated host proteins (drug-drug-hcov combinations). to identify potential drug combinations, we combined the top lists of drugs. then, the "separation" measure s_ab = <d_ab> − (<d_aa> + <d_bb>)/2 was calculated for each pair of drugs a and b, where each of the mean distances <d_ab>, <d_aa>, and <d_bb> was calculated based on the "closest" method. our key methodological assumption is that a drug combination is therapeutically effective only if it follows a specific relationship to the disease module, as captured by complementary exposure patterns in the target modules of both drugs, without overlapping toxic mechanisms 28 . we performed gene set enrichment analysis (gsea) as an additional prioritization method. we first collected three differential gene expression data sets of hosts infected by hcovs from the ncbi gene expression omnibus (geo). among them, two transcriptome data sets were sars-cov-infected samples from patients' peripheral blood 94 (gse1739) and calu-3 cells 95 (gse33267), respectively. one transcriptome data set was from mers-cov-infected calu-3 cells 96 (gse122876). genes with an adjusted p value less than 0.01 were defined as differentially expressed. these data sets were used as hcov-host signatures to evaluate the treatment effects of drugs. differential gene expression in cells treated with various drugs was retrieved from the connectivity map (cmap) database 36 and used as the gene profiles of the drugs. for each drug that was present in both the cmap data set and our drug-target network, we calculated an enrichment score (es) for each hcov signature data set based on previously described methods 97 , where j = 1, 2, …, s are the genes of the hcov signature data set sorted in ascending order by their rank in the gene profile of the drug being evaluated. the rank of gene j is denoted by v(j), where 1 ≤ v(j) ≤ r, with r being the number of genes (12,849) in the drug profile. then, es_up/down was set to a_up/down if a_up/down > b_up/down, and was set to −b_up/down if b_up/down > a_up/down. permutation tests repeated 100 times using randomly generated gene lists with the same number of up- and down-regulated genes as the hcov signature data set were performed to measure the significance of the es scores. drugs were considered to have a potential treatment effect if es > 0 and p < 0.05, and the number of such hcov signature data sets was used as the final gsea score, which ranges from 0 to 3.
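neither the permutation z-score nor the separation measure is written out above; the sketch below shows one plausible reading of both, with the degree-preserving randomization of the study simplified to uniform node sampling, so the details should be treated as assumptions rather than the paper's exact procedure.

```python
# Hedged sketch of the permutation z-score and the separation measure s_AB;
# degree-preserving randomization is simplified to uniform node sampling, and
# a connected interactome graph is assumed.
import random
import networkx as nx

def mean_closest(G, S, T, exclude_self=False):
    """Mean over s in S of the distance to the nearest node of T; exclude_self=True
    is used for the within-drug terms <d_AA> and <d_BB>."""
    total = 0.0
    for s in S:
        candidates = [t for t in T if not (exclude_self and t == s)]
        total += min(nx.shortest_path_length(G, s, t) for t in candidates)
    return total / len(S)

def proximity_z_score(G, disease_genes, drug_targets, n_perm=1000, seed=0):
    """z = (d_observed - mean(d_random)) / std(d_random) over random gene sets."""
    rng = random.Random(seed)
    d_obs = mean_closest(G, drug_targets, disease_genes)
    nodes = list(G.nodes())
    d_rand = [mean_closest(G, rng.sample(nodes, len(drug_targets)),
                           rng.sample(nodes, len(disease_genes)))
              for _ in range(n_perm)]
    mu = sum(d_rand) / n_perm
    sigma = (sum((x - mu) ** 2 for x in d_rand) / n_perm) ** 0.5
    return (d_obs - mu) / sigma

def separation(G, targets_a, targets_b):
    """s_AB = <d_AB> - (<d_AA> + <d_BB>) / 2 (form assumed from the cited framework)."""
    d_ab = (mean_closest(G, targets_a, targets_b) +
            mean_closest(G, targets_b, targets_a)) / 2
    d_aa = mean_closest(G, targets_a, targets_a, exclude_self=True)
    d_bb = mean_closest(G, targets_b, targets_b, exclude_self=True)
    return d_ab - (d_aa + d_bb) / 2
```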
coronaviruses-drug discovery and therapeutic options coronavirus infections-more than just the common cold sars and mers: recent insights into emerging coronaviruses host factors in coronavirus replication epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in wuhan, china: a descriptive study early transmission dynamics in wuhan, china, of novel coronavirusinfected pneumonia putting the patient back together-social medicine, network medicine, and the limits of reductionism the $2.6 billion pill-methodologic and policy considerations silico oncology drug repositioning and polypharmacology individualized network-based drug repositioning infrastructure for precision oncology in the panomics era drug repurposing: new treatments for zika virus infection? a comprehensive map of molecular drug targets network-based approach to prediction and population-based validation of in silico drug repurposing systems biology-based investigation of cellular antiviral drug targets identified by gene-trap insertional mutagenesis understanding human-virus protein-protein interactions using a human protein complex-based analysis framework. msystems computational network biology: data, models, and applications repurposing of clinically developed drugs for treatment of middle east respiratory syndrome coronavirus infection fda-approved selective estrogen receptor modulators inhibit ebola virus infection repurposing of the antihistamine chlorcyclizine and related compounds for treatment of hepatitis c virus infection a screen of fda-approved drugs for inhibitors of zika virus infection identification of small-molecule inhibitors of zika virus infection and induced neural cell death via a drug repurposing screen prediction of drug-target interactions and drug repositioning via network-based inference a genome-wide positioning systems network algorithm for in silico drug repurposing deepdr: a network-based deep learning approach to in silico drug repositioning target identification among known drugs by deep learning from heterogeneous networks network-based prediction of drug-target interactions using an arbitrary-order proximity embedded deep forest network-based translation of gwas findings to pathobiology and drug repurposing for alzheimer's disease network-based prediction of drug combinations molecular evolution of human coronavirus genomes structure of the sars-cov nsp12 polymerase bound to nsp7 and nsp8 co-factors structure of sars coronavirus spike receptor-binding domain complexed with receptor genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding a pneumonia outbreak associated with a new coronavirus of probable bat origin cryo-em structure of the 2019-ncov spike in the prefusion conformation transient oligomerization of the sars-cov n protein-implication for virus ribonucleoprotein packaging the connectivity map: using gene-expression signatures to connect small molecules, genes, and disease a structure-informed atlas of human-virus interactions screening of an fda-approved compound library identifies four small-molecule inhibitors of middle east respiratory syndrome coronavirus replication in cell culture toremifene interacts with and destabilizes the ebola virus glycoprotein the cellular interactome of the coronavirus infectious bronchitis virus nucleocapsid protein and functional implications for virus biology determination of host proteins composing the microenvironment of coronavirus replicase 
complexes by proximity-labeling the central role of angiotensin i-converting enzyme in vertebrate pathophysiology effect of the angiotensin ii receptor blocker olmesartan on the development of murine acute myocarditis caused by coxsackievirus b3 the impact of statin and angiotensin-converting enzyme inhibitor/angiotensin receptor blocker therapy on cognitive function in adults with human immunodeficiency virus infection irbesartan, an fda approved drug for hypertension and diabetic nephropathy, is a potent inhibitor for hepatitis b virus entry by disturbing na(+)-dependent taurocholate cotransporting polypeptide activity the fda-approved drug irbesartan inhibits hbv-infection in hepg2 cells stably expressing sodium taurocholate co-transporting polypeptide identification of a novel transcriptional repressor (hepis) that interacts with nsp-10 of sars coronavirus host mtorc1 signaling regulates andes virus replication host cell mtorc1 is required for hcv rna replication adjuvant treatment with a mammalian target of rapamycin inhibitor, sirolimus, and steroids improves outcomes in patients with severe h1n1 pneumonia and acute respiratory failure middle east respiratory syndrome and severe acute respiratory syndrome: current therapeutic options and potential targets for novel therapies thiopurines in current medical practice: molecular mechanisms and contributions to therapy-related cancer thiopurine analogue inhibitors of severe acute respiratory syndrome-coronavirus papain-like protease, a deubiquitinating and deisgylating enzyme thiopurine analogs and mycophenolic acid synergistically inhibit the papain-like protease of middle east respiratory syndrome coronavirus interaction of the coronavirus nucleoprotein with nucleolar antigens and the host cell bird flu"), inflammation and anti-inflammatory/ analgesic drugs the development of anti-inflammatory drugs for infectious diseases melatonin: its possible role in the management of viral infections-a brief review melatonin in bacterial and viral infections with focus on sepsis: a review ebola virus disease: potential use of melatonin as a treatment one molecule, many derivatives: a never-ending interaction of melatonin with reactive oxygen and nitrogen species? 
on the free radical scavenging activities of melatonin's metabolites, afmk and amk anti-inflammatory effects of eplerenone on viral myocarditis remdesivir and chloroquine effectively inhibit the recently emerged novel coronavirus (2019-ncov) in vitro systematic identification of synergistic drug pairs targeting hiv antiviral potential of erk/mapk and pi3k/akt/mtor signaling modulation for middle east respiratory syndrome coronavirus infection as identified by temporal kinome analysis differential in vitro inhibition of feline enteric coronavirus and feline infectious peritonitis virus by actinomycin d toremifene is an effective and safe alternative to tamoxifen in adjuvant endocrine therapy for breast cancer: results of four randomized trials mers-cov pathogenesis and antiviral efficacy of licensed drugs in human monocyte-derived antigen-presenting cells emodin inhibits current through sars-associated coronavirus 3a protein emodin blocks the sars coronavirus spike protein and angiotensin-converting enzyme 2 interaction calmodulin interacts with angiotensin-converting enzyme-2 (ace2) and inhibits shedding of its ectodomain modulation of intracellular calcium and calmodulin by melatonin in mcf-7 human breast cancer cells activation of the c-jun nh2-terminal kinase pathway by coronavirus infectious bronchitis virus promotes apoptosis independently of c-jun the natural compound silvestrol is a potent inhibitor of ebola virus replication broad-spectrum antiviral activity of the eif4a inhibitor silvestrol against corona-and picornaviruses review of computational methods for virus-host protein interaction prediction: a case study on novel ebola-human interactions pleiotropic effects of statins: new therapeutic targets in drug design integrative functional genomics of hepatitis c virus infection identifies host dependencies in complete viral replication cycle crispr-cas9 genetic analysis of virushost interactions acute eosinophilic pneumonia related to a mesalazine suppository mesalamine induced eosinophilic pneumonia translational high-dimensional drug interaction discovery and validation using health record databases and pharmacokinetics models mega x: molecular evolutionary genetics analysis across computing platforms enrichr: a comprehensive gene set enrichment analysis web server 2016 update drugbank 4.0: shedding new light on drug metabolism therapeutic target database update 2016: enriched resource for bench to clinical drug target and targeted pathway information chembl: a large-scale bioactivity database for drug discovery bindingdb: a webaccessible database of experimentally determined protein-ligand binding affinities the iuphar/bps guide to pharmacology: an expertdriven knowledgebase of drug targets and their ligands uniprot: the universal protein knowledgebase database resources of the national center for biotechnology information conformational dynamics and allosteric regulation landscapes of germline pten mutations associated with autism compared to those associated with cancer expression profile of immune response genes in patients with severe acute respiratory syndrome cell host response to infection with novel human coronavirus emc predicts potential antivirals and important differences with sars coronavirus srebp-dependent lipidomic reprogramming as a broadspectrum antiviral target discovery and preclinical validation of drug indications using compendia of public gene expression data this work was supported by the national heart, lung, and blood institute of the national 
institutes of health (nih) under award number k99 hl138272 and r00 hl138272 to f.c. the content of this publication does not necessarily reflect the views of the cleveland clinic. key: cord-276178-0hrs1w7r authors: bangotra, deep kumar; singh, yashwant; selwal, arvind; kumar, nagesh; singh, pradeep kumar; hong, wei-chiang title: an intelligent opportunistic routing algorithm for wireless sensor networks and its application towards e-healthcare date: 2020-07-13 journal: sensors (basel) doi: 10.3390/s20143887 sha: doc_id: 276178 cord_uid: 0hrs1w7r the lifetime of a node in wireless sensor networks (wsn) is directly responsible for the longevity of the wireless network. the routing of packets is the most energy-consuming activity for a sensor node. thus, finding an energy-efficient routing strategy for the transmission of packets becomes of utmost importance. the opportunistic routing (or) protocol is one of the new routing protocols that promise reliability and energy efficiency during the transmission of packets in wireless sensor networks (wsn). in this paper, we propose an intelligent opportunistic routing protocol (iop) using a machine learning technique to select a relay node from the list of potential forwarder nodes, to achieve energy efficiency and reliability in the network. the proposed approach might have applications including e-healthcare services. the proposed method might achieve reliability in the network because it can connect several healthcare network devices in a better way, so that good healthcare services might be offered. in addition to this, the proposed method saves energy and therefore helps the remote patient to connect with healthcare services for a longer duration with the integration of iot services. figure 1. sensor node architecture with application in e-healthcare. with the ever-increasing use of the term green computing, the energy efficiency of wsn has seen a considerable rise. recently, an approach for green computing towards iot for energy efficiency has been proposed, which enhances the energy efficiency of wsn [4] . different types of methods and techniques were proposed and developed in the past to address the issue of energy optimization in wsn. another approach, which regulates the challenge of energy optimization in sensor-enabled iot with the use of quantum-based green computing, makes routing efficient and reliable [5] .
the problem of energy efficiency during the routing of data packets from source to target in case of iot-oriented wsn is significantly addressed by another network-based routing protocol known as greedi [6] . it is imperative to mention here that iot is composed of energy-hungry sensor devices. the constraint of energy in sensor nodes has affected the transmission of data from one node to another and therefore, requires boundless methods, policies, and strategies to overcome this challenge [7] . the focus of this paper was to put forward an intelligent opportunistic routing protocol so that the consumption of resources particularly during communication could be optimized, because the sensors 2020, 20, 3887 3 of 21 alleyway taken to transmit a data packet from a source node to the target node is determined by the routing protocol. routing is a complex task in wsn because it is different from designing a routing protocol in traditional networks. in wsn, the important concern is to create an energy-efficient routing strategy to route packet from source to destination, because the nodes in the wsn are always energy-constrained. the problem of energy consumption while routing is managed with the use of a special type of routing protocol known as the opportunistic routing protocol. the opportunistic routing (or) is also known as any path routing that has gained huge importance in the recent years of research in wsn [8] . this protocol exploits the basic feature of wireless networks, i.e., broadcast transmission of data. the earlier routing strategies consider this property of broadcasting as a disadvantage, as it induces interference. the focal notion behind or is to take the benefit of spreading the behavior of the wireless networks such that broadcast from one node can be listened by numerous nodes. rather than selecting the next forwarder node in advance, the or chooses the next forwarder node robustly at the time of data transmission. it was shown that or gives better performance results than traditional routing. in or, the best opportunities are searched to transmit the data packets from source to destination [9] . the hop-by-hop communication pattern is used in the or even when there is no source-to-destination linked route. the or protocols proposed in recent times by different researchers are still belligerent with concerns pertaining to energy efficiency and the reliable delivery of data packets. the proposed or routing protocol given in this paper was specifically meant for wsn, by taking into account the problems that surface during the selection of relay candidates and execution of coordination protocol. the proposed protocol intelligently selects the relay candidates from the forwarder list by using a machine learning technique to achieve energy efficiency. the potential relay node selection is a multi-class with multiple feature-based probabilistic problems, where the inherent selection of relay node is dependent upon each node's characteristics. the selection of a node with various characteristics for a node is a supervised multiclass non-linearly separable problem. in this paper, the relay node selection algorithm is given using naïve baye's machine learning model. the organization of this paper is as follows. section 2 presents the related work in the literature regarding or and protocols. the various types of routing protocols are given in section 3. 
section 4 describes or with examples, followed by the proposed intelligent or algorithm for forwarder node selection in section 5. section 6 depicts the simulation results of the proposed protocol by showing latency, network lifetime, throughput, and energy efficiency. section 7 presents a proposed framework for integration iot with wsn for e-healthcare. this architecture can be useful in many e-healthcare applications. section 8 presents the conclusion and future. achieving reliable delivery of data and energy efficiency are two crucial tasks in wsns. as the sensor nodes are mostly deployed in an unattended environment and the likelihood of any node going out of order is high, the maintenance and management of topology is a rigorous task. therefore, the routing protocol should accommodate the dynamic nature of the wsns. opportunistic routing protocols developed in the recent past years provided trustworthy data delivery but they are still deficient in providing energy-efficient data transmission between the sensor nodes. some latest research on or, experimented by using the formerly suggested routing metrics and they concentrated on mutual cooperation among nodes. geraf [10] (geographic random forwarding) described a novel forwarding technique based on the geographical location of the nodes involved and random selection of the relaying node via contention among receivers. exclusive opportunistic multi-hop routing for wireless networks [11] (exor) is an integrated routing and mac protocol for multi-hop wireless networks, in which the best of multiple receivers forwards each packet. this protocol is based on the expected transmission count (etx) metric. the etx was measured by hop count from the source to the destination and the data packet traveled through the minimum number of hops. exor achieves higher throughput than traditional sensors 2020, 20, 3887 4 of 21 routing algorithms but it still has few limitations. exor contemplates the information accessible at the period of transmission only, and any unfitting information because of recent updates could worsen its performance and could lead to packet duplication. other than this, there is another limitation with exor, as it always seeks coordination among nodes that causes overhead, in case of large networks. minimum transmission scheme-optimal forwarder list selection in opportunistic routing [12] (mts) is another routing protocol that uses mts instead of etx as in exor. the mts-based algorithm gives fewer transmissions as compared to etx-based exor. simple, practical, and effective opportunistic routing for short-haul multi-hop wireless networks [13] . in this protocol, the packet duplication rate was decreased. this is a simple algorithm and can be combined with other opportunistic routing algorithms. spectrum aware opportunistic routing [14] (saor) is another routing protocol for the cognitive radio network. it uses optimal link transmission (olt) as a cost metric for positioning the nodes in the forwarder list. saor gives better qos, reduced end-to-end delay, and improved throughput. energy-efficient opportunistic routing [15] (eeor) calculates the cost for each node to transfer the data packets. the eeor takes less time than exor for sending and receiving the data packets. trusted opportunistic routing algorithm for vanet [16] (tmcor) gives a trust mechanism for opportunistic routing algorithm. it also defines the trade-off between the cost metric and the safety factor. 
a novel socially aware opportunistic routing algorithm in mobile social networks [17] considered three parameters, namely social profile matching, social connectivity matching, and social interaction. this gives a high probability of packet delivery and routing efficiency. ensor-opportunistic routing algorithm for relay node selection in wsns is another algorithm where the concept of an energy-efficient node is implemented [18] . the packet delivery rate of ensor is better than geraf. economy-a duplicate free [19] is the only or protocol that uses token-based coordination. this algorithm ensures the absence of duplicate packet transmissions. with the advent of the latest network technologies, the virtualization of networks along with its related resources has made networks more reliable and efficient. the virtual network functions are used to solve the problems related to service function chains in cloud-fog computing [20] . further, iot works with multiple network domains, and the possibility of compromising the security and confidentiality of data is always inevitable. therefore, the use of virtual networks for service function chains in cloud-fog computing under multiple network domains, leads to saving network resources [21] . in recent times, the cloud of things (cot) has gained immense popularity, due to its ability to offer an enormous amount of resources to wireless networks and heterogeneous mobile edge computing systems. the cot makes the opportunistic decision-making during the online processing of tasks for load sharing, and makes the overall network reliable and efficient [22] . the cloud of things framework can significantly improve communication gaps between cloud resources and other mobile devices. in this paper, the author(s) proposed a methodology for offloading computation in mobile devices, which might reduce failure rates. this algorithm reduces failure rates by improving the control policy. in recent times, wsn used virtualization techniques to offer energy-efficient and fault-tolerant data communication to the immensely growing service domain for iot [23] . with the application of wsn in e-healthcare, the wireless body area network (wban) gained a huge response in the healthcare domain. the wban is used to monitor patient data by using body sensors, and transmits the acquired data, based on the severity of the patients' symptoms, by allocating a channel without contention or with contention [24] . eeor [15] is an energy-efficient protocol that works on transmission power as a major parameter. this protocol discussed two cases that involved constant and dynamic power consumption models. these models are known as non-adjustable and adjustable power models. in the first model, the algorithm calculated the expected cost at each node and made a forwarder list on the source node based on this cost. the forwarder list was sorted in increasing order of expected cost and the first node on the list became the next-hop forwarder. as eeor is an opportunistic routing protocol, broadcasting is utilized and the packets transmitted might be received by each node on the forwarder list. in this, the authors propose algorithms for fixed-power calculation, adjustable power calculation, sensors 2020, 20, 3887 5 of 21 and opportunistic power calculation. this algorithm was compared with exor [11] by simulation in the tossim simulator. the results showed that eeor always calculated the end-to-end cost based on links from the source to destination. 
eeor followed distance vector routing for storing the routing information inside each sensor node. the expected energy consumption cost was updated inside each node, after each round of packet transmission. data delivery was guaranteed in this protocol. additionally, according to the simulation results, packet duplication was significantly decreased. the mdor [25, 26] protocol worked on the distance between the source to relay nodes. in this, the authors proposed an algorithm that calculated the distance to each neighbor from the source node and found out the average distance node. the average distance node was used by the source as a next-hop forwarder. the authors also stated that, to increase the speed and reliability of transmission, the strength of the signal was very important. the signal power depended on the distance between the sender and receiver. if a node sent a packet to the nearest node, then it might take more hops and this would decrease the lifetime of the network. another problem addressed in this protocol was to reduce energy consumption at each node through the dynamic energy consumption model. this model consumed energy according to the packet size and transmitted the packet by amplifying it according to the distance between the source and the relay nodes. mdor always chose the middle position node to optimize energy consumption in amplifying the packets. the mdor simulation results showed that the energy consumption was optimized and it was suitable for certain applications of wsn like environment monitoring, forest fire detection, etc. opportunistic routing introduced the concept of reducing the number of retransmissions to save energy and taking advantage of the broadcasting nature of the wireless networks. with broadcasting, the routing protocol could discover as many paths in the network as possible. the data transmission would take place on any of these paths. if a particular path failed, the transmission could be completed by using some other path, using the forwarder list that had the nodes with the same data packet. the protocols that were responsible for data transmission in wsn were broadly ordered into two sets [2] , namely, (i) old-fashioned routing, and (ii) opportunistic routing. in the traditional routing, also known as old-fashioned routing techniques, the focus was on finding the route with a minimum number of intermediate nodes from the source to the destination, without taking into consideration some of the important factors like throughput, quality of links, reliability, etc. a small comparison [27] of the routing categories is shown in table 1 . as it is clear from the literature that energy consumption of a sensor node had a considerable impact on the lifetime and quality of the wireless sensor network, therefore, it becomes vital to design energy-efficient opportunistic routing protocols to maximize the overall lifetime of the network and also to enhance the quality of the sensor network. there are few methods in the literature listed below that might be useful to save the life of the sensor network. scheduling of duty cycle • energy-efficient medium access control (ee-mac) • energy-efficient routing • node replacements (not possible in unattended environments) of the above-mentioned methods for energy saving, energy-efficient routing is the most central method for the vitality of the wsn. as this method involved the transmission of signals, i.e., receiving and sending, it took about 66.66 percent of the total energy of the network [28] . 
therefore, it became relevant that an opportunistic routing protocol that enhances the vitality of the sensor network be designed to extend the overall life span of the sensor network. or broadcasts a data packet to a set of relay candidates that is overheard by the neighboring nodes, whereas in traditional routing a node is (pre-)selected for each transmission. then, the relay candidates that are part of the forwarder list and that have successfully acknowledged the data packet run a protocol, called the coordination protocol, between themselves to choose the best relay candidate to forward the data packet. in other words, or abstractly comprises three steps: step 1: broadcast a data packet to the relay candidates (this prepares the forwarder list). step 2: select the best relay by using a coordination protocol among the nodes in the forwarder list. step 3: forward the data packet to the selected relay node. considering the example shown in figure 2 , where the source node s sends a packet to the destination node d through nodes r1, r2, r3, r4, and r5: first, s broadcasts a packet, and the relay nodes r1, r2, and r3 might become the forwarder nodes. further, if r2 is chosen as the potential forwarder, then r4 and r5 might become relay nodes. similarly, if r5 is the forwarder node, then it forwards the data packets to the destination node d.
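a toy sketch of this pick, plug, and play loop on a figure-2 style topology is given below; the neighbour lists and link-quality scores are illustrative assumptions, and the coordination step is reduced to choosing the candidate with the best score.

```python
# Toy sketch of the three OR steps (broadcast, coordinate, forward) on a
# figure-2 style topology S -> {R1, R2, R3} -> {R4, R5} -> D.
# Neighbour lists and "link quality" numbers are illustrative only.
NEIGHBOURS = {
    "S":  ["R1", "R2", "R3"],
    "R1": ["R4"],
    "R2": ["R4", "R5"],
    "R3": ["R5"],
    "R4": ["D"],
    "R5": ["D"],
    "D":  [],
}
LINK_QUALITY = {"R1": 0.5, "R2": 0.8, "R3": 0.6, "R4": 0.7, "R5": 0.9, "D": 1.0}

def opportunistic_route(source: str, destination: str) -> list[str]:
    """Repeatedly broadcast, let overhearing nodes form a forwarder list,
    and pick one relay until the destination is reached."""
    path, current = [source], source
    while current != destination:
        forwarder_list = NEIGHBOURS[current]               # step 1: broadcast / overhear
        if not forwarder_list:
            raise RuntimeError(f"no forwarder available at {current}")
        relay = max(forwarder_list, key=LINK_QUALITY.get)  # step 2: coordination
        path.append(relay)                                 # step 3: forward the packet
        current = relay
    return path

print(opportunistic_route("S", "D"))  # e.g. ['S', 'R2', 'R5', 'D']
```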
opportunistic routing derived the following rewards:
• the escalation in reliability. by using this routing strategy, the reliability of the wsn increased significantly, as this protocol transmitted the data packet through any possible link rather than a pre-decided link. therefore, this routing protocol provided additional links that could act as backup links and thus reduced the chances of transmission failure.
• the escalation in transmission range. with this routing protocol, the broadcast nature of the wireless medium provided an upsurge in the transmission range, as data packets were received over all links irrespective of their location and quality. hence, the data transmission could reach the farthest relay node successfully.
in wsn, the sensor nodes could be deployed in two ways, randomly or manually. most applications require the random deployment of nodes in the area under consideration. initially, each node is loaded with the same amount of battery power. as soon as the network starts functioning, the nodes start consuming energy. to make the network energy efficient, the protocol used for transmitting data packets must consume less battery power, and the network model and energy model for calculating the energy consumption should be formulated. in the upcoming subsection, these two models are discussed; they are depicted as assumptions to ensure the smooth working of the protocol. the n sensors are distributed in a square area of size 500 × 500 square meters. this network forms a graph g = (n, m) with the following properties:
• n = {n 1 , n 2 , . . . , n n } is the set of vertices representing sensor nodes.
• m is the set of edges representing the node-to-node links.
the neighboring list nbl(n i ) consists of the nodes that have a direct link to n i . the data traffic is assumed to travel from the sensor nodes toward the base station. if a packet delivery is successful, then the acknowledgment (ack) for the same is considered to travel the same path back to the source. the lifespan of a wsn depends on the endurance of each node while performing network operations. the sensor nodes rely on battery life to perform network operations. the energy cost model considered here is the first-order energy model for wsn [25] . equations (1)-(3) give the combined energy cost of the radio board of a sensor for the communication of data packets, including the energy consumed in the transmission of an n-bit packet up to a distance l; the various terms used in equations (1)-(3) are defined in table 2 . during these operations the sensor board and radio board are fully operational, while the cpu board sleeps and wakes up only for creating messages. the proposed protocol uses these assumptions as preliminaries. a new algorithm is proposed in the next section for solving the issue of energy efficiency and the reliability of opportunistic routing in wsn. let there be n nodes in the wsn, where each node has k neighbors, i.e., n 1 , n 2 , . . . , n k , and each neighbor node is represented by x 1 , x 2 , . . . , x n attributes. in this case, the number of neighbors (k) might vary for different nodes at a particular instance.
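before turning to the relay-selection model, the first-order energy model summarized above (equations (1)-(3), with terms defined in table 2) can be sketched in code; the commonly used free-space form and the constant values below are assumptions, not parameters taken from the paper.

```python
# Minimal sketch of the first-order radio energy model referenced above;
# the constants are typical values used with this model and are assumptions.
E_ELEC = 50e-9      # J/bit consumed by the transmit/receive electronics
EPS_AMP = 100e-12   # J/bit/m^2 consumed by the transmit amplifier

def tx_energy(n_bits: int, distance_m: float) -> float:
    """Energy to transmit an n-bit packet over distance l (free-space loss)."""
    return E_ELEC * n_bits + EPS_AMP * n_bits * distance_m ** 2

def rx_energy(n_bits: int) -> float:
    """Energy to receive an n-bit packet."""
    return E_ELEC * n_bits

# Example: a 4000-bit packet sent 100 m and received by the relay.
print(tx_energy(4000, 100.0), rx_energy(4000))
```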
additionally, it was assumed that the wireless sensor network is spread over an area of 500 × 500 square meters. let us assume that a node a ∈ n and had neighbors as na 1 , na 2 , . . . , na k , with respective features like node id, location, prr (packet reception ratio), residual energy (re) of nodes, and distance (d), which are represented by x 1 , x 2 , . . . , x n , respectively. the goal was to intelligently find a potential relay node a, say ar, such that ar ∈ {na 1 , na 2 , . . . , na k }. in the proposed machine learning-based protocol for the selection of potential forwarder, the packet reception ratio, distance, and outstanding energy of node was taken into consideration. the packet reception ratio (prr) [29] is also sometimes referred to as psr (packet success ratio). the psr was computed as the ratio of the successfully received packets to the sent packets. a similar metric to the prr was the per (packet error ratio), which could be computed as (1-prr). a node loses a particular amount of energy during transmission and reception of packets. accordingly, the residual energy in a node gets decreased [30] . the distance (d) was the distance between the source node and the respective distance of each sensor node in the forwarder list. the potential relay node selection was multi-class, with multiple features-based probabilistic problems, where the inherent selection of the relay node was dependent upon each node feature. the underlying probabilistic-based relay node selection problem could be addressed intelligently by building a machine learning model. the selection of a node with 'n' characteristics for a given node 'a' could be considered a supervised multiclass non-linearly separable problem. in this algorithm, the naïve baye's classifier was used to find the probability of node a to reach one of its neighbors, i.e., {n 1 , n 2 , . . . , n k }. we computed the probability, p(n 1 , n 2 , . . . , n k |a). the node with maximum probability, i.e., p(n 1 , n 2 , . . . , n k |a) was selected. the probability p of selecting an individual relay node of the given node a could be computed individually for each node, as shown respectively for each node in equation (4). where p(na k |a) denotes the probability of node a to node k. furthermore, the probability computation of node a to na 1 is such that na 1 is represented by the corresponding characteristics x 1 , x 2 , . . . , xn, which means to find the probability to select the relay node na 1 , given that feature x 1 , na 1 given that feature x 2 , na 1 given that feature x 3 , and so on. the individual probability of relay node selection, given that the node characteristics might be computed by using naïve bayes conditional probability, is shown in equation (5). sensors 2020, 20, 3887 9 of 21 where i = 1, 2, 3, . . . , n and p(xi|a) is called likelihood, p(a) is called the prior probability of the event, and p(xi) is the prior probability of the consequence. the underlying problem is to find the relay node a that has the maximum probability, as shown in equation (6). table 3a-x represent the neighbor sets {na 1 , na 2 , . . . , na k } along with their feature attributes as {x 1 , x 2 , x 3 , . . . , x n } of node a. the working of iop is comprised of two phases, i.e., phase i (forwarder_set_selection) and phase ii (forwarder_node_selection). in phase i, the authors used algorithm 1 for the forwarder set selection. 
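the relations referred to above as equations (4)-(6) are not reproduced in the text; a standard naïve bayes formulation consistent with the surrounding description (likelihood p(x_i|a), prior p(a), evidence p(x_i), and an argmax over candidate relays) might read as follows. this is a hedged reconstruction and the paper's own notation may differ.

```latex
% Hedged reconstruction of the relations referred to as equations (4)-(6).
P(na_k \mid a) = P\big(na_k \mid x_1, x_2, \ldots, x_n\big), \quad k = 1, \ldots, K
  \qquad \text{(cf. eq. (4))}
P(a \mid x_i) = \frac{P(x_i \mid a)\, P(a)}{P(x_i)}, \quad i = 1, \ldots, n
  \qquad \text{(cf. eq. (5))}
a_r = \arg\max_{k}\; P(na_k) \prod_{i=1}^{n} P(x_i \mid na_k)
  \qquad \text{(cf. eq. (6))}
```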
in this step, the information collection task was initiated after the nodes were randomly deployed in the area of interest with specific dimensions. the phase started with the broadcast of a "hello packet", which contained the address and the location of the sending node. if any node received this packet, it sent an acknowledgment to the source and was added to the neighbor list. this process was repeated, but not more than the threshold, to calculate the prr of each node, and the neighbor list was formed using this procedure. from the neighbor list and the value of prr, the forwarder set was extracted. the prerequisite for the working of the second phase was the output of the first phase. the forwarder set generated by algorithm 1 was the set of all nodes that had the potential to forward the data packets. however, all nodes in the set could not be picked for transmission, as this would lead to duplication of packets in the network. to tackle this situation, only one node from the forwarder list should be selected to transmit the packet to the next hop toward the destination. this was accomplished using algorithm 2, which took a forwarder node list as input and selected a single node as the forwarder. algorithm 2 used a machine-learning technique, the naïve bayes classifier, to select the forwarder node intelligently. the proposed method of relay node selection using iop could be understood by considering the example wsn shown in figure 2 and using the naïve bayes algorithm on the generic data available in table 4 , to find the optimal path in terms of energy efficiency and reliability from source node s to destination node d. therefore, by using the proposed naïve bayes classifier method, the probability of selection of a relay node r1, r2, or r3 from source node s was denoted by p(r1, r2, r3|s), which could be calculated using equation (7), with the individual terms obtained by putting the values from table 4 into the above equations.
algorithm 2 (forwarder_node_selection):
declare three float variables x 1 , x 2 , and x 3 to represent the properties of ri, i.e., prr (packet reception ratio), re (residual energy), and d (distance), respectively. for each node ri ∈ fl(s), repeat: compute p(ri|s) // probability of selection of ri given s, i.e., p k = p(r i |s) for i = 1, 2, . . . , n, and assign k←i.
4. compute the probability p(r i |s) by computing the probability of each parameter separately, given s. make an unsorted array of probability values of the n nodes r1, r2, . . . , rn from step 6: for i = 1 to n and k = i, arrprob[ri]←p k // to find the node with maximum probability.
6. select the first node of the array, arrprob[0], as the node with maximum value pmax, i.e., pmax←arrprob[0].
7. go through the rest of the elements of the array, i.e., from the 2nd element to the last (n − 1) element, for i = 1 to n − 1; when the end of the array is reached, the current value of pmax is the greatest value in the array, pmax←arrprob[i].
10. the node ri with the pmax value is selected as the relay node from the forwarder list, as the node with the highest probability. the node with the next highest probability acts as the relay node in case the first selected relay node fails to broadcast.
11. broadcast transmission of the data packet as {ri, coordinates, data}.
12. if destination node d is reached, go to step 15; else, apply algorithm 1 on ri, set s←ri, and go to step 2.
13. end.
output: a potential forwarder node is selected from the list of forwarder nodes.
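the machine-learning step of algorithm 2 can be prototyped as follows; the use of scikit-learn's gaussian naïve bayes, the training history, and the candidate feature values are all assumptions, since the excerpt does not specify how the classifier's likelihoods are estimated.

```python
# Sketch of ML-based forwarder selection in the spirit of IOP: score each
# candidate in the forwarder list with a naive Bayes model over
# (PRR, residual energy, distance) and pick the most probable relay.
# Training data and the use of scikit-learn's GaussianNB are assumptions.
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical history: rows are (prr, residual_energy_J, distance_m); labels
# mark whether that neighbour turned out to be a good relay (1) or not (0).
X_hist = np.array([[0.9, 0.8, 20], [0.4, 0.3, 90], [0.7, 0.6, 40],
                   [0.2, 0.9, 95], [0.8, 0.5, 30], [0.3, 0.2, 80]])
y_hist = np.array([1, 0, 1, 0, 1, 0])

model = GaussianNB().fit(X_hist, y_hist)

def select_forwarder(forwarder_list: dict[str, tuple[float, float, float]]) -> str:
    """Return the node id with the highest P(good relay | prr, energy, distance)."""
    names = list(forwarder_list)
    feats = np.array([forwarder_list[n] for n in names])
    p_good = model.predict_proba(feats)[:, 1]
    return names[int(np.argmax(p_good))]

# Candidates R1-R3 with illustrative (prr, residual energy, distance) values.
print(select_forwarder({"R1": (0.5, 0.6, 60), "R2": (0.8, 0.7, 35), "R3": (0.6, 0.4, 70)}))
```

with the probabilities reported in the worked example (0.001, 0.002, and 0.001 for r1, r2, and r3), the same argmax step selects r2, consistent with the route s→r2→r5→d derived below.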
again, putting the values from table 4 into the above equations and finally using the proposed method of relay node selection with the naïve bayes algorithm, we could compute the probability p(r1, r2, r3|s) using equation (26): p(r1, r2, r3|s) = max(p(r1|s), p(r2|s), p(r3|s)) = max(0.001, 0.002, 0.001) = 0.002. thus, node r2 would be selected as the relay node from the forwarder list of r1, r2, and r3 for source node s. similarly, the process was followed again for the neighbors of s, which consequently would check the neighbors of r1, r2, and r3. tables 5-7 describe the features of the neighboring nodes of r1, r2, and r3, respectively (table 7, for instance, includes node r5, r20005, at location (49,79) with prr 0.6, re 0.7 j, and d 11 m, leading to d). after the execution of phase i and phase ii on the above example, the final route for the onward transmission of the data packet from source node s to destination node d was intelligently selected using the naïve bayes algorithm, as shown in figure 3 . figure 3 gives the details of the route selected using iop. the source node s broadcasts the data packet among its neighboring nodes, using algorithm 1 to create a forwarder list. the nodes r1, r2, and r3 in the figure were selected as the nodes in the forwarder list. these were the potential nodes that would be used for the selection of a potential forwarder node. here, r2 was selected as the potential node using algorithm 2. the same procedure was adopted again until the data reached its final destination. the final route selected intelligently using iop is s→r2→r5→d. with the goal of examination and comparison of the proposed or protocol, the simulation was performed in matlab. the simulation used the environment provided by matlab to simulate computer networks and other networks. matlab provides a good scenario to design a network of sensor nodes and also to define a sensor node and its characteristics. the simulation results were compared with the results of the eeor [25] and the mdor [26] protocols. table 8 below shows the parameter settings of the network.
the nodes are deployed in such a way that these can approximately cover the whole application area. the base station position is 250 × 250 m in the field. the field area was considered a physical world environment. the proposed or protocol started working immediately after the deployment process was complete. figure 4 below represents the unplanned deployment of the nodes in the area of consideration. the motes are haphazardly deployed in 500 × 500 m field. the nodes are deployed in such a way that these can approximately cover the whole application area. the base station position is 250 × 250 m in the field. the field area was considered a physical world environment. the proposed or protocol started working immediately after the deployment process was complete. figure 4 below represents the unplanned deployment of the nodes in the area of consideration. energy efficiency was the main objective of the proposed algorithm. it could be calculated as the overall energy consumption in the network for the accomplishment of diverse network operations. in matlab, the simulation worked based on simulation rounds. the simulation round was termed as packets transmission from a single source to a single destination. in matlab, when the simulation starts, a random source is chosen to start transmission and this node makes a forwarder list and starts executing the proposed protocol. one round of simulation represents successful or unsuccessful transmissions of packets from one source in the network. for each round, different source and relay nodes are selected. this process continues until at least one node is out of its energy. the energy efficiency was calculated as the total energy consumption after each round in the network. after the operation of the network starts, the sensor's energy starts decaying. this energy reduction was due to network operations like setting up the network, transmission, reception, and acknowledging the data packets, processing of data, and sensing of data. as the nodes decayed, their energy consumption kept increasing per round, as can be seen in figure 5 below. it can be seen in the figure that energy consumption for the proposed or protocol was less, as compared to the other two algorithms. this was because the proposed or protocol distributed energy consumption equally to energy efficiency was the main objective of the proposed algorithm. it could be calculated as the overall energy consumption in the network for the accomplishment of diverse network operations. in matlab, the simulation worked based on simulation rounds. the simulation round was termed as packets transmission from a single source to a single destination. in matlab, when the simulation starts, a random source is chosen to start transmission and this node makes a forwarder list and starts executing the proposed protocol. one round of simulation represents successful or unsuccessful transmissions of packets from one source in the network. for each round, different source and relay nodes are selected. this process continues until at least one node is out of its energy. the energy efficiency was calculated as the total energy consumption after each round in the network. after the operation of the network starts, the sensor's energy starts decaying. this energy reduction was due to network operations like setting up the network, transmission, reception, and acknowledging the data packets, processing of data, and sensing of data. 
as the nodes decayed, their energy consumption kept increasing per round, as can be seen in figure 5 below. it can be seen in the figure that the energy consumption of the proposed or protocol was lower than that of the other two algorithms. this was because the proposed or protocol distributed energy consumption equally to all nodes, so that every node could survive up to its maximum lifetime. hence, the proposed or protocol was more energy-efficient than mdor and eeor. latency can be measured as the time elapsed between sending the packet and receiving the same at the base station. this is also called the end-to-end delay for the packets to reach the destination. the communication in wireless sensor networks is always from the source nodes to the sink station. in the random deployment of nodes, some nodes are able to communicate directly with the base station, while some nodes follow multi-hop communication, i.e., source nodes have to go through relay nodes to forward the data packet toward the base station. hence, in some cases the network delay can be very low and in some cases it can be high. hence, in figure 6 the values of end-to-end delay after each communication in each round are plotted. it can be seen that the proposed or protocol has a good latency, as compared to the other two protocols. the throughput of a network can be measured in different ways. here, throughput is calculated as the average number of packets received successfully at the base station per second in each round. figure 7 represents the throughput for each round. the proposed or protocol has good throughput, as compared to the other two. as the proposed or protocol is efficient in energy consumption, the sensor nodes are able to survive and communicate for a long time in the network. as long as the communication goes on, the base station would continue to receive the packets.
network lifetime for wireless sensor networks is dependent upon the energy consumption in the network. when the energy of the network is 100 percent, the network lifetime would also be 100 percent. however, as the nodes start operating in the network, the network lifespan would start to reduce. figure 8 represents the percentage of lifetime remaining after each round of simulation. the proposed or protocol has a good network lifetime due to the lower energy consumption in the network. packet loss refers to the number of packets that are not received at the destination. to calculate the number of packets lost during each round of the simulation, packet sequence numbers are used. whenever a source tries to send packets to a destination, it inserts a sequence number. later, on packet reception, these packet sequence numbers are checked for continuity. if a certain sequence number is missing, then it is counted as packet loss. the packet loss recorded per round of simulation is presented in figure 9 . it can be seen from the figure that the packet loss for the proposed protocol is less than that of eeor and mdor. this is because the forwarder node selection algorithm runs on each relay and source node. this algorithm calculates the probability of successful transmission through a neighbor node. this also increases the reliability of the protocol and provides accurate transmissions.
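the metrics reported in this section are simple aggregations over per-round logs; the sketch below shows how they might be computed from a hypothetical event log, whose format and values are assumptions.

```python
# Computing the reported metrics from a hypothetical per-packet event log.
# Each record: (round, sent_time_s, recv_time_s or None, seq_no); the format
# and the sample values are assumptions for illustration only.
log = [
    (1, 0.00, 0.12, 1), (1, 0.20, 0.31, 2), (1, 0.40, None, 3),
    (2, 1.00, 1.09, 4), (2, 1.20, 1.33, 5),
]

delivered = [r for r in log if r[2] is not None]
latency_s   = sum(r[2] - r[1] for r in delivered) / len(delivered)   # mean end-to-end delay
duration_s  = max(r[2] or r[1] for r in log) - min(r[1] for r in log)
throughput  = len(delivered) / duration_s                            # packets received per second
packet_loss = len(log) - len(delivered)                              # missing sequence numbers

initial_energy, remaining_energy = 50.0, 37.5                        # joules (assumed)
lifetime_pct = 100.0 * remaining_energy / initial_energy             # % lifetime remaining

print(latency_s, throughput, packet_loss, lifetime_pct)
```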
a significant improvement could be seen in the graphs after the simulation was complete. figure 5 shows the total energy consumption after each round of packet transmission; here, a round is defined as the packet transmissions between a single source and destination. mdor showed the highest energy consumption, followed by eeor and the proposed protocol. this was because mdor wasted more energy in the initial setup. however, the dynamic energy consumption considerations allowed the network to survive for a long time, as shown in figure 8. in the case of eeor in figure 5, it consumed less energy in transmission, and the initial setup for opportunistic selection of relay nodes was based on the power level. however, when it comes to lifetime, eeor failed to perform better, as it considered the network dead when any one of its nodes ran out of energy. eeor chose one node as a source and continued transmissions opportunistically, which resulted in a significant reduction in the power level of a single node. the proposed protocol gave the best results because, in each round, the source node used the intelligent model to change the next-hop relay node. figure 6 presents the average end-to-end delay per round generated by the simulation; the proposed protocol worked significantly better because the next-hop selection was based on an intelligent algorithm, which helped to significantly reduce average end-to-end delays. figures 7 and 9 show the reliability and availability performance of all protocols, with the proposed protocol performing significantly better. this suggests that the proposed protocol is a new-generation protocol with potential in many applications of wsn.
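the intelligent next-hop selection referred to above is, according to the paper's conclusion, a naïve bayes classifier over residual energy and distance; the sketch below only illustrates that general idea with scikit-learn's gaussiannb. the features, labels, training data, and neighbour table are invented for illustration and do not reproduce the authors' model or parameters.

```python
# rough illustration of probability-based forwarder selection: a naive Bayes
# classifier over residual energy and distance to the base station scores each
# neighbour, and the neighbour with the highest probability of being a "good
# forwarder" is picked. all data below are invented for illustration only.
from sklearn.naive_bayes import GaussianNB

# hypothetical training set: [residual_energy (J), distance_to_sink (m)]
X_train = [[0.9, 20], [0.8, 35], [0.7, 50], [0.3, 25], [0.2, 60], [0.1, 40]]
y_train = [1, 1, 1, 0, 0, 0]          # 1 = turned out to be a good forwarder

model = GaussianNB().fit(X_train, y_train)

def pick_forwarder(neighbours):
    """return the neighbour id with the highest predicted probability of
    forwarding successfully. `neighbours` maps id -> (energy, distance)."""
    probs = {nid: model.predict_proba([list(feat)])[0][1]
             for nid, feat in neighbours.items()}
    return max(probs, key=probs.get)

print(pick_forwarder({"n1": (0.85, 30), "n2": (0.25, 28), "n3": (0.6, 55)}))
```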
in recent years, wsn has seen its applications grow exponentially with the integration of iot, which has given a new purpose to the overall utility of data acquisition and transmission. with the integration of wsn and iot, the iot is making a big impact in diverse areas of life, e.g., e-healthcare, smart farming, traffic monitoring and regulation, weather forecasting, automobiles, and smart cities. all these applications depend heavily on the availability of real-time, accurate data. healthcare with iot is one such area that involves critical decision making [31] [32] [33]. the proposed approach makes use of intelligent routing and, therefore, would help in making reliable and accurate delivery of data to the integrated healthcare infrastructure, for proper care of patients. the proposed framework for e-healthcare is shown in figure 10. as the proposed algorithm saves energy, sensor-enabled healthcare devices can work for longer durations, and easy deployment and data analysis are possible due to iot integration [34] [35] [36] [37] [38]. according to the proposed architecture, there can be many different kinds of sensor nodes, such as smart wearables and sensors collecting health data like temperature, heartbeat, the number of steps taken every day, and sleep patterns. these factors correlate with different existing diseases. the benefit of integrating iot and wsn is that data are collected with the help of sensors and stored in the cloud through iot integration. once the health data are stored, the health-record cloud may belong to a specific hospital or to a public-domain cloud. these cloud data can be accessed by healthcare professionals in different ways, to analyze the data and also to provide feedback to a specific patient or group of patients.
in the recent epidemic of covid-19, telemedicine became one of the most popular uses of this platform. doctors also started offering e-consultations to patients and accessing their health records through the patients' smart wearables. still, there are many challenges, and a lot of improvement is required. the proposed work adds to the energy efficiency of sensors, so that they can work for longer durations; these sensor data can then be integrated using iot and the cloud, as per the proposed approach shown in figure 10. in this paper, we proposed a new routing protocol (iop) for intelligently selecting the potential relay node using a naïve bayes classifier to achieve energy efficiency and reliability among sensor nodes. residual energy and distance were used to find the probability of a node becoming a next-hop forwarder. simulation results showed that the proposed iop improved the network lifetime, stability, and throughput of the sensor networks. the proposed protocol ensured that nodes far away from the base station become relay nodes only when they have sufficient energy for performing this duty. additionally, a node in the middle of the source and destination has the highest probability of becoming a forwarder in a round. the simulation results showed that the proposed or scheme was better than mdor and eeor in energy efficiency and network lifetime. future work will examine the possibility of ensuring secure data transmission intelligently over the network. the authors declare no conflict of interest.
references:
an overview of evaluation metrics for routing protocols in wireless sensor networks
comparative study of opportunistic routing in wireless sensor networks
opportunistic routing protocols in wireless sensor networks
towards green computing for internet of things: energy oriented path and message scheduling approach. sustain
toward energy-oriented optimization for green communication in sensor enabled iot environments
greedi: an energy efficient routing algorithm for big data on cloud. ad hoc netw
an investigation on energy saving practices for 2020 and beyond
opportunistic routing-a review and the challenges ahead
a revised review on opportunistic routing protocol
geographic random forwarding (geraf) for ad hoc and sensor networks: multihop performance
opportunistic multi-hop routing for wireless networks
optimal forwarder list selection in opportunistic routing
simple, practical, and effective opportunistic routing for short-haul multi-hop wireless networks
spectrum aware opportunistic routing in cognitive radio networks
energy-efficient opportunistic routing in wireless sensor networks
a trusted opportunistic routing algorithm for vanet
a novel socially-aware opportunistic routing algorithm in mobile social networks
opportunistic routing algorithm for relay node selection in wireless sensor networks
economy: a duplicate free opportunistic routing
mobile-aware service function chain migration in cloud-fog computing
service function chain orchestration across multiple domains: a full mesh aggregation approach
online learning offloading framework for heterogeneous mobile edge computing system
virtualization in wireless sensor networks: fault tolerant embedding for internet of things
traffic priority aware medium access control protocol for wireless body area network
an energy efficient opportunistic routing metric for wireless sensor networks
middle position dynamic energy opportunistic routing for wireless sensor networks
an intelligent opportunistic routing protocol for big data in wsns
recent advances in energy-efficient routing protocols for wireless sensor networks: a review
radio link quality estimation in wireless sensor networks: a survey
futuristic trends in network and communication technologies
futuristic trends in networks and computing technologies (communications in computer and information)
handbook of wireless sensor networks: issues and challenges in current scenario's (lecture notes in networks and systems 121, proceedings of icric 2019)
introduction on wireless sensor networks issues and challenges in current era. in handbook of wireless sensor networks: issues and challenges in current scenario's
congestion control for named data networking-based wireless ad hoc network
deployment and coverage in wireless sensor networks: a perspective
key: cord-332313-9m2iozj3 authors: yang, hyeonchae; jung, woo-sung title: structural efficiency to manipulate public research institution networks date: 2016-01-13 journal: technol forecast soc change doi: 10.1016/j.techfore.2015.12.012 sha: doc_id: 332313 cord_uid: 9m2iozj3
with the rising use of network analysis in the public sector, researchers have recently begun paying more attention to the management of entities from a network perspective. however, guiding elements in a network is difficult because of their complex and dynamic states. in a bid to address the issues involved in achieving network-wide outcomes, our work here sheds new light on quantifying structural efficiency to control inter-organizational networks maintained by public research institutions. in doing so, we draw attention to the set of subordinates suitable as change initiators to influence the entire research profiles of subordinates from three major public research institutions: the government-funded research institutes (gris) in korea, the max-planck-gesellschaft (mpg) in germany, and the national laboratories (nls) in the united states.
building networks on research similarities in portfolios, we investigate these networks with respect to their structural efficiency and topological properties. according to our estimation, fewer than 30% of nodes are sufficient to initiate a cascade of changes throughout the network across institutions. the subunits that drive the network exhibit an inclination neither toward retaining a large number of connections nor toward having a long academic history. our findings suggest that this structural efficiency indicator helps assess structural development or improvement plans for networks inside a multiunit public research institution. public research is more inclined to disseminate its findings than to commercialize them, in contrast to industrial research (geffen and judd, 2004). in general, institutes conducting public research are largely government funded and target the public domain (bozeman, 1987). because of their national orientation and stable funding sources, public research institutes do cutting-edge research in at least one academic field through long-term plans (greater than three years) (bozeman, 1987). a public research institution often develops as an association of research institutes rather than a single organization. research entities within a public research institution enjoy institutional autonomy in their choice of subjects notwithstanding the fact that they are under the same umbrella of governance. naturally, research organizations have different characteristics depending on national circumstances. some public research institutions, such as the max planck gesellschaft (mpg) in germany, are faithful to pure research (philipps, 2013), while others have significance within a particular national context: part of the national laboratories (nls) in the united states (us) addresses defense-related technologies (jaffe and lerner, 2001), and the government-funded research institutes (gris) in korea attempt to assist in the country's economic development by promoting indigenous public research (mazzoleni and nelson, 2005; arnold, 1988; lee, 2013). with recent advances in our understanding of networks, it is possible to apply novel network knowledge to manage public research institutions in response to internal and external changes. for example, entities in national innovation systems (freeman, 2004) or the triple helix model (phillips, 2014; leydesdorff, 2003) can be external factors affecting the research of public research institutions. the notion of national innovation systems provides a framework to explain underlying incentive structures for technological development at a national level and international differences in competence from a network perspective of public and private organizations (patel and pavitt, 1994). the triple helix model considers coevolving academia, industry, and government, which together drive the techno-economic development of a country (leydesdorff et al., 2013). in these systems, public research institutes provide fiscal and technical assistance to other organizations. kondo (2011) pointed out that public research institutes are dedicated to transferring technologies to industry by means of consulting, licensing, and spinning off. by doing so, they contribute to promoting integration and coordination within the system (provan and milward, 1995). in order to formulate policies and procedures that steer the entire system, system organizers must be able to guide public research institutes properly.
in this context, control of those key agencies is important to achieving desirable outcomes. moreover, there is a growing need for efficient implementation throughout public research institutions composed of multiple sub-organizations in order to deal with internal controls (yang and jung, 2014). for example, most public research institutions have undergone transformations in recent years due to modernization, imperatives for efficiency, and the promotion of collaboration with industry (buenstorf, 2009; cohen et al., 2002; simpson, 2004; senker, 2001). in unfavorable economic conditions, declining government funding causes the restructuring of research areas (malakoff, 2013; izsak et al., 2013), or the government demands more practical outputs from them, such as conducting applied research and setting standards (oecd, 2011). in an attempt to harness technology for socio-economic development, governments often prioritize future research through foresight activities (priedhorsky and hill, 2006) and accordingly assign new academic missions to public research institutions. in particular, developing countries have lately been paying more attention to the technology-driven development model under government supervision (arnold, 1988). in such cases, controlling every entity would enable the institution to fully guide those internal changes but entails great expense. from 1935 to 1945, public research institutions engaged in national strategic areas, including the exploration of mineral resources, industrial development, and military research and development (r&d) (oecd, 2011). after the end of world war ii, the establishment of public research institutions grew in an effort to advance military technology in many countries. moreover, at that time, public research institutions extended to almost all areas with which governments were associated, such as economic and social issues. they continued growing until the 1960s. in the 1970s and 1980s, many countries expressed doubts about their contributions to innovation. however, as understanding of national innovation systems and the triple helix model deepened, public research institutions started to be seen in a new light. in these models, public research institutions have played an indispensable role in preventing systemic failures, which reduce the overall efficiency of r&d (lundvall, 2007; sharif, 2006), due to their relations with external collaborators (klijn and koppenjan, 2000; mcguire, 2002). still, the importance of public research institutions is emphasized in particular for scientific innovation (cabanelas et al., 2014). in this regard, a network approach is necessary to efficiently implement transformations throughout sub-organizations, and academic interest in the effective operation of such networks is also growing (cabanelas et al., 2014; jiang, 2014). there is, however, a lack of empirical research on managing public research institutions through a network system. hence, in this paper, we conceptualize three major public research institutions (the mpg, the nls in the us, and the gris in korea) as networks, identify the sub-organizational network structure of each, and examine its structural efficiency. a collaborative research network is one of the most prevalent inter-organizational configurations (shapiro, 2015). however, we deem topical similarity between research institutes suitable to represent a relation between them in terms of research interests.
most transformations involve changes in research areas, and changes in organizational research topics frequently occur when governments prioritize specific research fields or delegate new roles to institutes (wang and hicks, 2013). prior studies have also emphasized the importance of similarity in knowledge content among entities for effectively managing inter-organizational networks (tsai, 2001; hansen, 2002). for these reasons, a network here is formed by pairs of subunits having the most similar research profiles. with the addition of temporal dynamics to inter-organizational relations, a chain of networks over time allows the description of the structural evolution of public research institutions. based on the revealed networks, we determined the structural efficiency with which network-wide actions can influence entities within finite time periods. no matter what measure is put in place, all members of the network need to adopt it to achieve collective action. in the early stages of change implementation, network organizers select change initiators among the entities. as the change initiators propagate control actions to the remaining entities, a public research network can be steered in the desired direction like a car. we can derive a minimum number of suitable initiators from the theory of "structural controllability" (yuan et al., 2013). in this theory, change initiators refer to the injection points of external energy used to steer the network, which are theoretically selected depending on network structure. in this process, structural efficiency is obtained by calculating the share of change initiators in the network: the lower the efficiency value, the smaller the number of entities the network manager is required to handle. therefore, by comparing efficiencies with structural properties over time, we can estimate network characteristics specific to institutions. in this study, we divided institutional research portfolios into six time periods based on scientific output over eighteen years (1995-2012), and estimated the structural efficiencies of the research similarity networks. considering structural efficiency, we observe that the networks in all three research institutions can be managed with less than 30% of sub-organizations, and the values reflect the changes that have occurred in the research institutions. each research institution has some sub-organizations consistently selected as suitable change initiators over a period of time. our results primarily highlighted young subordinates as appropriate change initiators, which means that information blockades in the network might occur unless the selected units are properly managed. moreover, the estimated change initiators tend to have lower connectivity in the network than the rest of the nodes. we expect that our work has implications for decision-making bodies and network managers seeking an efficient way to exert their influence on a network of public research institutes. the remainder of this paper is structured as follows: in section 2, we briefly describe the impact of structure on network effectiveness associated with public research institutions based on past research. section 3 is devoted to an explanation of data sources, network construction processes, and the calculation of structural controllability in a network. we discuss the results of our experiments in sections 4 and 5, and offer our conclusions in section 6.
methods for the utilization and development of networks have grown in an attempt to address complex problems that require collective effort. when the purpose of the network is to deliver public services, independent organizations are generally involved in the process, and interdependency between participants facilitates the formation of links (kickert et al., 1997). by exchanging knowledge through a network, public research organizations attain a higher level of performance and, at the same time, create a greater ability to innovate (morillo et al., 2013). goldsmith and eggers (2004) claimed that using networks as a vehicle is favorable to organizations that require flexibility, rapidly changing technology, and diverse skills because actors can exchange goals, information, and resources while interacting with each other. resources usually refer to units of transposable value, such as money, materials, and customers, and information signifies exchangeable units between agencies, such as reports, discussions, and meetings. with regard to goods exchanged between organizations, van de ven (1976) underlined the importance of information and resources as "the basic elements of activity in organized forms of behavior." in research systems, organizations can take advantage of network participation to improve their chances of funding, to broaden their research spectrum, or to reduce the risk of failure (beaver, 2001). therefore, networks are beneficial because they can pool resources, permit the mutual exploration of opportunities, and create new knowledge (priedhorsky and hill, 2006). however, strategies are needed to coordinate interactions while managing networks because different actors have different goals and preferences concerning a given problem (kickert et al., 1997; o'mahony and ferraro, 2007). the capability of network management is also necessary to promote innovation (pittaway et al., 2004), but there remain questions as to how to manage such organizational interactions, as beaver (2001) pointed out. orchestrating activities may seem unnecessary given the interactions between autonomous organizations, but addressing conflicts keeps agencies cooperative in the effort to achieve the goal of the network, thereby facilitating the effective allocation and efficient utilization of network resources. furthermore, a network sometimes needs to be intentionally formed to boost management by governing parties, which may be either an external organization or one or more network participants (provan and kenis, 2007). public research institutions can be said to be governed by external organizations, considering that separate entities, such as ministries, research councils, and other steering bodies, are generally in charge of their administration. both the mpg and the korean gris are apparently steered by a single entity. the fundamental management policy of the nls in the us also originates in a federal agency, although several laboratories are operated by contract partners. by frequently repeating interactions among actors, networks produce certain outcomes. the performance of a network is evaluated according to whether the network effectively attains its goal. the outcome varies depending on governing strategies, and the course of attainment can be enhanced by taking advantage of the structural properties of the network (kickert et al., 1997; goldsmith and eggers, 2004).
provan and milward (2001) argued that the assessment of network effectiveness should involve consideration not only of beneficiaries, but also of administrative entities and the participants of the network. nevertheless, the literature on networks has paid more attention to the evaluation of their effectiveness by treating networks as a whole, such that the common goal is primarily involved in network-level accomplishment (provan and milward, 1995; möller and rajala, 2007). there remain difficulties in determining network effectiveness. the problem primarily resides in the impossibility of quantifying the exact network outcome (provan and lemaire, 2012). as agranoff (2006) claimed, networks are not always directly related to policy adjustments because some interactions are forged by voluntary information exchange or educational services. in a public research institution, researchers engaged in specialized fields have the opportunity to share ideas across administrative boundaries, given that they share the goal and intent of generating public knowledge. outcomes of research networks can be approximated by proxy variables, such as patent and paper citations, innovation counts, new product sales, and productivity growth (council, 1997). furthermore, such networks also indirectly affect subsequent movements and policies. thus, network efficiency needs to be measured for various types of networks, by considering factors beyond collaborations. in order to increase network effectiveness, structural efficiency in networks is important: since all entities are connected, damage to one part can cause the collapse of the entire system through a cascade of failures. in this regard, considerable research on networks has focused on deliberately building efficiently manageable networks (cabanelas et al., 2014; kickert et al., 1997; van de ven, 1976; provan and kenis, 2007). certain network structures can affect innovation performance by catalyzing knowledge exchange (valero, 2015). enemark et al. (2014) argued for the importance of network structure to collective action via experimental tests that demonstrated that structural variations in a network can either improve or degrade network outcomes. however, there is ambiguity about which network structures are appropriate for achieving effective control. pittaway et al. (2004) suggested that longitudinal network dynamics need to be taken into account when designing network topologies. a network is required to change its members or structures in order to adapt to environmental changes. much of the literature on networks emphasized that instability is an opportunity for transformation (hicklin, 2004). although the capability of flexible response is one of the strongest features of a network model, such network dynamics make networks challenging to manage effectively. with regard to network size, it is widely known that the greater the number of actors involved, the more difficult it becomes for the network to achieve collective cooperation (kickert et al., 1997). increasing the number of participants results in more complex network governance because the number of potential interactions also escalates exponentially. however, prior research found that research networks evolved to be more centralized as they grew (ferligoj et al., 2015; hanaki et al., 2010). the growth patterns of research networks imply that adding an entity does not always increase the complexity of network management.
theorists have rather claimed that the introduction of a new node can improve the efficiency with which networks are controlled (klijn and koppenjan, 2000). centralization captures the extent of inequality with which important nodes are distributed across the network, and is often measured in terms of freeman's centralities (freeman et al., 1979). a network centralized in terms of degree (the number of connections) is known to readily coordinate across agencies and closely monitor services (provan and milward, 1995). in complex networks, a minority of nodes, referred to as hubs, dominates connections while the majority is connected to only a small number of points (barabási and albert, 1999). research revealed that complex networks were robust against random attacks (albert et al., 2000). hubs in research networks were not only empirically impressive in their performance (echols and tsai, 2005; dhanarag and parkhe, 2006), but also found it easy to access new knowledge developed by other entities (tsai, 2001). hanaki, nakajima and ogura (2010) also found that r&d collaboration networks evolved toward more centralized structures because organizations prefer to collaborate with reliable partners based on referrals obtained from former partners. however, a high degree of integration is not always desirable. provan and lemaire (2012) proposed that connective intensity between organizations should be appropriately controlled for an effective network structure. cabanelas et al. (2014) also found that research networks producing high performance featured nodes with low degree centrality. no matter what types of networks develop out of interactions, goal achievement is possible only when the relevant information spreads throughout the network to encourage actors to conform. in recent years, for public research institutions, the controllability of organizational portfolios has been seen as constitutive of dynamic capabilities, which means the "ability to integrate, build, and reconfigure internal and external competencies to address rapidly changing environments" (teece et al., 1997; floricel and ibanescu, 2008). in this sense, estimating the effort to control entities of public research institutions is related to assessing the feasibility of research reorganization over networks. at the same time, the number of key points in information flow within a network affects the burden on network administration. although earlier work emphasized that selectively activating critical actors is more effective for integration than full activation, the system must secure the capability to exercise influence across agencies (kickert et al., 1997; provan and lemaire, 2012). furthermore, the efficiency with which a network structure can be manipulated would be a suitable criterion to evaluate the built structure. this section is devoted to describing the methods of network construction based on collected bibliographies and the analytical methods. we describe a method to quantify the structural efficiency of controlling the whole network given its structure, and explain the structural properties used to explore their relation with structural efficiency. in the process of the efficiency calculation, we extract organizations suitable to initiate transformation. this investigation was conducted in the r ver. 3.1.2 environment (r core team, 2015), and used the following add-on packages for convenience: ggplot2 (wickham, 2009) and igraph (csardi and nepusz, 2006).
we identified research portfolios based on scientific output, and gathered bibliographic data regarding the nls, mpg, and gris from the thomson reuters web of knowledge. academic output over eighteen years (1995-2012) was compiled according to institutional names and abbreviations in authors' affiliations. we only used affiliations in english for this study. subordinate research institutes listed on official websites were considered, and their portfolios were tracked using at least twenty papers each. all disciplines, which are the constituent elements of a portfolio, need to be identified using the same classification system for ease of institutional comparison. we utilize the university of california-san diego's (ucsd) map of science (borner et al., 2012) as a journal-level classification system. the map classifies documents into 554 sub-disciplines belonging to 13 disciplines on the basis of journal titles. naturally, a research portfolio has two levels of classification: discipline and sub-discipline. in particular, a discipline refers to the aggregate level of sub-disciplines in the hierarchical structure in this study. fig. 1 shows an example of disciplinary mapping using sci2 (sci2 team, 2009). in order to analyze the thematic evolution of the networks over time, we split the portfolios into time intervals. with regard to an adequate assessment period to represent the scientific output being measured, abramo et al. (2012) claimed that a three-year period is adequate. following their recommendation, we observed the development of institutional portfolios over six consecutive time slices. as a well-known analytical method, a complex network is suitable for exploring dynamic topology changes (strogatz, 2001). here, an inter-organizational network is formed between subordinate institutes with similar research profiles. nodes represent sub-organizations and are connected by a link when two sub-organizations have similar research portfolios. in order to measure similarities, we used "inverse frequency factors" as the weighting system and "second-order cosine similarities" (garcía et al., 2012). the inverse frequency factor borrows from a term discrimination method for text retrieval (salton and yang, 1973; salton and buckley, 1988). the factors weight each sub-discipline in the research portfolio. the weight of sub-discipline $m$ for research institute $i$ is determined by $w_{m,i} = f_{m,i} \times \log(N/N_m)$, where $f_{m,i}$ denotes the number of articles and $\log(N/N_m)$ is the inverse frequency factor used to filter out prevalent research (jones, 1972). the logarithmic frequency factor is calculated inversely from the ratio of the number of subunits $N_m$ that publish their achievements in sub-discipline $m$ to the total number $N$ of research institutes. as a result, the set of weights generates a 554 sub-disciplines-by-institute matrix. the similarities between two institutional research portfolios are primarily based on the cosine measure (salton and mcgill, 1986; baeza-yates and ribeiro-neto, 1999). for the purpose of improving the accuracy of the similarity, we applied second-order approaches to the sub-discipline-by-institute matrix. colliander and ahlgren (2012) explained that first-order approaches directly reflect the similarity between only two profiles, whereas second-order similarities determine those between two given portfolios and other institutional portfolios.
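a minimal python sketch of the weighting and similarity steps described above is given below, under simplifying assumptions: the article-count matrix is invented, and the second-order step is implemented simply as the cosine between rows of the first-order similarity matrix, which is one plausible reading of the cited approach rather than a verified reproduction of it.

```python
# sketch of inverse-frequency weighting, first-order cosine similarity between
# weighted portfolios, and a second-order similarity between institutes.
# the tiny counts matrix below is invented for illustration.
import numpy as np

# counts[m, i] = number of articles by institute i in sub-discipline m
counts = np.array([[10, 8, 0],
                   [ 0, 2, 9],
                   [ 5, 4, 6]], dtype=float)

n_institutes = counts.shape[1]
n_m = (counts > 0).sum(axis=1)                            # institutes active in m
weights = counts * np.log(n_institutes / n_m)[:, None]    # w_{m,i} = f_{m,i} * log(N/N_m)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# first-order similarity matrix between institutes (columns of `weights`)
first_order = np.array([[cosine(weights[:, i], weights[:, j])
                         for j in range(n_institutes)]
                        for i in range(n_institutes)])

# second-order similarity between institutes 0 and 1: compare their whole
# similarity profiles rather than their portfolios directly
print(cosine(first_order[0], first_order[1]))
```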
a large number of studies have confirmed the superior performance of the second-order approach as well (ahlgren and colliander, 2009; thijs et al., 2012). moreover, to make structural analysis and network visualization easier, we strip weak similarities from the research similarity matrix. using the maximum spanning tree (mst) algorithm (kruskal, 1956), we extracted tree-like structures. the mst algorithm ensures that all institutes are connected with maximal similarity, which implies that the institutes are connected through the most relevant links. therefore, a linked pair of institutes indicates greater potential for common intellectual foundations. among the various well-known algorithms to detect an mst, the backbone of the thematic networks is derived by prim's algorithm (prim, 1957). in order for the network structure to efficiently elicit the desired response from its elements, a certain amount of energy needs to be injected into the network to change the behavior of actors. thus, the selection of several agencies to initiate changes, depending on network structure, is inevitable. at the same time, it is important to minimize the number of injection points due to management cost. studies on complex networks consider that nodes can dynamically make decisions or change their states by responding to information received through links between nodes. as individual actors, nodes in research networks can be researchers or research institutions, and the nodal states can be represented by individual research interests or disciplinary composition. here, we estimate the capability to control the behavior of such nodes in complex networks with minimum intervention, adopting the notion of structural controllability. in recent years, a number of studies have focused on driving networks to a predefined state by combining control theory and network science (liu et al., 2011; wang et al., 2012; lombardi and hörnquist, 2007; gu et al., 2014). according to network controllability, if a network system is controllable by imposing external signals on a subset of its nodes, called driver nodes, the system can be effectively driven from any initial state to the desired final state in finite time (kalman, 1963; lin, 1974). thus, network controllability depends on the number and the placement of the control inputs. for this reason, structural efficiency refers to the share of the driver nodes. in this study, the agencies found using structural controllability are the key locations to steer the entire inter-organizational research network. we applied the structural controllability for undirected networks, introduced by yuan et al. (2013), to the matrix representation of our temporal msts. each temporal network $G(A)$ was considered a linear time-invariant model $\dot{x}(t) = Ax(t)$, where the vector $x(t) \in \mathbb{R}^N$ represents the state of the nodes at time $t$, and $A \in \mathbb{R}^{N \times N}$ denotes the research similarity matrix of the mst, such that the value $a_{ij}$ is the portfolio similarity between institutes $i$ and $j$ ($a_{ij} = a_{ji}$). the controlled network $G(A, B)$ corresponds to adding $M$ controllers via the ordinary differential equations $\dot{x}(t) = Ax(t) + Bu(t)$, where the vector $u(t) \in \mathbb{R}^M$ is the controller and $B \in \mathbb{R}^{N \times M}$ is the control matrix. the problem of finding the driver nodes of the system is solved by the exact controllability theory following the popov-belevitch-hautus (pbh) rank condition (hautus, 1969).
to ensure complete control, the control matrix $B$ should satisfy $\mathrm{rank}[\lambda_M I_N - A, B] = N$, where $I_N$ is the identity matrix of dimension $N$ and $\lambda_M$ denotes the eigenvalue with the maximum geometric multiplicity $\mu(\lambda_l) = N - \mathrm{rank}(\lambda_l I_N - A)$ over the distinct eigenvalues $\lambda_l$ of $A$. therefore, from a theoretical perspective, changes initiated from the drivers are likely to affect the entire structure. hence, driver institutes are crucial to the functioning of networks of public research institutes. in this paper, we regard the share of drivers among all agencies as an efficiency indicator, in that the number of drivers is important for efficient control. network properties have been utilized by a considerable amount of literature in the area to better understand the structural features of networks (newman, 2003; albert and barabasi, 2002; woo-young and park, 2012). in order to understand the relation between efficiency and the inter-organizational research network, we extracted major features across institutions based on some structural properties, such as network size and connectivity. the number of participants represents the network size associated with network volume. centrality is one of the most studied indicators in network analysis, and we measure the influence of a node in a network using degree centrality (freeman et al., 1979; borgatti et al., 2009; freeman, 1978). we examine the degree of driver nodes. as a nodal attribute, we assign research experience in time periods to nodes to characterize the driver nodes. this section contains the major results of our investigation of the structural features of the inter-organizational networks. to form our desired skeletal network, we extracted pairs of academically close institutes based on portfolio similarities among their participants using the construction algorithm of the maximum spanning tree (mst). the results obtained from the backbone networks are related to structural controllability. in order to address the evolution of inter-organizational research, we assessed the structural features of the temporal msts. figs. 2-4 show the tree-like structures of the institutions over time. each node represents a sub-organization, and its size is proportional to the total number of documents published. the colors filling the nodes were determined by the discipline in which the institute was found to be most productive. the portfolio similarities between pairs of linked institutes represented the weights on the network, and these weights affected the width of the links as well. the descriptive statistics of portfolio similarity summarize the distribution of the skeletal relationships between subordinates, as listed in table 1. for all institutions, we found that the distributions were biased toward high similarities between research portfolios. for the nls networks in the us, the overall greater averages and smaller standard deviations of portfolio similarities compared with the other two institutions indicated that most research units were connected with the smallest differences in their research areas. on the other hand, in the case of the gris, the lowest values of average similarity signified that each unit had a distinct research portfolio. the largest standard deviations and the low values of kurtosis for most time periods also showed that their research similarities were the most widely distributed. in order to represent the dynamic characteristics of the msts, their structural properties are listed in table 2.
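to make the calculation concrete, the sketch below builds an mst backbone from a toy similarity matrix and then computes the minimum number of drivers from the maximum geometric multiplicity, as in the exact controllability condition above. the similarity values, the numerical tolerance, and the use of networkx (whose maximum_spanning_tree uses kruskal by default, which yields the same tree as prim's algorithm when weights are distinct) are illustrative choices, not the authors' implementation (which was carried out in r).

```python
# sketch: mst backbone + exact controllability for a symmetric matrix A.
# minimum driver count = max over distinct eigenvalues of
# mu(lambda) = N - rank(lambda*I - A); structural efficiency = N_D / N.
import numpy as np
import networkx as nx

# toy second-order similarity matrix for 5 institutes (symmetric, invented)
S = np.array([[0.0, 0.9, 0.4, 0.3, 0.2],
              [0.9, 0.0, 0.8, 0.5, 0.1],
              [0.4, 0.8, 0.0, 0.7, 0.3],
              [0.3, 0.5, 0.7, 0.0, 0.6],
              [0.2, 0.1, 0.3, 0.6, 0.0]])

G = nx.from_numpy_array(S)                               # weighted graph
mst = nx.maximum_spanning_tree(G, weight="weight")       # similarity backbone
A = nx.to_numpy_array(mst, nodelist=sorted(mst.nodes), weight="weight")

def min_driver_count(A, tol=1e-8):
    """minimum number of driver nodes via maximum geometric multiplicity."""
    N = A.shape[0]
    eigvals = np.linalg.eigvalsh(A)
    distinct = []
    for lam in eigvals:                                  # group equal eigenvalues
        if all(abs(lam - d) > tol for d in distinct):
            distinct.append(lam)
    mults = [N - np.linalg.matrix_rank(lam * np.eye(N) - A, tol=tol)
             for lam in distinct]
    return max(mults)

n_d = min_driver_count(A)
print(n_d, n_d / A.shape[0])   # driver count and structural efficiency
```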
the number of nodes $N$ increased and, accordingly, the number of links increased to $N - 1$, following the definition of an mst in the context of a connected network. in spite of the sparsity of the network, nodes having a relatively large number of links could be found in some institutions, in particular the oak ridge national laboratory (ornl) within the nls, which was connected to approximately a quarter of the other organizations for four time periods, and the korea research institute of standards and science (kriss) and the korea institute of science and technology (kist), which appeared as the maximally connected node in each half of the dataset. however, there was stiff competition among institutes for the maximum number of connections in the mpg. we note that the network density ($2/N$) can be obtained from the number of nodes, and the transitivity always reduces to zero because an mst rules out cycles. we calculated the periodic change of structural efficiency, and then examined the relations between network efficiency and structural properties. following this, we investigated the features of the estimated driver nodes in terms of degree and period of appearance. note that although the number of driver nodes is theoretically fixed in a network, there can be multiple sets of drivers (jia and barabasi, 2013). we randomly selected a set where multiple driver sets existed. as an indicator of network efficiency obtained from structural controllability, fig. 5 shows the share of driver nodes over time. according to the graph in the figure, the proportion of drivers varied, but institutions did not have to consider all their agencies for network-wide transformation. less than 30% of nodes were selected as suitable points at which to inject external information in all three institutions, because the maximum value of structural efficiency in the entire dataset was about 30%, at the second period (1998-2000) in the gris. in particular, the nls could be influenced with a relatively small share of drivers at all times, with the exception of the period 2004-2006, whereas in the gris the largest portion of nodes was mostly needed to initiate changes. the efficiency fluctuation of the mpg was more stable than that of the other two institutions over the time periods. an understanding of drivers enables administrators to take preemptive action to prevent information isolation, just as knowledge of the relation between the share of drivers and network efficiency can help plan structural development. the total number of driver appearances for the entire period corresponded to 13, 25, and 53 for the nls, the gris, and the mpg, respectively, but only 6, 15, and 31 distinct agencies were selected as drivers. this was evidence for the existence of memory in the drivers. moreover, figs. 6 and 7 capture some features of the drivers. fig. 6 compares the average number of links between drivers and all nodes over different periods. despite the common knowledge that nodes possessing large connectivity are influential, our results showed that drivers with low connectivity tended to determine collective agreement on the network. fig. 7 shows the average durations of appearance of institutional drivers. based on the average durations, we see that the driver nodes were the ones that had newly entered the network. of the institutions, the research units of the nls showed the widest difference between drivers and non-drivers.
public research has contributed to major innovations by improving competitiveness among existing industries and developing new ones. as prominent contributors to public research, governments have implemented a variety of support policies and programs for higher efficiency and excellence. among the actors involved in public research, public research institutions aim to disseminate their knowledge by providing various functions: priority-driven research to address national and academic agendas, or blue-skies research engaging large-scale research facilities to complement university research (pot and reale, 2000). to maintain such diversity, public research institutions seek to coordinate elements with varying specializations and missions in adapting to dynamic technological environments. as a part of this effort, institutions occasionally attempt to restructure research portfolios or modify organizational placements in relation to other research units. in order to assess the development of public research institutions, we examined in this paper the structural evolution derived from research similarities in the context of networked organizations. more precisely, this study focused on public research institutions composed of several specialized research units, and extracted a network from the similarities between sub-organizational research portfolios over eighteen years. a pair of connected agencies would be most influenced by the same type of exertion on a specific research area. in addition, sub-organizations connected to each other can be potential partners for collaboration because they share similar academic backgrounds. for example, the similarity networks of the gris have implications for the inter-disciplinary research groups operated by the research council; in such research groups, researchers working at different gris jointly seek solutions to technological difficulties, and research similarities can indicate the gris best placed to resolve them. moreover, offering the advantage of predictable network controllability, network modeling helps in understanding the system's entire dynamics, which can be guided in finite time by controlling the initiators (liu et al., 2011). as a result of the modeling, we can measure the efficiency of the network, where network efficiency implies the proportion of elements required as initiators to change the states of all agencies. the lower the proportion, the greater the network efficiency, because the initiators are injection points for external information. we also revealed the structural properties of the estimated initiators. our research here differs from other studies concerning network effectiveness in that it quantitatively estimated the effort required to control an entire inter-organizational network based on its structure. naturally, if we send control signals to every single node, the network is operated with high controllability but at significant cost. thus, by employing the concept of structural controllability, we can theoretically detect the initial spreaders of information that need to be properly treated. otherwise, they would produce barriers to the exertion of authority; in extreme cases, such information blockades could cause network failure (klijn and koppenjan, 2000). however, handling these elements incurs extra cost, because of which it is important to build networks with the minimum possible number of initiators to reduce the enforcement costs incurred for complete control (egerstedt, 2011).
common structural features of the estimated initiators can direct the network management of public research institutions. we generated results to provide a clear idea of how the structural efficiency of a research network is related to structural properties, such as size and nodal degree. previous work on network governance structures has provided recommendations on how to build and design inter-organizational networks for innovation acceleration. for example, related to the number of participants, it is natural to expect that the share of drivers would also increase owing to a higher risk of insularity in information due to increasing structural complexity. however, our findings suggest that this idea does not necessarily hold. each of the institutions considered by us differed in size from the others: the mpg was the largest-scale organization, whereas the nls formed the smallest group in terms of numbers. however, according to our results, the size of the network did not seem to meaningfully affect the proportion of drivers in public research institutions. (table 2 lists the network properties of the inter-organizational networks for the time spans 1995-1997, 1998-2000, 2001-2003, 2004-2006, 2007-2009, and 2010-2012.) despite being a medium-sized institution, the networks of the gris were more likely to be inefficient than those of the mpg and the nls. we think this was because an institution more experienced in managing such a union has built more effective structures. we even found that the gris took advantage of the structural reorganization of the network, because additions improved their network efficiency. in this regard, kickert et al. (1997) claimed that the introduction of new actors can be a strategy to accomplish mutual adjustment, since a new institute would cause structural changes within the network. proposition 1. a subset of nodes positioned in structurally important locations will have the ability to steer a whole network of a public research institution. our findings indicated that control actions applied to fewer than half of the research units can lead to changes in the entire system, and that these units repeatedly appear over time. we suspect the reason is that public research institutions are designed to be cost-effective and resilient, as are their infrastructure networks. however, as national research structures can be affected by government policies (hossain et al., 2011), network efficiency also changes over time. a drastic fluctuation in the share of drivers would be related to changes in the relevant institution's strategy or operation. for example, the gris underwent a restructuring to remove redundancy, and began operating under the research councils after 1999. we can capture drastic changes at the same time in our results, because their structural efficiency significantly increased between the second and third periods, between 1998 and 2003. the results imply that the organizational rearrangements in the gris worked well. besides, the research subjects of the nls were revamped in the 2001-2006 period due to several events, i.e., the september 11 attacks and the outbreak of the severe acute respiratory syndrome (sars). since the terrorist attacks of september 11, 2001, the nls have made greater efforts to reinforce national security by working on nuclear weapons or the intelligent detection of potentially dangerous events. moreover, a sudden epidemic of sars accelerated multidisciplinary research in the nls on vaccines, therapeutics, bioinformatics, and bioterrorism.
we also find that the structural efficiency of the nls was severely affected during the readjustment period. these changes in portfolio composition would cause temporary disarray in the structure of the networks. on the other hand, the property of stable fluctuations in the mpg would be attributable to internal transitions for scientific advances rather than external impact. the mpg expands its research topics mostly by spinning off units, because each unit has its own research area. proposition 2. variations in the structural efficiency of research networks will reflect structural changes in research composition. another difference between past research and our work here is that degree centralization is not invariably recommendable. policy makers and network scientists have hitherto paid attention to highly connected institutes because hubs are regarded as network facilitators. however, our findings indicated that most key elements were apt to have low degrees. our study focused on revealing the injection points that infuse their nearest neighbors with energy regardless of the amount of energy required, and these nodes impart directions to connected neighbors one at a time rather than exerting control forces over all their adjacencies simultaneously. obviously, an energy-entering hub can effectively reach agencies within its orbit, but there is a limited diffusion range. thus, our observations suggest that network-wide influence depends on nodes with low connectivity. in this context, a network with moderately distributed focal points can be more effective in influencing all organizations than a thoroughly concentrated one. furthermore, emergent sub-organizations show a tendency to have a greater effect on structural efficiency than sub-organizations with long research experience. we suspect that this is because a new research institute is often derived from a larger unit in public research institutions, holds a lower research similarity with other units than its parent, and takes a position at the border, beyond any energy ranges. another possibility is that a newly established research institute has an unstable research portfolio, as braam and van den besselaar (2014) pointed out. the instability that a new research institute has in its research areas can increase uncertainty about the consequences of network-wide changes. therefore, network managers may need to monitor the degree of acceptance of a network-wide action, especially among emerging sub-organizations. this result is also consistent with recent observations, whereby driver nodes in real-world networks tend to avoid linking with high-degree nodes (liu et al., 2011). proposition 3. other things being equal, the possibility of controlling the whole research network will increase when control actions work properly at nodes with low connectivity and a short research history. we consider that the differences in network effectiveness between existing studies and our findings originate from whether the complete functioning of all elements was considered. previous studies regarding the maximization of network effectiveness implicitly presupposed the complete performance of all entities a priori despite conflicts between participants, but at the least, network managers need to ensure the complete operation of their network. for the full functioning of a network, all elements are required to be within the sphere of influence of the network manager for network-wide control.
we deal with the possibility of managing network behavior in public research institutions by quantifying the effort required to implement maneuvers. in order to avoid control blockades, we showed the importance of elements with, inter alia, low connectivity and brief academic experience. this study provided theoretical results for structural controllability under some ideal conditions: measures were implemented on a network skeleton without redundant connectivity, sufficient resources were provided to change the network, all institutes respected the administrator's intention, and there were no conflicts between connected institutes. the success or failure of such measures can only be determined once the processes are completely implemented, because the dynamic nature of an inter-organizational network raises difficulties in coordination. nevertheless, estimating the completion of network-wide objectives is still critical to network planning and design. our theoretical calculations can assist decision making for structural improvement plans. moreover, the common features of the selected initiators are sufficient to suggest which elements are significant for attaining a synchronized response across an institutional network when reorganizing a research portfolio. public research institutions continue to gain prominence in the development of national agendas for science and technology. to a greater or lesser extent, institutions have their own strategies according to their values and interests in research trends. governments and research councils significantly affect these institutes through policies, programs, funding, and financial support in an effort to better coordinate their research agencies (rammer, 2006). therefore, guiding the subunits of these institutes in a network is important for the efficient delivery of managerial control. in doing so, administrators should be concerned with improving network structure to enhance its outcomes. however, manipulating network structure is difficult because of the complex and dynamic states of the sub-organizations. in this study, we quantified the structural efficiency with which a set of spontaneous elements can be maneuvered toward network-wide goals by using the theory of structural controllability (yuan et al., 2013), and tracked the efficiency of the networks of these public research institutions: the gris in korea, the nls in the us, and the mpg. for the relevant calculations, we extracted a hidden network structure from each institution based on similarities between the profiles of their subordinate organizations. the resulting structural efficiencies enabled an assessment of the operational strategies of each institution over eighteen years. the elements selected by structural controllability indicate suitable points at which to inject external energy for governing networks. revealing these injection points was important to prevent information blockages that hinder collective action. the greater the number of injection points required, the lower the efficiency of the network, owing to the increased burden of management. our findings indicate that structural efficiencies reflect changes in the research interests of an institution. in this sense, research institutions need to track structural controllability to assess structural changes, such as portfolio adjustments across all of their sub-organizations.
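a minimal sketch of the kind of calculation involved, assuming the exact controllability framework of yuan et al. (2013) cited above: for an undirected network the minimum number of driver nodes equals the largest eigenvalue multiplicity of the adjacency matrix, and one simple proxy for structural efficiency is one minus the share of drivers (the study's own definition of efficiency may differ). the graph in the example is synthetic.

import numpy as np
import networkx as nx

def minimum_driver_nodes(G: nx.Graph, tol: float = 1e-8) -> int:
    # exact controllability (yuan et al., 2013): N_D = max over eigenvalues
    # lambda of N - rank(lambda*I - A), i.e. the largest eigenvalue multiplicity
    A = nx.to_numpy_array(G)
    N = A.shape[0]
    eigvals = np.linalg.eigvalsh(A)
    distinct = []
    for lam in eigvals:                      # group numerically equal eigenvalues
        if not distinct or abs(lam - distinct[-1]) > tol:
            distinct.append(lam)
    n_d = 1
    for lam in distinct:
        mult = N - np.linalg.matrix_rank(lam * np.eye(N) - A, tol=tol)
        n_d = max(n_d, mult)
    return int(n_d)

G = nx.erdos_renyi_graph(60, 0.06, seed=1)
n_d = minimum_driver_nodes(G)
print(n_d, 1.0 - n_d / G.number_of_nodes())  # driver count and one efficiency proxy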
the structural controllability can also provide the suitable spots for an intervention by a network manager (ministries, research councils, or steering bodies) as driver nodes with regard to structural changes. according to our results, the proper intervention points tend to be with a low connectivity as well as young suborganizations. in spite of these implications for managing strategies of interorganizational networks, this study has shortcomings that limit the generalizability of our findings. scientific articles represent only part of an institute's capacity for research. major scientific outputs are classified into two types: scientific articles and patents. depending on the major research types, some institutes concentrate on patents instead of publications. as a result, research portfolios derived from richer data sources than were used would more precisely depict institutional research capacity. another limitation of this study is that network properties other than those considered here, such as network density, clustering coefficient, and betweenness centrality, might affect structural efficiency like. furthermore, our findings raised several questions that suggest directions for future research. these include exploring the range of drivers' influence on structural efficiency, determining the optimal network structure to steer, and investigating diverse network properties with other types of players in innovation systems, e.g., academia and industries. what is the appropriate length of the publication period over which to assess research performance? inside collaborative networks: ten lessons for public managers document-document similarity approaches and science mapping: experimental comparison of five approaches statistical mechanics of complex networks error and attack tolerance of complex networks science and technology development in taiwan and south korea modern information retrieval emergence of scaling in random networks reflections on scientific collaboration (and its study): past, present, and future network analysis in the social sciences design and update of a classification system: the ucsd map of science all organizations are public: bridging public and private organizational theories indicators for the dynamics of research organizations: a biomedical case study is commercialization good or bad for science? individual-level evidence from the max planck society influence of governance on regional research network performance links and impacts: the influence of public research on industrial r&d experimental comparison of first and second-order similarities in a scientometric context industrial research and innovation indicators: report of a workshop the igraph software package for complex network research orchestrating innovation networks. acad niche and performance: the moderating role of network embeddedness complex networks: degrees of control knowledge and networks: an experimental test of how network knowledge affects coordination scientific collaboration dynamics in a national scientific system using r&d portfolio management to deal with dynamic risk centrality in social networks conceptual clarification technological infrastructure and international competitiveness centrality in social networks: ii. 
experimental results mapping academic institutions according to their journal publication profile: spanish universities as a case study innovation through initiatives-a framework for building new capabilities in public sector research organizations governing by network: the new shape of the public sector controllability of brain networks the dynamics of r&d network in the it industry knowledge networks: explaining effective knowledge sharing in multiunit companies controllability and observability conditions of linear autonomous systems network stability: opportunity or obstacles? mapping the dynamics of knowledge base of innovations of r&d in bangladesh: triple helix perspective the impact of the crisis on research and innovation policies, in, european commission dg research reinventing public r&d: patent policy and the commercialization of national laboratory technologies control capacity and a random sampling method in exploring controllability of complex networks international student flows between asia, australia, and russia: a network analysis a statistical interpretation of term specificity and its application in retrieval mathematical description of linear dynamical systems managing complex networks: strategies for the public sector public management and policy networks on the shortest spanning subtree of a graph and the traveling salesman problem multidisciplinary team research as an innovation engine in knowledgebased transition economies and implication for asian countries the mutual information of university-industry-government relations: an indicator of the triple helix dynamics a routine for measuring synergy in university-industry-government relations: mutual information as a triple-helix and quadruple-helix indicator structural controllability controllability of complex networks controllability analysis of networks national innovation systems-analytical concept and development tool as budgets tighten, washington talks of shaking up doe labs the roles of research at universities and public labs in economic catch-up. columbia university, initiative for policy dialogue, working paper managing networks: propositions on what managers do and why they do it rise of strategic nets -new modes of value creation do networking centres perform better? an exploratory analysis in psychiatry and gastroenterology/ hepatology in spain the structure and function of complex networks public research institutions: mapping sector trends the emergence of governance in an open source community national innovation systems: why they are important, and how they might be measured and compared mission statements and self-descriptions of german extra-university research institutes: a qualitative content analysis triple helix and the circle of innovation networking and innovation: a systematic review of the evidence convergence and differentiation in institutional change among european public research systems: the decreasing role of public research institutes identifying strategic technology directions in a national laboratory setting: a case study shortest connection networks and some generalizations modes of network governance: structure, management, and effectiveness core concepts and key ideas for understanding public sector organizational networks: using research to inform scholarship and practice a preliminary theory of interorganizational network effectiveness: a comparative study of four community mental health systems do networks really work? 
a framework for evaluating public-sector organizational networks r: a language and environment for statistical computing trends in innovation policy: an international comparison term-weighting approaches in automatic text retrieval introduction to modern information retrieval on the specification of term values in automatic indexing science of science (sci2) tool changing organisation of public-sector research in europe-implications for benchmarking human resources in rtd establishing 'green regionalism': environmental technology generation across east asia and beyond emergence and development of the national innovation systems concept after the reforms: how have public science research organisations changed? r&d manag exploring complex networks dynamic capabilities and strategic management do second-order similarities provide added-value in a hybrid approach knowledge transfer in intraorganizational networks: effects of network position and absorptive capacity on business unit innovation and performance effective leadership in public organizations: the impact of organizational structure in asian countries on the nature, formation, and maintenance of relations among organizations detecting structural change in university research systems: a case study of british research policy optimizing controllability of complex networks by minimum structural perturbations the network structure of the korean blogosphere a strategic management approach for korean public research institutes based on bibliometric investigation exact controllability of complex networks hyeonchae yang is a ph.d. candidate of the graduate program for technology and innovation management at pohang university of science and technology the authors are grateful to the editors of the journal and the reviewers for their support and work throughout the process. this work is supported by mid-career researcher program through the national research foundation of korea (nrf) grant funded by the ministry of science, ict and future planning (2013r1a2a2a04017095). key: cord-285647-9tegcrc3 authors: estrada, ernesto title: fractional diffusion on the human proteome as an alternative to the multi-organ damage of sars-cov-2 date: 2020-08-17 journal: chaos doi: 10.1063/5.0015626 sha: doc_id: 285647 cord_uid: 9tegcrc3 the coronavirus 2019 (covid-19) respiratory disease is caused by the novel coronavirus sars-cov-2 (severe acute respiratory syndrome coronavirus 2), which uses the enzyme ace2 to enter human cells. this disease is characterized by important damage at a multi-organ level, partially due to the abundant expression of ace2 in practically all human tissues. however, not every organ in which ace2 is abundant is affected by sars-cov-2, which suggests the existence of other multi-organ routes for transmitting the perturbations produced by the virus. we consider here diffusive processes through the protein–protein interaction (ppi) network of proteins targeted by sars-cov-2 as an alternative route. we found a subdiffusive regime that allows the propagation of virus perturbations through the ppi network at a significant rate. by following the main subdiffusive routes across the ppi network, we identify proteins mainly expressed in the heart, cerebral cortex, thymus, testis, lymph node, kidney, among others of the organs reported to be affected by covid-19. scitation.org/journal/cha hypothesized as a potential cause of the major complications of the covid-19. 
11, 12 however, it has been found that ace2 has abundant expression on endothelia and smooth muscle cells of virtually all organs. 13 therefore, it should be expected that after sars-cov-2 is present in circulation, it can be spread across all organs. in contrast, both sars-cov and sars-cov-2 are found specifically in some organs but not in others, as shown by in situ hybridization studies for sars-cov. this was already remarked by hamming et al. 13 by stressing that it "is remarkable that so few organs become viruspositive, despite the presence of ace2 on the endothelia of all organs and sars-cov in blood plasma of infected individuals." recently, gordon et al. 14 identified human proteins that interact physically with those of the sars-cov-2 forming a high confidence sars-cov-2-human protein-protein interaction (ppi) system. using this information, gysi et al. 15 discovered that 208 of the human proteins targeted by sars-cov-2 forms a connected component inside the human ppi network. that is, these 208 are not randomly distributed across the human proteome, but they are closely interconnected by short routes that allow moving from one to another in just a few steps. these interdependencies of protein-protein interactions are known to enable that perturbations on one interaction propagate across the network and affect other interactions. [16] [17] [18] [19] in fact, it has been signified that diseases are a consequence of such perturbation propagation. [20] [21] [22] it has been stressed that the protein-protein interaction process requires diffusion in their initial stages. 23 the diffusive processes occur when proteins, possibly guided by electrostatic interactions, need to encounter each other many times before forming an intermediate. 24 not surprisingly, diffusive processes have guided several biologically oriented searches in ppi networks. 25, 26 therefore, we assume here that perturbations produced by sars-cov-2 proteins on the human ppi network are propagated by means of diffusive processes. however, due to the crowded nature of the intra-cell space and the presence in it of spatial barriers, subdiffusive processes more than normal diffusion are expected for these protein-protein encounters. [27] [28] [29] this creates another difficulty, as remarked by batada et al., 23 which is that such (sub)diffusive processes along are not sufficient for carrying out cellular processes at a significant rate in cells. here, we propose the use of a time-fractional diffusion model on the ppi network of proteins targeted by sars-cov-2. the goal is to model the propagation of the perturbations produced by the interactions of human proteins with those of sars-cov-2 through the whole ppi. the subdiffusive process emerging from the application of this model to the sars-cov-2-human ppis has a very small rate of convergence to the steady state. however, this process produces a dramatic increment of the probability that certain proteins are perturbed at very short times. this kind of shock wave effect of the transmission of perturbations occurs at much earlier times in the subdiffusive regime than at the normal diffusion one. therefore, we propose here a switch and restart process in which a subdiffusive process starts at a given protein of the ppi, perturbs a few others, which then become the starting point of a new subdiffusive process and so on. using this approach, we then analyze how the initial interaction of the sars-cov-2 spike protein with a human protein propagates across the whole network. 
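a minimal sketch, with invented protein identifiers rather than the gordon et al. / gysi et al. data, of how the connected component of human proteins targeted by sars-cov-2 could be extracted from a ppi edge list:

import networkx as nx

def largest_targeted_component(ppi_edges, targeted_proteins) -> nx.Graph:
    # ppi_edges: iterable of (protein_a, protein_b) pairs from the human PPI
    # targeted_proteins: set of human proteins that interact with viral proteins
    G = nx.Graph(ppi_edges)
    sub = G.subgraph(targeted_proteins)
    components = sorted(nx.connected_components(sub), key=len, reverse=True)
    return sub.subgraph(components[0]).copy() if components else nx.Graph()

# toy usage with made-up identifiers
edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P5", "P6")]
targets = {"P1", "P2", "P3", "P5", "P6"}
cc = largest_targeted_component(edges, targets)
print(cc.number_of_nodes(), cc.number_of_edges())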
we discover some potential routes of propagation of these perturbations from proteins mainly expressed in the lungs to proteins mainly expressed in other different tissues, such as the heart, cerebral cortex, thymus, lymph node, testis, prostate, liver, small intestine, duodenum, kidney, among others. a. settling a model the problem we intend to model here is of a large complexity as it deals with the propagation of perturbations across a network of interacting proteins, each of which is located in a crowded intracellular space. therefore, we necessarily have to impose restrictions and make assumptions to settle our modeling framework. as we have mentioned in sec. i, protein encounters should necessarily occur in subdiffusive ways due to the crowded environment in which they are embedded, as well as the existence of immobile obstacles such as membranes. by a subdiffusive process, we understand that the mean square displacement of a protein scales as where 0 < κ < 1 is the anomalous diffusion exponent. as observed by sposini et al., 30 these anomalous diffusive processes can emerge from (i) continuous time random walk (ctrw) processes or by (ii) viscoelastic diffusion processes. in the first case, the "anomaly" is created by power-law waiting times in between motion events. this kind of processes is mainly accounted for by the generalized langevin equation with the power-law friction kernel as well as by fractional brownian motion (fbm). while the first processes are characterized by the stretched gaussian displacement probability density, weak ergodicity, and aging, the second ones are ergodic processes characterized by the gaussian density probability distribution. therefore, our first task is to discern which of these two kinds of approaches is appropriate for the current scenario. we start by mentioning that weiss et al. 31 have analyzed data from fluorescence correlation spectroscopy (fcs) for studying subdiffusive biological processes. they have, for instance, reported that membrane proteins move subdiffusively in the endoplasmic reticulum and golgi apparatus in vivo. subdiffusion of cytoplasmatic macromolecules was also reported by weiss et al. 32 using fcs. then, guigas and weiss 27 simulated the way in which the subdiffusive motion of these particles should occur in a crowded intracellular fluid. they did so by assigning diffusive steps from a weirstrass-mandelbrot function yielding a fbm. they stated that ctrw was excluded due to its markovian nature. in another work, szymanski and weiss 33 used fcs and simulations to analyze the subdiffusive motion of a protein in a simulated crowded medium. first, they reported that crowded-induced subdiffusion is consistent with the predictions from fbm or obstructed (percolation-like) diffusion. second, they reported that ctrw does not explain the experimental results obtained by fcs and should not be appropriated for such processes. the time resolution of fcs is in the microsecond range, i.e., 10 −6 s. 34 however, an important question on biological subdiffusion may require higher time resolution to be solved. this is the question of how diffusive processes on short times, while the macromolecule has not felt yet the crowding of the environment, is related to the long-time diffusion. this particular problem was explored experimentally by gupta et al. 
35 by using state-of-the-art neutron chaos article scitation.org/journal/cha spin-echo (nse) and small-angle neutron scattering (sans), which has a resolution in the nanosecond range, i.e., 10 −9 s. their experimental setting was defined by the use of two globular proteins in a crowded environment formed by poly(ethylene oxide) (peo), which mimics a macromolecular environment. in their experiments, nse was used to tackle the fast diffusion process, which corresponds to a dynamics inside a trap built by the environment mesh. sans captures the slow dynamics, which corresponds to the long-time diffusion at macroscopic length scales. from our current perspective, the most important result of this work is that the authors found that in a higher concentration of polymeric solutions, like in the intracellular space, the diffusion is fractional in nature. they showed this by using the fractional fokker-planck equation with a periodic potential. according to gupta et al., 35 this fractional nature of the crossover from fast dynamics to slow macroscopic dynamics is due to the heterogeneity of the polymer mesh in the bulk sample, which may well resemble the intra-cellular environment. as proved by barkai et al. , 36 the fractional fokker-planck equation can be derived from the ctrw, which clearly indicates that the results obtained by gupta et al. point out to the classification of the subdiffusive dynamics into the class (i). we should remark that independently of these results by gupta et al., 35 shorten and sneyd 37 have successfully used the fractional diffusion equation to mimic the protein diffusion in an obstructed media like within skeletal muscle. we notice in passing that the (fractional) diffusion equation can be obtained from the (fractional) fokker-planck equation in the absence of an external force. in closing, because here we are interested in modeling the diffusion of proteins in several human cells, which are highly crowded, and in which we should recover the same crossover between initial fast and later slow dynamics, we will consider a modeling tool of the class (i). in particular, we will focus our modeling on the use of a time-fractional diffusion equation using caputo derivatives. another justification for the use of this model here is that interacting proteins can be in different kinds of cells. thus, we consider that the perturbation of one protein is not necessarily followed by the perturbation of one of its interactors, but a time may mediate between the two processes. this is exactly the kind of processes that the time-fractional diffusion captures. in this work, we always consider g = (v, e) to be an undirected finite network with vertices v representing proteins and edges e representing the interaction between pairs of proteins. let us consider 0 < α ≤ 1 and a function u : [0, ∞) → r, then we denote by d α t u the fractional caputo derivative of u of the order α, which is given by 38 where * denotes the classical convolution product on (0, ∞) and g γ (t) t γ −1 (γ ) , for γ > 0,where (·) is the euler gamma function. observe that the previous fractional derivative has sense whenever the function is derivable and the convolution is defined (for example, if u is locally integrable). the notation g γ is very useful in the fractional calculus theory, mainly by the property g γ * g δ = g γ +δ for all γ , δ > 0. 
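in the usual notation, and consistent with the description above of the convolution with the kernel g_gamma, the caputo derivative of order 0 < alpha <= 1 reads

D^{\alpha}_{t} u(t) = \left(g_{1-\alpha} * u'\right)(t) = \frac{1}{\Gamma(1-\alpha)} \int_{0}^{t} (t-s)^{-\alpha}\, u'(s)\, ds, \qquad g_{\gamma}(t) := \frac{t^{\gamma-1}}{\Gamma(\gamma)}, \quad \gamma > 0 .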
here, we propose to consider the time-fractional diffusion (tfd) equation on the network as with the initial condition x (0) = x 0 , where x i (t) is the probability that the protein i is perturbed at the time t; c is the diffusion coefficient of the network, which we will set hereafter to unity; and l is the graph laplacian, i.e., l = k − a, where k is a diagonal matrix of node degrees and a is the adjacency matrix. this model was previously studied in distributed coordination algorithms for the consensus of multi-agent systems. [39] [40] [41] the use of fractional calculus in the context of physical anomalous diffusion has been reviewed by metzler and klafter. 42 a different approach has been developed by riascos and mateos. 43, 44 it is based on the use of fractional powers of the graph laplacian (see ref. 45 and references therein). the approach has been recently formalized by benzi et al. 46 this method cannot be used in the current framework because it generates only superdiffusive behaviors (see benzi et al. 46 ) and not subdiffusive regimes. another disadvantage of this approach is that it can only be used to positive (semi)definite graph operators, such as the laplacian, but not to adjacency operators such as the one used in tight-binding quantum mechanical or epidemiological approaches (see sec. vi). theorem 1. the solution of the fractional-time diffusion model on the network is where e α,β (γ l) is the mittag-leffler function of the laplacian matrix of a graph. proof. we use the spectral decomposition of the network laplacian l = uλu −1 , where u = ψ 1 · · · ψ n and λ = diag (µ r ). then, we can write let us define y (t) = u −1 x (t) such that d α t x (t) = −uλy (t), and we have as is a diagonal matrix, we can write which has the solution we can replace which finally gives the result in the matrix-vector when written for all the nodes, we can write l = uλu −1 , where u = ψ 1 · · · ψ n and λ = diag (µ r ). then, which can be expanded as where ψ j and φ j are the jth column of u and of u −1 , respectively. because µ 1 = 0 and 0 < µ 2 ≤ · · · ≤ µ n for a connected graph, we have lim where ψ t 1 φ 1 = 1. let us take ψ 1 = 1, such that we have this result indicates that in an undirected and connected network, the diffusive process controlled by the tfd equation always reaches a steady state, which consists of the average of the values of the initial condition. in the case of directed networks (ppi are not directed by nature) or in disconnected networks (a situation that can be found in ppis), the steady state is reached in each (strongly) connected component of the graph. also, because the network is connected, µ 2 makes the largest contribution to e α,1 (−t α l) among all the nontrivial eigenvalues of l. therefore, it dictates the rate of convergence of the diffusion process. we remark that in practice, the steady state lim t→∞ x v (t) − x w (t) = 0, ∀v, w ∈ v is very difficult to achieve. therefore, we use a threshold ε, e.g., ε = 10 −3 , such that lim t→∞ x v (t) − x w (t) = ε is achieved in a relatively small simulation time. due to its importance in this work, we remark the structural meaning of the mittag-leffler function of the laplacian matrix appearing in the solution of the tfd equation. that is, e α,1 (−t α l) is a matrix function, which is defined as where (·) is the euler gamma function as before. 
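a sketch of how the solution x(t) = e_{alpha,1}(-t^alpha c l) x_0 stated in theorem 1 can be evaluated numerically, assuming a connected undirected networkx graph: the symmetric laplacian is diagonalized and the scalar mittag-leffler function is summed term by term with arbitrary-precision arithmetic (mpmath), because the alternating series is ill-conditioned in ordinary floating point for large arguments. the function names and the toy graph are illustrative, not part of the paper.

import numpy as np
import networkx as nx
from mpmath import mp, mpf, gamma

mp.dps = 50  # working precision for the Mittag-Leffler series

def mittag_leffler(z, alpha, beta=1.0, terms=200) -> float:
    # E_{alpha,beta}(z) = sum_k z^k / Gamma(alpha*k + beta), truncated series
    z = mpf(z)
    return float(sum(z**k / gamma(alpha * k + beta) for k in range(terms)))

def tfd_solution(G: nx.Graph, x0: np.ndarray, t: float, alpha: float, c: float = 1.0) -> np.ndarray:
    # solves D_t^alpha x = -c L x, x(0) = x0, via x(t) = E_{alpha,1}(-t^alpha c L) x0
    L = nx.laplacian_matrix(G).toarray().astype(float)
    mu, U = np.linalg.eigh(L)                                   # L = U diag(mu) U^T
    scal = np.array([mittag_leffler(-c * (t**alpha) * m, alpha) for m in mu])
    return U @ (scal * (U.T @ x0))

G = nx.path_graph(6)
x0 = np.zeros(6); x0[0] = 1.0                                   # perturbation starts at node 0
print(np.round(tfd_solution(G, x0, t=0.5, alpha=0.75), 4))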
we remark that for α = 1, we recover the diffusion equation on the network: dx (t) /dt = −lx (t) and its solution e 1,1 (−t α l) = exp (−tcl) is the well-known heat kernel of the graph. we define here a generalization of the diffusion distance studied by coifman and lafon. 47 then, we define the following quantity: we have the following result. proof. the matrix function f (τ l) can be written as f (τ l) = uf (τ λ) u −1 . let ϕ u = ψ 1,u , ψ 2,u , . . . , ψ n,u t . then, therefore, because f (τ l) is positively defined, we can write where consequently, d vw is a square euclidean distance between v and w. in this sense, the vector we have that d vw generalizes the diffusion distance studied by coifman and lafon, which is the particular case when α = 1. let µ j be the jth eigenvalue and ψ ju the uth entry of the jth eigenvector of the laplacian matrix. then, we can write the time-fractional diffusion distance as it is evident that when α = 1, d vw is exactly the diffusion distance previously studied by coifman and lafon. 47 the fractionaltime diffusion distance between every pair of nodes in a network can be represented in a matrix form as follows: where s = f(τ l) 11 , f(τ l) 22 , . . . , f(τ l) nn is a vector whose entries are the main diagonal terms of the mittag-leffler matrix function, 1 is an all-ones vector, and • indicates an entrywise operation. using this matrix, we can build the diffusion distance-weighted adjacency matrix of the network, the shortest diffusion path between two nodes is then the shortest weighted path in w (τ ). lemma 3. the shortest (topological) path distance between two nodes in a graph is a particular case of the time-fractional shortest diffusion path length for τ → 0. proof. let us consider each of the terms forming the definition of the time-fractional diffusion distance and apply the limit of the very small −t α . that is, chaos article scitation.org/journal/cha and in a similar way, therefore, lim , which immediately implies that the time-fractional shortest diffusion path is identical to the shortest (topological) one in the limit of very small τ = −t α . the proteins of sars-cov-2 and their interactions with human proteins were determined experimentally by gordon et al. 14 gysi et al. 15 constructed an interaction network of all 239 human proteins targeted by sars-cov-2. in this network, the nodes represent human proteins targeted by sars-cov-2 and two nodes are connected if the corresponding proteins have been determined to interact with each other. obviously, this network of proteins targeted by sars-cov-2 is a subgraph of the protein-protein interaction (ppi) network of humans. one of the surprising findings of gysi et al. 15 is the fact that this subgraph is not formed by proteins randomly distributed across the human ppi, but they form a main cluster of 208 proteins and a few small isolated components. hereafter, we will always consider this connected component of human proteins targeted by sars-cov-2. this network is formed by 193 proteins, which are significantly expressed in the lungs. gysi et al. 15 reported a protein as being significantly expressed in the lungs if its gtex median value is larger than 5. gtex 48 is a database containing the median gene expression from rna-seq in different tissues. the other 15 proteins are mainly expressed in other tissues. 
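returning to the time-fractional diffusion distance defined above, the following sketch computes the distance matrix and uses it to weight the edges of the graph, so that shortest diffusive paths are obtained with a standard weighted shortest-path routine; whether the distance or its square is used as the edge weight is an implementation choice here, not something fixed by the text, and the graph in the example is a standard benchmark rather than the ppi.

import numpy as np
import networkx as nx
from mpmath import mp, mpf, gamma

mp.dps = 50

def mittag_leffler(z, alpha, beta=1.0, terms=200) -> float:
    z = mpf(z)
    return float(sum(z**k / gamma(alpha * k + beta) for k in range(terms)))

def diffusion_distance_matrix(G: nx.Graph, t: float, alpha: float) -> np.ndarray:
    # D_vw^2 = F_vv + F_ww - 2 F_vw with F = E_{alpha,1}(-t^alpha L)
    L = nx.laplacian_matrix(G).toarray().astype(float)
    mu, U = np.linalg.eigh(L)
    F = U @ np.diag([mittag_leffler(-(t**alpha) * m, alpha) for m in mu]) @ U.T
    d = np.diag(F)
    D2 = d[:, None] + d[None, :] - 2.0 * F
    return np.sqrt(np.maximum(D2, 0.0))

def shortest_diffusive_path(G: nx.Graph, source, target, t: float, alpha: float):
    D = diffusion_distance_matrix(G, t, alpha)
    idx = {v: i for i, v in enumerate(G.nodes)}
    W = G.copy()
    for u, v in W.edges:
        W[u][v]["weight"] = D[idx[u], idx[v]]   # diffusion-distance edge weights
    return nx.shortest_path(W, source, target, weight="weight")

print(shortest_diffusive_path(nx.krackhardt_kite_graph(), 0, 9, t=1.0, alpha=0.75))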
however, in reporting here, the tissues that were proteins are mainly expressed; we use the information reported in the human protein atlas 49 where we use information not only from gtex but also from hpa (see details at the human protein atlas webpage) and fantom5 50 datasets. the ppi network of human proteins targeted by sars-cov-2 is very sparse, having 360 edges, i.e., its edge density is 0.0167, 30% of nodes have a degree (number of connections per protein) equal to one, and the maximum degree of a protein is 14. the second smallest eigenvalue of the laplacian matrix of this network is very small; i.e., µ 2 = 0.0647. therefore, the rate of convergence to the steady state of the diffusion processes taking place on this ppi is very slow. we start by analyzing the effects of the fractional coefficient α on these diffusive dynamics. we use the normal diffusion α = 1 as the reference system. to analyze the effects of changing α over the diffusive dynamics on the ppi network, we consider the solution of the tfd equation for processes starting at a protein with a large degree, i.e., prkaca, degree 14, and a protein with a low degree, i.e., mrps5, degree 3. that is, the initial condition vector consists of a vector having one at the entry corresponding to either prkaca or mrps5 and zeroes elsewhere. in fig. 1 , we display the changes of the probability with the shortest path distance from the protein where the process starts. this distance corresponds to the number of steps that the perturbation needs to traverse to visit other proteins. for α = 1.0, the shapes of the curves in fig. 1 are the characteristic ones for the gaussian decay of the probability with distance. however, for α < 1, we observe that such decay differs from that typical shape showing a faster initial decay followed by a slower one. in order to observe this effect in a better way, we zoomed the region of distances from 2 to 4 [see figs. 1(b) and 1(d)]. as can be seen for distances below 3, the curve for α = 1.0 is on top of those for α < 1, indicating a slower decay of the probability. after this distance, there is an inversion, and the normal diffusion occurs at a much faster rate than the other two for the longer distances. this is a characteristic signature of subdiffusive processes, which starts at much faster rates than a normal diffusive process and then continue at much slower rates. therefore, here, we observe that the subdiffusive dynamics are much faster at earlier times of the process, which is when the perturbation occurs to close nearest neighbors to the initial point of perturbation. to further investigate these characteristic effects of the subdiffusive dynamics, we study the time evolution of a perturbation occurring at a given protein and its propagation across the whole ppi network. in fig. 2 , we illustrate these results for α = 1.0 (a), α = 0.75 (b), and α = 0.5 (c). as can be seen in the main plots of this figure, the rate of convergence of the processes to the steady state is much faster in the normal diffusion (a) than in the subdiffusive one (b) and (c). however, at very earlier times (see insets in fig. 2 ), there is a shock wave increase of the perturbation at a set of nodes. such kind of shock waves has been previously analyzed in other contexts as a way of propagating effects across ppi networks. 17 we have explored briefly about the possible causes of this increase in the concentration for a given subset of proteins. 
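the structural indicators quoted above (edge density, share of degree-one proteins, maximum degree, and the second smallest laplacian eigenvalue mu_2, which sets the rate of convergence to the steady state) can be computed along the following lines; the graph in the example is synthetic, not the actual ppi.

import numpy as np
import networkx as nx

def structural_summary(G: nx.Graph) -> dict:
    degrees = [d for _, d in G.degree()]
    mu = np.sort(np.linalg.eigvalsh(nx.laplacian_matrix(G).toarray().astype(float)))
    return {
        "density": nx.density(G),
        "share_degree_one": degrees.count(1) / len(degrees),
        "max_degree": max(degrees),
        "algebraic_connectivity_mu2": float(mu[1]),   # governs convergence to steady state
    }

print(structural_summary(nx.barabasi_albert_graph(208, 2, seed=7)))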
accordingly, it seems that the main reason for this is the connectivity provided by the network of interactions and not a given distribution of the degrees. for instance, we have observed such "shock waves" in networks with normal-like distributions as well as with power-law ones. however, it is possible that the extension and intensity of such effects depend on the degree distribution as well as on other topological factors. the remarkable finding here is, however, the fact that such a shock wave occurs at much earlier times in the subdiffusive regimes than at the normal diffusion. that is, while for α = 1.0, these perturbations occur at t≈0.1-0.3; for α = 0.75, they occur at t≈0.0-0.2; and for α = 0.5, they occur at t≈0.0-0.1. seeing this phenomenon in the light of what we have observed in the previous paragraph is not strange due to the observation that such processes go at a much faster rate at earlier times, and at short distances, than the normal diffusion. in fact, this is a consequence of the existence of a positive scalar t for which e α,1 (−γ t α ) decreases faster than exp (−γ t) for t ∈ (0, t) for γ ∈ r + and α ∈ r + (see theorem 4.1 in ref. 39 ). hereafter, we will consider the value of α = 0.75 for our experiments due to the fact that it reveals a subdiffusive regime, but the shock waves observed before are not occurring in an almost instantaneous way like when α = 0.5 , which would be difficult from a biological perspective. the previous results put us at a crossroads. first, the subdiffusive processes that are expected due to the crowded nature of the intra-cellular space are very slow for carrying out cellular processes at a significant rate in cells. however, the perturbation shocks occurring at earlier times of these processes are significantly faster than in normal diffusion. to sort out these difficulties, we propose a switching back and restart subdiffusive process occurring in the ppi network. that is, a subdiffusive process starts at a given protein, which is directly perturbed by a protein of sars-cov-2. it produces a shock wave increase of the perturbation in close neighbors of that proteins. then, a second subdiffusive process starts at these newly perturbed proteins, which will perturb their nearest neighbors. the process is repeated until the whole ppi network is perturbed. this kind of "switch and restart processes" has been proposed for engineering consensus protocols in multiagent systems 51 as a way to accelerate the algorithms using subdiffusive regimes. the so-called spike protein (s-protein) of the sars-cov-2 interacts with only two proteins in the human hosts, namely, zdhhc5 and golga7. the first protein, zdhhc5, is not in the main connected component of the ppi network of sars-cov-2 targets. therefore, we will consider here how a perturbation produced by the interaction of the virus s-protein with golga7 is propagated through the whole ppi network of sars-cov-2 targets. golga7 has degree one in this network, and its diffusion is mainly to close neighbors, namely, to proteins separated by two to three edges. when starting the diffusion process at the protein golga7, the main increase in the probability of perturbing another protein is reached for the protein golga3, which increases its probability up to 0.15 at t = 0.2, followed by prkar2a, with a small increase in its probability, 0.0081. then, the process switch and restarts at golga3, which mainly triggers the probability of the protein prkar2a-a major hub of the network. 
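a rough sketch of the switch-and-restart scheme described above, assuming the tfd_solution helper from the earlier sketch is in scope: the subdiffusion is run for a short time from the current frontier, the most strongly perturbed new proteins become the next seeds, and the procedure repeats until the whole network has been reached. the time horizon, the number of seeds kept per round, and the stopping rule are illustrative choices, not the protocol of the paper.

import numpy as np
import networkx as nx

def switch_and_restart(G: nx.Graph, seed, t_short: float = 0.2, alpha: float = 0.75,
                       top_k: int = 3, max_rounds: int = 50):
    nodes = list(G.nodes)
    idx = {v: i for i, v in enumerate(nodes)}
    perturbed, frontier, rounds = {seed}, [seed], []
    for _ in range(max_rounds):
        if len(perturbed) == len(nodes):
            break
        x0 = np.zeros(len(nodes))
        for v in frontier:                       # restart from the current seeds
            x0[idx[v]] = 1.0 / len(frontier)
        x = tfd_solution(G, x0, t_short, alpha)  # early-time subdiffusive perturbation
        ranked = sorted((v for v in nodes if v not in perturbed),
                        key=lambda v: x[idx[v]], reverse=True)
        frontier = ranked[:top_k]
        perturbed.update(frontier)
        rounds.append(list(frontier))
    return rounds

print(switch_and_restart(nx.karate_club_graph(), seed=0))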
once we start the process at prkar2a, practically, the whole network is perturbed with probabilities larger than 0.1 for 19 proteins apart from golga3. these proteins are in decreasing order of their probability of being perturbed: akap8, prkar2b, cep350, mib1, cdk5rap2, cep135, akap9, cep250, pcnt, cep43, pde4dip, prkaca, tub6cp3, tub6cp2, cep68, clip4, cntrl, plekha5, and ninl. notice that the number of proteins perturbed is significantly larger than the degree of the activator, indicating that not only nearest neighbors are activated. an important criterion for revealing the important role of the protein prkar2a as a main propagator in the network of proteins targeted by sars-cov-2 is its average diffusion path length. this is the average number of steps that a diffusive process starting at this protein needs to perturb all the proteins in the network. we have calculated this number to be 3.6250, which is only slightly larger than the average (topological) path length, which is 3.5673. that is, in less than four steps, the whole network of proteins is activated by a diffusive process starting at prkar2a. also remarkable that the average shortest diffusive path length is almost identical to the shortest (topological) one. this means that this protein mainly uses shortest (topological) paths in perturbing other proteins in the ppi. in other words, it is highly efficient in conducting such perturbations. we will analyze this characteristics of the ppi of human proteins targeted by sars-cov-2 in a further section of this work. at this time, almost any protein in the ppi network is already perturbed. therefore, we can switch and restart the subdiffusion from practically any protein at the ppi network. we then investigate which are the proteins with the higher capacity of activating other proteins that are involved in human diseases. here, we use the database disgenet, 52 which is one of the largest publicly available collections of genes and variants associated with human diseases. we identified 38 proteins targeted by sars-cov-2 for which there is a "definitive" or "strong" evidence of being involved in a human disease or syndrome (see table s1 in the supplementary material). these proteins participate in 70 different human diseases or syndromes as given in tables s2 and s3 of the supplementary material. we performed an analysis in which a diffusive process starts at any protein of the network, and we calculated the average probability that all the proteins involved in human diseases are then perturbed. for instance, for a subdiffusive process starting at the protein arf6, we summed up the probabilities that the 38 proteins involved in diseases are perturbed at an early time of the process t = 0.2. then, we obtain a global perturbation probability of 0.874. by repeating this process for every protein as an initiator, we obtained the top disease activators. we have found that none of the 20 top activators is involved itself in any of the human diseases or syndromes considered here. they are, however, proteins that are important not because of their direct involvement in diseases or syndromes but because they propagate perturbations in a very effective way to those directly involved in such diseases/syndromes. among the top activators, we have found arf6, ecsit, retreg3, stom, hdac2, exosc5, thtpa, among others shown in fig. 3 , where we illustrate the ppi network of the proteins targeted by sars-cov-2 remarking the top 20 disease activators. 
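the activator ranking described in this paragraph can be sketched as follows, again assuming the tfd_solution helper from the earlier sketch: each protein is used in turn as the initial condition, the solution is evaluated at an early time (t = 0.2, as in the text), and the probabilities over a given set of disease-associated proteins are summed. the node labels in the usage line are placeholders, not disgenet entries.

import numpy as np
import networkx as nx

def disease_activation_scores(G: nx.Graph, disease_proteins, t: float = 0.2,
                              alpha: float = 0.75) -> dict:
    nodes = list(G.nodes)
    idx = {v: i for i, v in enumerate(nodes)}
    targets = [idx[v] for v in disease_proteins if v in idx]
    scores = {}
    for v in nodes:                                 # each protein acts as the initiator
        x0 = np.zeros(len(nodes)); x0[idx[v]] = 1.0
        x = tfd_solution(G, x0, t, alpha)
        scores[v] = float(np.sum(x[targets]))       # total perturbation on disease proteins
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

top = disease_activation_scores(nx.karate_club_graph(), disease_proteins=[32, 33])
print(list(top.items())[:3])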
we now consider how a perturbation produced by sars-cov-2 on a protein mainly expressed in the lungs can be propagated to proteins mainly located in other tissues (see table s4 in the supplementary material) by a subdiffusive process. that is, we start the subdiffusive process by perturbing a given protein, which is mainly expressed in the lungs. then, we observe the evolution of the perturbation at every one of the proteins mainly expressed in other tissues. we repeat this process for all the 193 proteins mainly expressed in the lungs. in every case, we record those proteins outside the lungs, which are perturbed at very early times of the subdiffusive process. for instance, in fig. 4 , we illustrate one example in which the initiator is the protein golga2, which triggers a shock wave on proteins rbm41, tl5, and pkp2, which are expressed mainly outside the lungs. we consider such perturbations only if they occur at t < 1. not every one of the proteins expressed outside the lungs is triggered by such shock waves at a very early time of the diffusion. for instance, proteins mark1 and slc27a2 are perturbed in very slow processes and do not produce the characteristic high peaks in the probability at very short times. on the other hand, there are proteins expressed outside the lungs that are triggered by more than one protein from the lungs. the case of golga2 is an example of a protein triggered by three proteins in the lungs. in table i , we list some of the proteins expressed mainly in tissues outside the lungs, which are heavily perturbed by proteins in the lungs. the table i. multi-organ propagation of perturbations. proteins mainly expressed outside the lungs are significantly perturbed during diffusive processes that have started at other proteins expressed in the lungs. act. is the number of lung proteins activators, p tot is the sum of the probabilities of finding the diffusive particle at this protein, and t mean is the average time of activation (see the text for explanations). the tissues of main expression are selected among the ones with the highest consensus normalized expression (nx) levels by combining the data from the three transcriptomics datasets (hpa, gtex, and fantom5) using the internal normalization pipeline. 49 boldface denotes the highest value in each of the columns. act. table s5 of the supplementary material. we give three indicators of the importance of the perturbation of these proteins. they are act., which is the number of proteins in the lungs that activate each of them; p tot , which is the sum of the probabilities of finding the diffusive particle at this protein for diffusive processes that have started in their activators; and t mean , which is the average time required by activators to perturb the corresponding protein. for instance, pkp2 is perturbed by 21 proteins in the lungs, which indicates that this protein, mainly expressed in the heart muscle, has a large chance of being perturbed by diffusive processes starting in proteins mainly located at the lungs. protein prim2 is activated by 5 proteins in the lungs, but if all these proteins were acting at the same time, the probability that prim2 is perturbed will be very high, p tot ≈ 0.536. finally, protein tle5 is perturbed by 13 proteins in the lungs, which needs as an average t mean ≈ 0.24 to perturb tle5. these proteins do not form a connected component among them in the network. the average shortest diffusion path between them is 5.286 with a maximum shortest subdiffusion path of 10. 
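the three indicators of table i can be aggregated from per-activation records along these lines; the record format (activator, peak probability, peak time) is an assumption made for the sketch, and the example values are invented.

from statistics import mean

def organ_indicators(activations) -> dict:
    # activations: list of (lung_protein, peak_probability, peak_time) records
    if not activations:
        return {"act": 0, "p_tot": 0.0, "t_mean": float("nan")}
    return {
        "act": len(activations),                       # number of lung-protein activators
        "p_tot": sum(p for _, p, _ in activations),    # summed perturbation probability
        "t_mean": mean(t for _, _, t in activations),  # mean activation time
    }

print(organ_indicators([("GOLGA2", 0.12, 0.25), ("RAB7A", 0.08, 0.20)]))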
as an average, they are almost equidistant from the rest of the proteins in the network as among themselves. that is, the average shortest subdiffusion path between these proteins expressed outside the lungs and the rest of the proteins in the network is 5.106. therefore, these proteins can be reached from other proteins outside the lungs in no more than six steps in subdiffusive processes like the ones considered here. finally, we study here how the diffusive process determines the paths that the perturbation follows when diffusing from a protein to another not directly connected to it. the most efficient way of propagating a perturbation between the nodes of a network is through the shortest (topological) paths that connect them. the problem for a (sub)diffusive perturbation propagating between the nodes of a network is that it does not have complete information about the topology of the network as to know its shortest (topological) paths. the network formed by the proteins targeted by sars-cov-2 is very sparse, and this indeed facilitates that the perturbations occurs by following the shortest (topological) paths most of the time. think, for instance, in a tree, which has the lowest possible edge density among all connected networks. in this case, the perturbation will always use the shortest (topological) paths connecting pairs of nodes. however, in the case of the ppi network studied here, a normal diffusive process, i.e., α = 1, not always uses the shortest (topological) paths. in this case, there are 1294 pairs of proteins for which the diffusive particle uses a shortest diffusive path, which is one edge longer than the corresponding shortest (topological) path. this represents 6.11% of all total pairs of proteins that are interconnected by a path in the ppi network of proteins targeted by sars-cov-2. however, when we have a subdiffusive process, i.e., α = 0.75, this number is reduced to 437, which represents only 2.06% of all pairs of proteins. therefore, the subdiffusion process studied here through the ppi network of proteins targeted by sars-cov-2 has an efficiency of 97.9% relative to a process that always uses the shortest (topological) paths in hopping between proteins. in fig. 5 , we illustrate the frequency with which proteins not in the shortest (topological) paths are perturbed as a consequence that they are in the shortest subdiffusive paths between other proteins. for instance, the following is a shortest diffusive path between the two end points: rhoa-prkaca-prkar2a-cep43-rab7a-atp6ap1. the corresponding shortest (topological) path is rhoa-mark2-ap2m1-rab7a-atp6ap1, which is one edge smaller. the proteins prkaca, prkar2a, and cep43 are those in the diffusive path that are not in the topological one. repeating this selection process for all the diffusive paths that differs from the topological ones, we obtained the results illustrated in fig. 5 . as can be seen, there are 36 proteins visited by the shortest diffusive paths, which are not visited by the corresponding topological ones. the chaos article scitation.org/journal/cha average degree of these proteins is 7.28, and there is only a small positive trend between the degree of the proteins and the frequency with which they appear in these paths; e.g., the pearson correlation coefficient is 0.46. we have presented a methodology that allows the study of diffusive processes in (ppi) networks varying from normal to subdiffusive regimes. 
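before turning to the discussion, a sketch of the efficiency comparison just described, assuming the diffusion_distance_matrix helper from the earlier sketch is in scope: for every pair of nodes, count how often the shortest diffusive path (under diffusion-distance edge weights) is longer, in number of edges, than the shortest topological path.

import itertools
import networkx as nx

def diffusive_vs_topological(G: nx.Graph, t: float = 1.0, alpha: float = 0.75) -> float:
    D = diffusion_distance_matrix(G, t, alpha)
    idx = {v: i for i, v in enumerate(G.nodes)}
    W = G.copy()
    for u, v in W.edges:
        W[u][v]["weight"] = D[idx[u], idx[v]]
    longer, total = 0, 0
    for u, v in itertools.combinations(G.nodes, 2):   # assumes a connected graph
        hops_diffusive = len(nx.shortest_path(W, u, v, weight="weight")) - 1
        hops_topological = nx.shortest_path_length(G, u, v)
        total += 1
        longer += hops_diffusive > hops_topological
    return longer / total    # 0 means every diffusive path is also a shortest topological path

print(diffusive_vs_topological(nx.krackhardt_kite_graph()))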
here, we have studied the particular case in which the time-fractional diffusion equation produces a subdiffusive regime, with the use of α = 3/4 in the network of human proteins targeted by sars-cov-2. a characteristic feature of this ppi network is that the second smallest eigenvalue is very small; i.e., µ 2 = 0.0647. as this eigenvalue determines the rate of convergence to the steady state, the subdiffusive process converges very slowly to that state. what it has been surprising is that even in these conditions of very small convergence to the steady state, there is a very early increase of the probability in those proteins closely connected to the initiator of the diffusive process. that is, in a subdiffusive process on a network, the time at which a perturbation is transmitted from the initiator to any of its nearest neighbors occurs at an earlier time than for the normal diffusion. this is a consequence of the fact that e α,1 (−γ t α ) decreases very fast at small values of t α , which implies that the perturbation occurring at a protein i at t = 0 is transmitted almost instantaneously to the proteins closely connected to i. this effect may be responsible for the explanation about why subdiffusive processes, which are so globally slow, can carry out cellular processes at a significant rate in cells. we have considered here a mechanism consisting in switching and restarting several times during the global cellular process. for instance, a subdiffusive process starting at the protein i perturbs its nearest neighbors at very early times, among which we can find the protein j. then, a new subdiffusive process can be restarted again at the node j and so on. one of the important findings of using the current model for the study of the pin of proteins affected by sars-cov-2 is the identification of those proteins that are expressed outside the lungs that can be more efficiently perturbed by those expressed in the lungs (see table i ). for instance, the protein with the largest number of activators, pkp2, appears mainly in the heart muscle. it has been observed that the elevation of cardiac biomarkers is a prominent feature of covid-19, which in general is associated with a worse prognosis. 53 myocardial damage and heart failure are responsible for 40% of death in the wuhan cohort (see references in ref. 53) . although the exact mechanism involving the heart injury is not known, the hypothesis of direct myocardial infection by sars-cov-2 is a possibility, which acts along or in combination with the increased cardiac stress due to respiratory failure and hypoxemia, and/or with the indirect injury from the systemic inflammatory response. [53] [54] [55] [56] as can be seen in table i , the testis is the tissue where several of the proteins targeted by sars-cov-2 are mainly expressed, e.g., cep43, tle5, prim2, mipol1, reep6, hook1, cenpf, trim59, and mark1. currently, there is no conclusive evidence about the testis damage by sars-cov-2. 57-60 however, the previous sars-cov that appeared in 2003 and which shares 82% of proteins with the current one produced testis damage and spermatogenesis, and it was concluded that orchitis was a complication of that previous sars disease. 57 we also detect a few proteins mainly expressed in different brain tissues, such as cep135, prim2, trim59, and mark1. 
the implication of sars-cov-2 and cerebrovascular diseases has been reported, including neurological manifestations as well as cerebrovascular disease, such as ischemic stroke, cerebral venous thrombosis, and cerebral hemorrhage. [61] [62] [63] kidney damage in sars-cov-2 patients has been reported, 64-66 which includes signs of kidney dysfunctions, proteinuria, hematuria, increased levels of blood urea nitrogen, and increased levels of serum creatinine. as much as 25% of an acute kidney injury has been reported in the clinical setting of sars-cov-2 patients. one of the potential mechanisms for kidney damage is the organ crosstalk, 64 as can be the mechanism of diffusion from proteins in the lungs to proteins in the urinary tract and kidney proposed here. a very interesting observation from table i is the existence of several proteins expressed mainly in the thymus and t-cells, such as tle5, retreg3, rbm41, cenpf, and trim59. it has been reported that many of the patients affected by sars-cov-2 in wuhan displayed a significant decrease of t-cells. 67 thymus is an organ that displays a progressive decline with age with reduction of the order of 3%-5% a year until approximately 30-40 years of age and of about 1% per year after that age. consequently, it was proposed that the role of thymus should be taken into account in order to explain why covid-19 appears to be so mild in children. 67 the protein tle5 is also expressed significantly in the lymph nodes. it was found by feng et al. 68 that sars-cov-2 induces lymph follicle depletion, splenic nodule atrophy, histiocyte hyperplasia, and lymphocyte reductions. the proteins hook1 and mipol1 are significantly expressed in the pituitary gland. there has been some evidence and concerns that covid-19 may also damage the hypothalamo-pituitary-adrenal axis that has been expressed by pal, 69 which may be connected with the participation of the previously mentioned proteins. another surprising finding of the current work is the elevated number of subdiffusive shortest paths that coincide with the shortest (topological) paths connecting pairs of proteins in the ppi of human proteins targeted by sars-cov-2. this means that the efficiency of the diffusive paths connecting pairs of nodes in this ppi is almost 98% in relation to a hypothetical process that uses the shortest (topological) paths in propagating perturbations between pairs of proteins. the 437 shortest diffusive paths reported here contain one more edge than the corresponding shortest (topological) paths. the proteins appearing in these paths would never be visited in the paths connecting two other proteins if only the shortest (topological) paths were used. what is interesting to note that 6 out of the 15 proteins that are mainly expressed outside the lungs are among the ones "crossed" by these paths. they are tle5 (thymus, lymph node, testis), pkp2 (heart muscle), cep135 (skeletal muscle, heart muscle, cerebral cortex, cerebellum), cep43 (testis), rbm41 (pancreas, t-cells, testis, retina), and retreg3 (prostate, thymus). this means that the perturbation of these proteins occurs not only through the diffusion from other proteins in the lungs directly to them, but also through some "accidental" diffusive paths between pairs of proteins that are both located in the lungs. all in all, the use of time-fractional diffusive models to study the propagation of perturbations on ppi networks seems a very promising approach. 
the model is not only biologically sound but it also allows us to discover interesting hidden patterns of the interactions between proteins and the propagation of perturbations among them. in the case of the pin of human proteins targeted by sars-cov-2, our current findings may help to understand potential molecular mechanisms for the multi-organ and systemic failures occurring in many patients. after this work was completed, qiu et al. 75 uploaded the manuscript entitled "postmortem tissue proteomics reveals the pathogenesis of multiorgan injuries of covid-19." the authors profiled the host responses to covid-19 by means of quantitative proteomics in postmortem samples of tissues in lungs, kidney, liver, intestine, brain, and heart. they reported differentially expressed proteins (deps) for these organs as well as virus-host ppis between 23 virus proteins and 110 interacting deps differentially regulated in postmortem lung tissues. according to their results, most deps (70.5%) appear in the lungs, followed by the kidney (16.5%). additionally, qiu et al. 75 identified biological processes that were up- or down-regulated in the six postmortem tissue types. they found that most up-regulated processes in the lungs correspond to processes related to the response to inflammation and to the immune response. however, pathways related to cell morphology, such as the establishment of endothelial barriers, were down-regulated in the lungs, which was interpreted as a confirmation that the lungs are the main battlefield between virus and host. other fundamental processes in the six organs analyzed postmortem were significantly down-regulated, including processes related to organ movement, respiration, and metabolism. from the 59 proteins that we reported here as the ones with the largest effect on perturbing those 38 proteins identified in human diseases (see table s3 in the supplementary material), 18 were found to be down-regulated in the lungs by qiu et al. 75 if we make the corresponding adjustment, by considering that qiu et al. 75 considered 110 instead of 209 proteins in the ppi, the previous number represents 58.1% of the proteins predicted here and experimentally found to be down-regulated in the lungs. of the rest of the proteins, which were not found to have the largest effect on perturbing proteins identified in human disease, only 29.1% were reported by qiu et al. 75 to be down-regulated in the postmortem analysis of patients' lungs. among the proteins reported in table s3 of the supplementary material and by qiu et al., we have arf6, rtn4, rab7a, gng5, reep5, vps11, rhoa, and rab5c, among others. finally, among the proteins mainly expressed outside the lungs that are predicted in this work to be significantly perturbed, we have five that were found by qiu et al. 75 to be up-regulated in the different organs analyzed by them. from the proteins included in table i, qiu et al. 75 reported the following ones as up-regulated: pkp2 (heart), reep6 (liver), hook1 (several organs), atp5me (heart), and slc27a2 (liver and kidney). they also reported cep43 (reported as fgfr1op) as down-regulated in the brain. we should remark that we have considered here many more organs than the six studied by qiu et al. 75 there is no doubt that, in considering a diffusive propagation of perturbations among proteins in a ppi, we have made a few simplifications and assumptions. every protein is embedded in an intracellular crowded environment, which drives its diffusive mechanism. nowadays, it is well established that this environment is conducive to molecular subdiffusive processes. as remarked by guigas and weiss, 27 far from obstructing cellular processes, subdiffusion increases the probability that a given protein finds a nearby target, and therefore it facilitates protein-protein interactions. the current approach can be improved using two recently developed theoretical frameworks: (i) metaplexes and (ii) d-path laplacian operators on graphs. a ppi metaplex 70 is a 4-tuple ϒ = (v, e, i, ω), where (v, e) is a graph, ω = {ω_j}, j = 1, ..., k, is a set of locally compact metric spaces ω_j with borel measures µ_j, and i : v → ω assigns one of these spaces to each node; the construction is illustrated in fig. 6. fig. 6. the ppi metaplex: every node of the ppi corresponds to a protein and its crowded intracellular space; there is an internal dynamics inside the nodes and an external dynamics between the nodes. then, we define a dynamical system on the metaplex ϒ = (v, e, i, ω) as a tuple (h, t). here, h = {h_v : l²(ω_i(v), µ_i(v)) → l²(ω_i(v), µ_i(v))}, v ∈ v, is a family of operators such that the initial value problem ∂_t u_v = h_v(u_v), u_v|_{t=0} = u_0, is well-posed, and t = {t_vw}, (v, w) ∈ e, is a family of bounded operators t_vw : l²(ω_i(v), µ_i(v)) → l²(ω_i(w), µ_i(w)). this means that inside a node of the metaplex we consider one protein and its crowded intracellular space. inside the nodes, we can have a dynamics like the time-fractional diffusion equation, the fractional fokker-planck equation, or any other dynamics in a continuous space. the inter-node dynamics is then dominated by a graph-theoretic diffusive model like the one presented here. the second possible improvement to the current model can be made by introducing the possibility of long-range interactions in the inter-node dynamics of the ppi metaplex. that is, instead of considering the time-fractional diffusion equation, which only accounts for subdiffusive processes in the graph, we can use a generalization of the form d_t^α u(t) = −∑_{d≥1} d^(−s) l_d u(t), which incorporates the d-path laplacian operators, 71 where l_d is a generalization of the graph laplacian operator to account for long-range hops between nodes in a graph, d is the shortest path distance between two nodes, and s > 0 is a parameter. this equation has never been used before, except for the case α = 1, where superdiffusive behavior was proved in the 1- and 2-dimensional cases. 72, 73 other approaches have also been recently used for similar purposes in the literature. 74 we then hope that the combination of metaplexes and time- and space-fractional diffusive models will capture more of the details of protein-protein interactions in crowded cellular environments. see the supplementary material for a list of proteins targeted by sars-cov-2 which are found in the disgenet database as displaying "definitive" or "strong" evidence of participating in human diseases; the disease id is the code of the disease in disgenet. a list of proteins with the largest effect on perturbing those 38 proteins identified in human diseases is also provided, where p_tot is the sum of the probabilities that the given protein activates those identified as having "definitive" or "strong" evidence of being involved in a human disease. there is also an rna expression overview for the proteins targeted by sars-cov-2 and mainly expressed outside the lungs, for which we selected the top rna expression values in the four databases reported in the human protein atlas, together with a list of proteins mainly expressed outside the lungs and their major activators, which are proteins mainly expressed in the lungs.
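for readers who want the inter-node model written out explicitly, the following latex sketch summarizes the time-fractional network diffusion formalism discussed above and its d-path laplacian generalization; the rate constant γ, the diameter bound Δ and the exact form of the generalized equation are assumptions consistent with the cited path-laplacian literature rather than expressions quoted from this article.

```latex
% Sketch (not verbatim from the article) of the time-fractional network
% diffusion model and its d-path Laplacian generalization.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Caputo time-fractional derivative of order $0<\alpha\le 1$:
\begin{equation}
  D_t^{\alpha}u(t)=\frac{1}{\Gamma(1-\alpha)}\int_{0}^{t}\frac{u'(\tau)}{(t-\tau)^{\alpha}}\,d\tau .
\end{equation}
Subdiffusive dynamics on the PPI graph with Laplacian $L$ and rate $\gamma>0$:
\begin{equation}
  D_t^{\alpha}u(t)=-\gamma L\,u(t), \qquad
  u(t)=E_{\alpha}\!\bigl(-\gamma t^{\alpha}L\bigr)\,u(0),
\end{equation}
where $E_{\alpha}$ denotes the Mittag--Leffler matrix function.
Generalization with Mellin-transformed $d$-path Laplacians $L_{d}$:
\begin{equation}
  D_t^{\alpha}u(t)=-\sum_{d=1}^{\Delta} d^{-s}\,L_{d}\,u(t), \qquad s>0,
\end{equation}
with $\Delta$ the graph diameter; the case $\alpha=1$ recovers the
superdiffusive regime studied on one- and two-dimensional lattices.
\end{document}
```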
the author thanks dr. deisy morselli gysi for sharing data and information. the author is indebted to two anonymous referees whose insightful comments helped in improving this work substantially. the data that support the findings of this study are available within the article and its supplementary material and also from the corresponding author upon reasonable request.
references (article titles):
the species severe acute respiratory syndrome-related coronavirus: classifying 2019-ncov and naming it sars-cov-2
a pneumonia outbreak associated with a new coronavirus of probable bat origin
a new coronavirus associated with human respiratory disease in china
review of the clinical characteristics of coronavirus disease 2019 (covid-19)
coronavirus disease 2019 (covid-19): a clinical update
covid-19 and multi-organ response
genomic characterization of the 2019 novel human-pathogenic coronavirus isolated from a patient with atypical pneumonia after visiting wuhan
angiotensin-converting enzyme 2 is a functional receptor for the sars coronavirus
sars-cov-2 cell entry depends on ace2 and tmprss2 and is blocked by a clinically proven protease inhibitor
specific ace2 expression in cholangiocytes may cause liver damage after 2019-ncov infection
single-cell rna expression profiling of ace2, the putative receptor of wuhan 2019-ncov
tissue distribution of ace2 protein, the functional receptor for sars coronavirus: a first step in understanding sars pathogenesis
a sars-cov-2-human protein-protein interaction map reveals drug targets and potential drug repurposing
network medicine framework for identifying drug repurposing opportunities for covid-19
specificity and stability in topology of protein networks
perturbation waves in proteins and protein networks: applications of percolation and game theories in signaling and drug design
predicting perturbation patterns from the topology of biological networks
modeling and simulating networks of interdependent protein interactions
network medicine: a network-based approach to human disease
protein interaction networks in medicine and disease
human diseases through the lens of network biology
stochastic model of protein-protein interaction: why signaling proteins need to be colocalized
accounting for conformational changes during protein-protein docking
inferring novel tumor suppressor genes with a protein-protein interaction network and network diffusion algorithms
information flow in interaction networks
sampling the cell with anomalous diffusion: the discovery of slowness
understanding biochemical processes in the presence of sub-diffusive behavior of biomolecules in solution and living cells
protein motion in the nucleus: from anomalous diffusion to weak interactions
random diffusivity from stochastic equations: comparison of two models of brownian yet non-gaussian diffusion
anomalous protein diffusion in living cells as seen by fluorescence correlation spectroscopy
anomalous subdiffusion is a measure for cytoplasmic crowding in living cells
elucidating the origin of anomalous diffusion in crowded fluids
in a mirror dimly: tracing the movements of molecules in living cells
protein entrapment in polymeric mesh: diffusion in crowded environment with fast process on short scales
from continuous time random walks to fractional fokker-planck equation
a mathematical analysis of obstructed diffusion within skeletal muscle
fractional calculus and waves in linear viscoelasticity: an introduction to mathematical models
distributed coordination algorithms for multiple fractional-order systems
distributed coordination of networked fractional-order systems
consensus of networked multi-agent systems with delays and fractional-order dynamics
the random walk's guide to anomalous diffusion: a fractional dynamics approach
long-range navigation on complex networks using lévy random walks
fractional dynamics on networks: emergence of anomalous diffusion and lévy flights
fractional dynamics on networks and lattices
nonlocal network dynamics via fractional graph laplacians
diffusion maps
the genotype-tissue expression (gtex) project
tissue-based map of the human proteome
a promoter-level mammalian expression atlas
convergence speed of a fractional order consensus algorithm over undirected scale-free networks
the disgenet knowledge platform for disease genomics: 2019 update
covid-19 and the heart
cardiac involvement in a patient with coronavirus disease 2019 (covid-19)
covid-19 and the cardiovascular system
coronavirus disease 2019 (covid-19) and cardiovascular disease
sars-cov-2 and the testis: similarity to other viruses and routes of infection
rising concern on damaged testis of covid-19 patients
the need for urogenital tract monitoring in covid-19
ace2 expression in kidney and testis may cause kidney and testis damage after 2019-ncov infection
covid-19, angiotensin receptor blockers, and the brain
pulmonary, cerebral, and renal thromboembolic disease associated with covid-19 infection
a case of coronavirus disease 2019 with concomitant acute cerebral infarction and deep vein thrombosis
kidney involvement in covid-19 and rationale for extracorporeal therapies
acute kidney injury in sars-cov-2 infected patients
caution on kidney dysfunctions of covid-19 patients
additional hypotheses about why covid-19 is milder in children than adults
the novel severe acute respiratory syndrome coronavirus 2 (sars-cov-2) directly decimates human spleens and lymph nodes
covid-19, hypothalamo-pituitary-adrenal axis and clinical implications
metaplex networks: influence of the exo-endo structure of complex systems on diffusion
path laplacian matrices: introduction and application to the analysis of consensus in networks
path laplacian operators and superdiffusive processes on graphs. i. one-dimensional case
path laplacian operators and superdiffusive processes on graphs. ii. two-dimensional lattice
hopping in the crowd to unveil network topology
key: cord-295307-zrtixzgu authors: delgado-chaves, fernando m.; gómez-vela, francisco; divina, federico; garcía-torres, miguel; rodriguez-baena, domingo s. title: computational analysis of the global effects of ly6e in the immune response to coronavirus infection using gene networks date: 2020-07-21 journal: genes (basel) doi: 10.3390/genes11070831 sha: doc_id: 295307 cord_uid: zrtixzgu gene networks have arisen as a promising tool in the comprehensive modeling and analysis of complex diseases. particularly in viral infections, the understanding of the host-pathogen mechanisms, and the immune response to these, is considered a major goal for the rational design of appropriate therapies. for this reason, the use of gene networks may well encourage therapy-associated research in the context of the coronavirus pandemic, orchestrating experimental scrutiny and reducing costs.
in this work, gene co-expression networks were reconstructed from rna-seq expression data with the aim of analyzing the time-resolved effects of gene ly6e in the immune response against the coronavirus responsible for murine hepatitis (mhv). through the integration of differential expression analyses and the exploration of the reconstructed networks, significant differences in the immune response to the virus were observed in ly6e ∆hsc animals compared to wild type animals. results show that ly6e ablation in hematopoietic stem cells (hscs) leads to a progressively impaired immune response in both liver and spleen. specifically, depletion of the normal leukocyte-mediated immunity and chemokine signaling is observed in the liver of ly6e ∆hsc mice. on the other hand, the immune response in the spleen, which seemed to be mediated by an intense chromatin activity in the normal situation, is replaced by ecm remodeling in ly6e ∆hsc mice. these findings, which require further experimental characterization, could be extrapolated to other coronaviruses and motivate the efforts towards novel antiviral approaches. the recent sars-cov-2 pandemic has exerted an unprecedented pressure on the scientific community in the quest for novel antiviral approaches. a major concern regarding sars-cov-2 is the capability of the coronaviridae family to cross the species barrier and infect humans [1]. this, along with the tendency of coronaviruses to mutate and recombine, represents a significant threat to global health, which ultimately has put interdisciplinary research on the warpath towards the development of a vaccine or antiviral treatments. given the similarities found amongst the members of the coronaviridae family [2,3], analyzing the global immune response to coronaviruses may shed some light on the natural control of viral infection, and inspire prospective treatments. this may well be achieved from the perspective of systems biology, in which the interactions between the biological entities involved in a certain process are represented by means of a mathematical system [4]. within this framework, gene networks (gn) have become an important tool in the modeling and analysis of biological processes from gene expression data [5]. gns constitute an abstraction of a given biological reality by means of a graph composed of nodes and edges. in such a graph, nodes represent the biological elements involved (i.e., genes, proteins or rnas) and edges represent the relationships between the nodes. in addition, gns are also useful to identify genes of interest in biological processes, as well as to discover relationships among these. thus, they provide a comprehensive picture of the studied processes [6,7]. among the different types of gns, gene co-expression networks (gcns) are widely used in the literature due to their computational simplicity and good performance in the study of biological processes or diseases [8] [9] [10]. gcns usually compute pairwise co-expression indices for all genes. then, the level of interaction between two genes is considered significant if its score is higher than a certain threshold, which is set ad hoc. traditionally, statistical-based co-expression indices have been used to calculate the dependencies between genes [5,7]. some of the most popular correlation coefficients are pearson, kendall or spearman [11] [12] [13]. despite their popularity, statistical-based measures present some limitations [14]; a brief illustrative example is sketched below.
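as a small, hedged illustration of that limitation (independent of the datasets analyzed in this work), the r snippet below compares the pearson, spearman and kendall coefficients on a perfectly determined but non-monotonic relationship, where all three stay near zero, and contrasts them with a mutual-information score; the variable names are illustrative only.

```r
# Illustrative sketch: correlation-based co-expression scores can miss a
# perfectly determined but non-monotonic relationship between two profiles.
set.seed(1)
x <- rnorm(200)   # expression of a hypothetical gene A
y <- x^2          # hypothetical gene B is fully determined by A, non-monotonically

cor(x, y, method = "pearson")    # ~0: no linear association detected
cor(x, y, method = "spearman")   # ~0: no monotonic association detected
cor(x, y, method = "kendall")    # ~0: same limitation

# an information-theoretic score detects the dependence
# (mutual information on discretized profiles via the 'infotheo' package)
library(infotheo)
mutinformation(discretize(x), discretize(y))   # clearly greater than zero
```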
for instance, they are not capable of identifying non-linear interactions and, in the case of parametric correlation coefficients, they depend on the underlying data distribution. in order to overcome some of these limitations, new approaches, e.g., the use of information theory-based measures or ensemble approaches, are receiving much attention [15] [16] [17]. gene co-expression networks (gcns) have already been applied to the study of high-impact diseases, such as cancer [18], diabetes [19] or viral infections (e.g., hiv), in order to study the role of the immune response in these illnesses [20,21]. genetic approaches are expected to be the best strategy to understand viral infection and the immune response to it, potentially identifying the mechanisms of infection and assisting the design of strategies to combat infection [22,23]. the current gene expression profiling platforms, in combination with high-throughput sequencing, can provide time-resolved transcriptomic data, which can be related to the infection process. the main objective of this approach is to generate knowledge on the functioning of the immune system upon viral entry into the organism, which represents a perturbation to the system. in the context of viral infection, a first defense line is the innate response mediated by interferons, a type of cytokine which eventually leads to the activation of several genes with antiviral function [24]. globally, these genes are termed interferon-stimulated genes (isgs), and regulate processes like inflammation, chemotaxis or macrophage activation, among others. furthermore, isgs are also involved in the subsequent acquired immune response, specific for the viral pathogen detected [25]. gene ly6e (lymphocyte antigen 6 family member e), which has been related to t cell maturation and tumorigenesis, is amongst the isgs [26]. this gene is transcriptionally active in a variety of tissues, including liver, spleen, lung, brain, uterus and ovary. its role in viral infection has been elusive due to contradictory findings [27]. for example, in liu et al. [28], ly6e was associated with resistance to marek's disease virus (mdv) in chickens. moreover, differences in the immune response to mouse adenovirus type 1 (mav-1) have been attributed to ly6e variants [29]. conversely, ly6e has also been related to an enhancement of human immunodeficiency virus (hiv-1) pathogenesis, by promoting hiv-1 entry through virus-cell fusion processes [30]. also, in the work by mar et al. [31], the loss of function of ly6e due to gene knockout reduced the infectivity of influenza a virus (iav) and yellow fever virus (yfv). this enhancing effect of ly6e on viral infection has also been observed in other enveloped rna viruses such as west nile virus (wnv), dengue virus (den), zika virus (zikv), o'nyong nyong virus (onnv) and chikungunya virus (chikv), among others [32]. nevertheless, the exact mechanisms through which ly6e modulates viral infection virus-wise, and sometimes even cell type-dependently, require further characterization. in this work we present a time-resolved study of the immune response of mice to a coronavirus, the murine hepatitis virus (mhv), in order to analyze the implications of gene ly6e. to do so, we have applied a gcn reconstruction method called engnet [33], which applies an ensemble strategy combining three different co-expression measures, followed by a topology optimization of the final network.
engnet has outscored other methods in terms of network precision and reduced network size, and has been proven useful in the modeling of disease, as in the case of human post-traumatic stress disorder. the rest of the paper is organized as follows. in the next section, we propose a description of related works. in section 3, we first describe the dataset used in this paper, and then we introduce the engnet algorithm and the different methods used to infer and analyze the generated networks. the results obtained are detailed in section 4, while, in section 5, we propose a discussion of the results presented in the previous section. finally, in section 6, we draw the main conclusions of our work. as already mentioned, gene co-expression networks have been extensively applied in the literature for the understanding of the mechanisms underlying complex diseases like cancer, diabetes or alzheimer [34] [35] [36] . globally, gcn serve as an in silico genetic model of these pathologies, highlighting the main genes involved in these at the same time [37] . besides, the identification of modules in the inferred gcns, may lead to the discovery of novel biomarkers for the disease under study, following the 'guilt by association' principle. along these lines, gcns are also considered suitable for the study of infectious diseases, as those caused by viruses to the matter at hand [38] . to do so, multiple studies have analyzed the effects of viral infection over the organism, focusing on immune response or tissue damage [39, 40] . for instance, the analysis of gene expression using co-expression networks is shown in the work by pedragosa et al. [41] , where the infection caused by lymphocytic choriomeningitis virus (lcmv) is studied over time in mice spleen using gcns. in ray et al. [42] , gcns are reconstructed from different microarray expression data in order to study hiv-1 progression, revealing important changes across the different infection stages. similarly, in the work presented by mcdermott et al. [43] , the over-and under-stimulation of the innate immune response to severe acute respiratory syndrome coronavirus (sars-cov) infection is studied. using several network-based approaches on multiple knockout mouse strains, authors found that ranking genes based on their network topology made accurate predictions of the pathogenic state, thus solving a classification problem. in [39] , co-expression networks were generated by microarray analysis of pediatric influenza-infected samples. thanks to this study, genes involved in the innate immune system and defense to virus were revealed. finally, in the work by pan et al. [44] , a co-expression network is constructed based on differentially-expressed micrornas and genes identified in liver tissues from patients with hepatitis b virus (hbv). this study provides new insights on how micrornas take part in the molecular mechanism underlying hbv-associated acute liver failure. the alarm posed by the covid-19 pandemic has fueled the development of effective prevention and treatment protocols for 2019-ncov/sars-cov-2 outbreak [45] . due to the novelty of sars-cov-2, recent research takes similar viruses, such as sars-cov and middle east respiratory syndrome coronavirus (mers-cov), as a starting point. other coronaviruses, like mouse hepatitis virus (mhv), are also considered appropriate for comparative studies in animal models, as demonstrated in the work by de albuquerque et al. [46] and ding et al. [47] . 
mhv is a murine coronavirus (m-cov) that causes an epidemic illness with high mortality, and has been widely used for experimentation purposes. works like the ones by case et al. [48] and gorman et al. [49] , study the innate immune response against mhv arbitrated by interferons, and those interferon-stimulated genes with potential antiviral function. this is the case of gene ly6e, which has been shown to play an important role in viral infection, as well as various orthologs of the same gene [50, 51] . mechanistic approaches often involved the ablation of the gene under study, like in the work by mar et al. [31] , where gene knockout was used to characterize the implications of ly6e in influenza a infection. as it is the case of giotis et al. [52] , these studies often involve global transcriptome analyses, via rna-seq or microarrays, together with computational efforts, which intend to screen the key elements of the immune system that are required for the appropriate response. this approach ultimately leads experimental research through predictive analyses, as in the case of co-expression gene networks [53] . in the following subsections, the main methods and gcn reconstruction steps are addressed. first, in section 3.1, the original dataset used in the present work is described, together with the experimental design. then, in section 4.1, the data preprocessing steps are described. subsequently in section 3.3, key genes controlling the infection progression are extracted through differential expression analyses. finally, the inference of gcns and their analysis are detailed in sections 3.4 and 3.5, respectively. the original experimental design can be described as follows. the progression of the mhv infection at genetic level was evaluated in two genetic backgrounds: wild type (wt, ly6efl/fl) and ly6e knockout mutants (ko, ly6e ∆hsc ). the ablation of gene ly6e in all cell types is lethal, hence the ly6e ∆hsc strain contains a disrupted version of gene ly6e only in hematopoietic stem cells (hsc), which give rise to myeloid and lymphoid progenitors of all blood cells. wild type and ly6e ∆hsc mice were injected intraperitoneally with 5000 pfu mhv-a59. at 3 and 5 days post-injection (d p.i.), mice were euthanized and biological samples for rna-seq were extracted. the overall effects of mhv infection in both wt and ko strains was assessed in liver and spleen. in total 36 samples were analyzed, half of these corresponding to liver and spleen, respectively. from the 18 organ-specific samples, 6 samples correspond to mock infection (negative control), 6 to mhv-infected samples at 3 d p.i. and 6 to mhv-infected samples at 5 d p.i. for each sample, two technical replicates were obtained. libraries of cdna generated from the samples were sequenced using illumina novaseq 6000. further details on sample preparation can be found in the original article by pfaender et al. [54] . for the sake of simplicity, mhv-infected samples at 3 and 5 d p.i. will be termed 'cases', whereas mock-infection samples will be termed 'controls'. the original dataset consists of 72 files, one per sample replicate, obtained upon the mapping of the transcript reads to the reference genome. reads were recorded in three different ways, considering whether these mapped introns, exons or total genes. then, a count table was retrieved from these files by selecting only the total gene counts of each sample replicate file. pre-processing was performed using the edger [55] r package. the original dataset by pfaender et al. 
[54] was retrieved from geo (accession id: gse146074) using the geoquery [56] package. additional files on sample information and treatment were also used to assist the modeling process. by convention, a sequencing depth per gene below 10 is considered negligible [57,58]. genes meeting this criterion are known as low expression genes, and are often removed since they add noise and computational burden to the following analyses [59]. in order to remove genes showing less than 10 reads across all conditions, counts per million (cpm) normalization was performed, so that possible differences between library sizes for both replicates would not affect the result. afterwards, principal component analyses (pca) were performed over the data in order to detect the main sources of variability across samples. pca were accompanied by unsupervised k-medoid clustering analyses, in order to identify different groups of samples. in addition, multidimensional scaling (mds) plots were used to further separate samples according to their features. last, between-sample similarities were assessed through hierarchical clustering. the analyses of differential expression served a two-way purpose: (i) the exploration of the directionality of the gene expression changes upon viral infection, and (ii) the identification of key regulatory elements for the subsequent network reconstruction. in the present application, differentially-expressed genes (deg) were filtered from the original dataset and passed on to the reconstruction process. this approach enabled the modeling of the genetic relationships that are considered of relevance in the presented comparison [60] [61] [62]. in the present work, mice samples were compared organ-wise depending on whether these corresponded to control, 3 d p.i. or 5 d p.i. the identification of deg was performed using the limma [63] r package, which provides non-parametric robust estimation of the gene expression variance. this package includes voom, a method that incorporates rna-seq count data into the limma workbench, originally designed for microarrays [64]. in this case, a minimum log2-fold-change (log2fc) of 2 was chosen, which corresponds to a four-fold change in the gene expression level. p-values were adjusted by benjamini-hochberg [65] and the selected adjusted p-value cutoff was 0.05. in order to generate gene networks, the engnet algorithm was used. this technique, presented in gómez-vela et al. [33], is able to compute gene co-expression networks with a competitive performance compared to other approaches from the literature. engnet performs a two-step process to infer gene networks: (a) an ensemble strategy for reliable co-expression network generation, and (b) a greedy algorithm that optimizes both the size and the topological features of the network. these two features of engnet offer a reliable solution for generating gene networks. in fact, engnet relies on three statistical measures in order to obtain networks. in particular, the measures used are spearman, kendall and normalized mutual information (nmi), which are widely used in the literature for inferring gene networks. engnet uses these measures simultaneously by applying an ensemble strategy based on majority voting, i.e., a relationship will be considered correct if at least 2 of the 3 measures evaluate the relationship as correct. the evaluation is based on different independent thresholds.
in this work, the different thresholds were set to the values originally used in [33] : 0.9, 0.8 and 0.7 for spearman, kendall and nmi, respectively. in addition, as mentioned above, engnet performs an optimization of the topological structure of the networks obtained. this reduction is based on two steps: (i) the pruning of the relations considered of least interest in the initial network, and (ii) the analysis of the hubs present in the network. for this second step of the final network reconstruction, we have selected the same threshold that was used in [33] , i.e., 0.7. through this optimization, the final network produced by engnet results easier to analyze computationally, due to its reduced size. networks were imported to r for the estimation of topology parameters and the addition of network features that are of interest for the latter network analysis and interpretation. these attributes were added to the reconstructed networks to enrich the modeling using the igraph [66] r package. the networks were then imported into cytoscape [67] through rcy3 [68] for examination and analyses purposes. in this case, two kind of analyses were performed: (i) a topological analysis and (ii) an enrichment analysis. regarding the topological analysis, clustering evaluation was performed in order to identify densely connected nodes, which, according to the literature, are often involved in a same biological process [69] . the chosen clustering method was community clustering (glay) [70] , implemented via cytoscape's clustermaker app [71] , which has yielded significant results in the identification of densely connected modules [72, 73] . among the topology parameters, degree and edge betweenness were estimated. the degree of a node refers to the number of its linking nodes. on the other hand, the betweenness of an edge refers to the number of shortest paths which go through that edge. both parameters are considered as a measure of the implications of respectively nodes and edges in a certain network. particularly, nodes whose degree exceeds the average network node degree, the so called hubs, are considered key elements of the biological processes modeled by the network. in this particular case, the distribution of nodes' degree network was analyzed so those nodes whose degree exceeded a threshold were selected as hubs. this threshold is defined as q3 + 1.5 × iqr, where q3 is the third quartile and iqr the interquartile range of the degree distribution. this method has been widely used for the detection of upper outliers in non-parametric distributions [74, 75] , as it is the case. however, the outlier definition does not apply to this distribution since those nodes whose degree are far above the median degree are considered hubs. on the other hand, gene ontology (go) enrichment analysis provides valuable insights on the biological reality modeled by the reconstructed networks. the gene ontology consortium [76] is a data base that seeks for a unified nomenclature for biological entities. go has developed three different ontologies, which describe gene products in terms of the biological processes, cell components or molecular functions in which these are involved. ontologies are built out of go terms or annotations, which provide biological information of gene products. in this case, the clusterprofiler [77] r package, allowed the identification of the statistically over-represented go terms in the gene sets of interest. additional enrichment analyses were performed using david [78] . 
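the following r sketch mimics, under stated assumptions, the majority-voting idea and the hub rule described above; it is not the engnet implementation itself, the nmi normalization shown is only one plausible choice, and the input file is hypothetical.

```r
# Sketch in the spirit of the EnGNet ensemble (not the EnGNet implementation
# itself): three co-expression measures, majority voting with the thresholds
# stated above, and the Q3 + 1.5*IQR rule used to flag hub nodes.
library(infotheo)
library(igraph)

expr <- as.matrix(read.csv("log_expression.csv", row.names = 1))  # genes x samples (assumed file)

sp <- abs(cor(t(expr), method = "spearman"))
kd <- abs(cor(t(expr), method = "kendall"))      # slow for many genes; fine for a sketch

# normalized mutual information on discretized profiles
# (one plausible normalization; EnGNet's exact definition may differ)
mi  <- mutinformation(discretize(as.data.frame(t(expr))))
h   <- diag(mi)                                  # marginal entropies
nmi <- mi / sqrt(outer(h, h))

votes <- (sp >= 0.9) + (kd >= 0.8) + (nmi >= 0.7)
adj   <- 1 * (votes >= 2)                        # majority voting: at least 2 of 3 measures
diag(adj) <- 0
g <- graph_from_adjacency_matrix(adj, mode = "undirected")

# hubs: nodes whose degree exceeds Q3 + 1.5*IQR of the degree distribution
deg  <- degree(g)
hubs <- names(deg[deg > quantile(deg, 0.75) + 1.5 * IQR(deg)])
hubs
```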
for both analyses, the complete genome of mus musculus was selected as background. finally, further details on the interplay of the genes under study were examined using the string database [79]. the reconstruction of gene networks that adequately model viral infection involves multiple steps, which ultimately shape the final outcome. first, in section 4.1, the exploratory analyses and data preprocessing that prompted the modeling rationale are detailed. then, in section 4.2, differential expression is evaluated for the samples of interest. finally, network reconstruction and analysis are addressed in section 4.3. at the end, four networks were generated, in an organ- and genotype-wise manner. a schematic representation of the gcn reconstruction approach is shown in figure 1, which depicts the general scheme for the reconstruction method: the preprocessed data were subjected to exploratory and differential expression analyses, which imposed the reconstruction rationale, and four groups of samples were used to generate four independent networks, respectively modeling the immune response in the liver, both in the wt and the ko situations, and in the spleen, also in the wt and the ko scenarios. in order to remove low expression genes, a sequencing depth of 10 was found to correspond to an average cpm of 0.5, which was selected as threshold. hence, genes whose expression was found over 0.5 cpm in at least two samples of the dataset were maintained, ensuring that only genes which are truly being expressed in the tissue will be studied. the dataset was log2-normalized prior to the following analyses, in accordance with the recommendations posed in law et al. [64]. the results of both pca and k-medoid clustering are shown in figure 2a. clustering of the log2-normalized samples revealed clear differences between liver and spleen samples. also, for each organ, three subgroups of analogous samples that cluster together are identified. these groups correspond to mock infection, mhv-infected mice at 3 d p.i. and mhv-infected mice at 5 d p.i. (dashed lines in figure 2a). finally, subtle differences were observed in homologous samples of different genotypes (figure a1). organ-specific pca revealed major differences between mhv-infected samples of the ly6e ∆hsc and wt genotypes, at both 3 and 5 d p.i. these differences were not observed in the mock infection (control situation). organ-wise pca are shown in figure 2b,c. the distances between same-genotype samples illustrate the infection-prompted genetic perturbation from the uninfected status (control) to 5 d p.i., where clear signs of hepatitis were observed according to the original physiopathology studies [54]. on the other hand, the differences observed between both genotypes are indicative of the role of gene ly6e in the appropriate response to viral infection. these differences are subtle in control samples, but in case samples some composition bias is observed depending on whether these are ko or wt, especially in spleen samples. the comparative analysis of the top 500 most variable genes confirmed the differences observed in the pca, as shown in figure a2. among the four different features of the samples under study (organ, genotype, sample type (case or control) and days post injection), the dissimilarities in terms of genotype were the subtlest. in the light of these exploratory findings, the network reconstruction approach was performed as follows. networks were reconstructed organ-wise, as the organs exhibit notable differences in gene expression.
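the exploratory step described above can be sketched as follows; file and object names are illustrative, and k = 2 is chosen only to mirror the liver/spleen separation reported here.

```r
# Sketch of the exploratory step described above: CPM filtering, log2
# normalization, PCA over samples and k-medoid (PAM) clustering.
# File and object names are illustrative.
library(edgeR)
library(cluster)

counts <- as.matrix(read.csv("gene_counts.csv", row.names = 1))   # genes x samples
keep   <- rowSums(cpm(counts) > 0.5) >= 2        # keep genes with CPM > 0.5 in >= 2 samples
logcpm <- cpm(counts[keep, ], log = TRUE, prior.count = 1)        # log2(CPM + prior)

pca <- prcomp(t(logcpm))                         # samples as observations
summary(pca)$importance[, 1:3]                   # variance explained by the first PCs

# k-medoid clustering on the leading components; k = 2 mirrors the liver/spleen split
pam_fit <- pam(pca$x[, 1:2], k = 2)
table(pam_fit$clustering)
```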
additionally, a main objective of the present work is to evaluate the differences in the genetic response in the wt situation compared to the ly6e ∆hsc ko background, upon the viral infection onset in the two mentioned tissues. for each organ, log2-normalized samples were coerced to generate time-series-like data, i.e., for each genotype, 9 samples will be considered as a set, namely 3 control samples, 3 case samples at 3 d p.i. and 3 case samples at 5 d p.i. both technical replicates were included. this rational design seeks for a gene expression span representative of the infection progress. thereby, control samples may well be considered as a time zero for the viral infection, followed by the corresponding samples at 3 and 5 d p.i. the proposed rationale is supported by the exploratory findings, which position 3 d p.i. samples between control and 5 d p.i. samples. at the same time, the reconstruction of gene expression becomes robuster with increasing number of samples. in this particular case, 18 measuring points are attained for the reconstruction of each one of the four intended networks, since two technical replicates were obtained per sample [80] . the differential expression analyses were performed over the four groups of 9 samples explained above, with the aim of examining the differences in the immune response between ly6e ∆hsc and wt samples. limma -voom differential expression analyses were performed over the log2-normalized counts, in order to evaluate the different genotypes whilst contrasting the three infection stages: control vs. cases at 3 d p.i., control vs. cases at 5 d p.i. and cases at 3 vs. 5 d p.i. the choice of a minimum absolute log2fc ≥ 2, enabled considering only those genes that truly effect changes between wt and ly6e ∆hsc samples, whilst maintaining a relatively computer-manageable number of deg for network reconstruction. the latter is essential for the yield of accurate network sparseness values, as this is a main feature of gene networks [5] . for both genotypes and organs, the results of the differential expression analyses reveal that mhv injection triggers a progressive genetic program from the control situation to the mhv-infected scenario at 5 d p.i., as shown in figure 3a . the absolute number of deg between control vs. cases at 5 d p.i. was considerably larger than in the comparison between control vs. cases at 3 d p.i. furthermore, in all cases, most of the deg in control vs. cases at 3 d p.i. are also differentially-expressed in the control vs. cases at 5 d p.i. comparison, as shown in figure 4 . regarding genes fold change, an overall genetic up-regulation is observed upon infection. around 70% of deg are upregulated for all the comparisons performed for wt samples, as shown in figure 3b . nonetheless, a dramatic reduce in this genetic up-regulation is observed, by contrast, in knockout samples, even limiting upregulated genes to nearly 50% in the control vs. cases at 3 d p.i. comparison of liver ly6e ∆hsc samples. the largest differences are observed in the comparison of controls vs. cases at 5 d p.i ( figures a3 and a4 ). these deg are of great interest for the understanding of the immune response of both wt and ko mice to viral infection. these genes were selected to filter the original dataset for latter network reconstruction. the commonalities between wt and ko control samples for both organs were also verified through differential expression analysis following the same criteria (log2fc > 2, p value < 0.05). 
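a minimal sketch of the limma-voom contrasts just described is given below; the file name, the sample grouping and the design layout are illustrative assumptions rather than the authors' exact script, while the thresholds follow the text (|log2fc| ≥ 2, bh-adjusted p < 0.05).

```r
# Sketch of the limma-voom contrasts described above (hypothetical object and
# file names; thresholds follow the text: |log2FC| >= 2, BH-adjusted p < 0.05).
library(edgeR)
library(limma)

counts <- as.matrix(read.csv("gene_counts.csv", row.names = 1))   # genes x samples
group  <- factor(rep(c("control", "d3", "d5"), each = 6))         # assumed sample grouping

keep   <- rowSums(cpm(counts) > 0.5) >= 2
counts <- counts[keep, ]

design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

v    <- voom(counts, design)
fit  <- lmFit(v, design)
ctr  <- makeContrasts(d5_vs_ctrl = d5 - control,
                      d3_vs_ctrl = d3 - control,
                      d5_vs_d3   = d5 - d3, levels = design)
fit2 <- eBayes(contrasts.fit(fit, ctr))

# differentially expressed genes for one of the three contrasts
deg_5dpi <- topTable(fit2, coef = "d5_vs_ctrl", number = Inf,
                     adjust.method = "BH", p.value = 0.05, lfc = 2)
head(deg_5dpi)
```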
the number of deg between wt and ko control samples was very low (2 in liver and 20 in spleen) and was not considered significant, so these samples were taken as analogous starting points for infection. as stated above, the samples were arranged both organ- and genotype-wise in order to generate networks which would model the progress of the disease in each scenario. gcns were inferred from the log2-normalized expression datasets. a count of 1 was added before log2 normalization to avoid problems with remaining zero values. each network was generated exclusively taking into consideration its corresponding deg at control vs. cases at 5 d p.i., where the larger differences were observed. four networks were then reconstructed from these previously-identified deg for liver wt samples (1133 genes), liver ko samples (1153 genes), spleen wt samples (506 genes) and spleen ko samples (426 genes). this approach results in the modeling of only those relationships that are related to the viral infection. each sample set was then fed to engnet for the reconstruction of the corresponding network. genes that remained unconnected due to weak relationships, which did not exceed the set threshold, were removed from the networks. furthermore, the engnet-generated models outperformed other well-known inference approaches, as detailed in appendix b. topological parameters were estimated and added as node attributes using igraph, together with log2fc, prior to cytoscape import. specifically, networks were simplified by removing potential loops and multiple edges. the topological clustering analysis of the reconstructed networks revealed clear modules in all cases, as shown in figure a5. the number of clusters identified in each network, as well as the number of genes harbored in the clusters, is shown in table a1. as already mentioned, according to gene network theory, nodes contained within the same cluster are often involved in the same biological process [5,81]. in this context, the go-based enrichment analyses over the identified clusters may well provide an idea of the affected functions. only clusters containing more than 10 genes were considered, since this is the minimum number of elements required by the enrichment tool clusterprofiler. the results of the enrichment analyses revealed that most go terms were not shared between wt and ko homologous samples, as shown in figure 5. in order to further explore the reconstructed networks, the intersection of the ko and wt networks of a same organ was computed. this refers to the genes and relationships that are shared between both genotypes for a specific organ. additionally, the genes and relationships that were exclusively present in the wt and ko networks were also estimated, as shown in figure a6. the enrichment analyses over the nodes, separated using this criterion, would reveal the biological processes that differ between ly6e ∆hsc and wt mice. the results of such analyses are shown in figure a7. finally, the exploration of the nodes' degree distribution would reveal those genes that can be considered hubs. those nodes comprised within the top genes with highest degree (degree > q3 + 1.5 × iqr), also known as upper outliers of the degree distribution, were considered hubs. a representation of the nodes' degree distribution throughout the four reconstructed networks is shown in figure 6. these distributions are detailed in figure a8.
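a hedged sketch of the per-cluster go enrichment described above is shown below; glay clustering is only available through cytoscape, so igraph's louvain communities are used as a stand-in, and the network object and gene identifiers are assumptions.

```r
# Sketch of the per-cluster GO enrichment described above. GLay clustering is a
# Cytoscape app, so igraph's Louvain communities are used here as a stand-in;
# 'g' is an assumed igraph network whose vertex names are mouse gene symbols.
library(igraph)
library(clusterProfiler)
library(org.Mm.eg.db)

comm    <- cluster_louvain(g)
members <- split(V(g)$name, membership(comm))
members <- members[sapply(members, length) > 10]   # clusters with more than 10 genes

go_bp <- lapply(members, function(genes) {
  enrichGO(gene          = genes,
           OrgDb         = org.Mm.eg.db,
           keyType       = "SYMBOL",
           ont           = "BP",          # biological process ontology
           pAdjustMethod = "BH",
           pvalueCutoff  = 0.05)          # default universe ~ whole annotated genome
})
head(as.data.frame(go_bp[[1]]))
```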
the q3 + 1.5 × iqr rule provided four cutoff values for the degree, 24, 39, 21 and 21, respectively for the liver wt, liver ko, spleen wt and spleen ko networks. above these thresholds, nodes would be considered as hubs in each network. these hubs are shown in tables a2-a5. figure 5. enrichment analyses performed over the main clusters identified in the wt and ko networks of (a) liver and (b) spleen. gene ratio is defined as the number of input genes of the enrichment analysis associated with a particular go term divided by the total number of input genes. figure 6. boxplots representative of the degree distributions for each of the four reconstructed networks. identified hubs, according to the q3 + 1.5 × iqr criterion, are highlighted in red. the degree cutoffs, above which nodes would be considered as hubs, were 24, 39, 21 and 21, respectively for the liver wt, liver ko, spleen wt and spleen ko networks. note that degree is represented on a log scale, given that the reconstructed networks present a scale-free topology. in this work, four gene networks were reconstructed to model the genetic response to mhv infection in two tissues, liver and spleen, and in two different genetic backgrounds, wild type and ly6e ∆hsc. samples were initially explored in order to design an inference rationale. not only did the designed approach reveal major differences between the genetic programs in each organ, but also between different subgroups of samples, in a time-series-like manner. noticeably, disparities between wt and ly6e ∆hsc samples were observed in both tissues, and differential expression analyses revealed relevant differences in terms of the immune response generated. hereby, our results predict the impact of ly6e ko in hscs, which resulted in an impaired immune response compared to the wt situation. overall, the results indicate that the reconstruction rationale, elucidated from the exploratory findings, is suitable for the modeling of the viral progression. regarding the variance in gene expression in response to the virus, pca and k-medoid clustering revealed strong differences between samples corresponding to liver and spleen, respectively (figure 2a). these differences set the starting point for the modeling approach, in which samples corresponding to each organ were analyzed independently. this modus operandi is strongly supported by the tropism that viruses exhibit for certain tissues, which ultimately results in a differential viral incidence and load depending on the organ [82]. in particular, the liver is the target organ of mhv, identified as the main disease site [83]. on the other hand, the role of the spleen in innate and adaptive immunity against mhv has been widely addressed [84,85]. the organization of this organ allows blood filtration for the presentation of antigens to cognate lymphocytes by the antigen presenting cells (apcs), which mediate the immune response exerted by t and b cells [86]. as stated before, pca revealed differences between the three sample groups in each organ: control and mhv-infected at 3 and 5 d p.i. interestingly, between-group differences are especially clear for liver samples (figure 2b), whereas spleen samples are displayed in a continuum-like way. this becomes more evident in the organ-wise pca (figure 2), and was later confirmed by the exploration of the top 500 most variable genes and the differential expression analyses (figure a2).
furthermore, clear differences between wt and ly6e ∆hsc samples are observed in none of these analyses, although the examination of the differential expression and network reconstruction did exposed divergent immune responses for both genotypes. the differential expression analyses revealed the progressive genetic response to virus for both organs and genotypes (figures 3a and 4) . in a wt genetic background, mhv infection causes an overall rise in the expression level of certain genes, as most deg in cases vs. control samples are upregulated. however, in a ly6e ∆hsc genetic background, this upregulation is not as prominent as in a wt background, significantly reducing the number of upregulated genes (figure 3b) . besides, the number of deg in each comparison varies from wt to ly6e ∆hsc samples. attending at the deg in the performed comparisons, for both the wt and ko genotypes, liver cases at 3 d p.i. are more similar to liver cases at 5 d p.i. than to liver controls, since the number of deg between the first two measuring points is significantly lower than the number of deg between control and case samples at 3 d p.i. (figure 4a,b) . a different situation occurs in the spleen, where wt cases at 3 d p.i. are closer to control samples (figure 4c ), whereas ko cases at 3 d p.i. seem to be more related to cases at 5 d p.i. (figure 4d ). this was already suggested by hierarchical clustering in the analysis of the top 500 most variable genes, and could be indicative of a different progression of the infection impact on both organs, which could be modulated by gene ly6e, at least for the spleen samples. moreover, the results of the deg analyses indicate that the sole knockout of gene ly6e in hsc considerably affects the upregulating genetic program normally triggered by viral infection in wild type individuals (in both liver and spleen). interestingly, there are some genes in each organ and genotype that are differentially expressed in every comparison between the possible three sample types, controls, cases at 3 d p.i. and cases at 5 d p.i. these genes, which we termed highly deg, could be linked to the progression of the infection, as changes in their expression level occur with days post injection, according to the data. the rest of the deg, show an uprise or fall when comparing two sample types, which does not change significantly in the third sample type. alternatively, highly deg, shown in table a6 , exhibited three different expression patterns: (i) their expression level, initially low, rises from control to cases at 3 d p.i. and then rises again in cases at 5 d p.i. (ii) their expression level, initially high in control samples, falls at 3 d p.i. and falls even more at 5 d p.i cases. (iii) their expression level, initially low, rises from control to cases at 3 d p.i. but then falls at cases at 5 d p.i., when it is still higher than the initial expression level. these expression patterns, which are shown in figure a9 , might be used to keep track of the disease progression, differentiating early from late infection stages. in some cases, these genes exhibited inconsistent expression levels, specially at 5 d p.i. cases, which indicates the need for further experimental designs targeting these genes. highly deg could be correlated with the progression of the disease, as in regulation types (i) and (ii) or by contrast, be required exclusively at initial stages, as in regulation type (iii). notably, genes gm10800 and gm4756 are predicted genes which, to date, have been poorly described. 
according to the string database [79] , gm10800 is associated with gene lst1 (leukocyte-specific transcript 1 protein), which has a possible role in modulating immune responses. in fact, gm10800 is homologous to human gene piro (progranulin-induced-receptor-like gene during osteoclastogenesis), related to bone homeostasis [87, 88] . thus, we hypothesize that bone marrow-derived cell lines, including erythrocytes and leukocytes (immunity effectors), could also be regulated by gm10800. on the other hand, gm4756 is not associated to any other gene according to string. protein gm4756 is homologous to human protein dhrs7 (dehydrogenase/reductase sdr family member 7) isoform 1 precursor. nonetheless and to the best of our knowledge, these genes have not been previously related to ly6e, and could play a role in the immune processes mediated by this gene. finally, highly deg were not found exclusively present in wt nor ko networks, instead, these were common nodes of these networks for each organ. this suggests that highly deg might be of core relevance upon mhv infection, with a role in those processes independent on ly6e ∆hsc . besides, genes hykk, ifit3 and ifit3b; identified as highly deg throughout liver ly6e ∆hsc samples were also identified as hubs in the liver ko network. also gene saa3, highly deg across spleen ly6e ∆hsc samples was considered a hub in the spleen ko network. nevertheless, these highly deg require further experimental validation. the enrichment analyses of the identified clusters at each network revealed that most go terms are not shared between the two genotypes ( figure 5 ), despite the considerable amount of shared genes between the two genotypes for a same organ. the network reconstructed from liver wt samples reflects a strong response to viral infection, involving leukocyte migration or cytokine and interferon signaling among others. these processes, much related to immune processes, are not observed in its ko counterpart. the liver wt network presented four clusters ( figure a5a ). its cluster 1 regulates processes related to leukocyte migration, showing the implication of receptor ligand activity and cytokine signaling, which possibly mediates the migration of the involved cells. cluster 2 is related to interferon-gamma for the response to mhv, whereas cluster 3 is probably involved in the inflammatory response mediated by pro-inflammatory cytokines. last, cluster 4 is related to cell extravasation, or the leave of blood cells from blood vessels, with the participation of gene nipal1. the positive regulation observed across all clusters suggests the activation of these processes. overall, hub genes in this network have been related to the immune response to viral infection, as the innate immune response to the virus is the mediated by interferons. meanwhile, the liver ko network showed three main clusters ( figure a5b ). its cluster 1 would also be involved in defense response to virus, but other processes observed in the liver wt network, like leukocyte migration or cytokine activity, are not observed in this cluster nor the others. cluster 2 is then related to the catabolism of small molecules and cluster 3 is involved in acids biosynthesis. these processes are certainly ambiguous and do not correspond the immune response observed in the wt situation, which suggests a decrease in the immune response to mhv as a result of ly6e ablation in hsc. 
on the other hand, spleen wt samples revealed high nuclear activity potentially involving nucleosome remodeling complexes and changes in dna accessibility. histone modification is a type of epigenetic modulation which regulates gene expression. taking into account the central role of the spleen in the development of immune responses, the manifested relevance of chromatin organization could be accompanied by changes in the accessibility of certain dna regions with implications in the spleen-dependent immune response. this is supported by the reduced reaction capacity in the first days post-infection of ly6e ∆hsc samples compared to wt, as indicated by the number of deg between control and cases at 3 d p.i for these genotypes. the spleen wt network displayed three clusters ( figure a5c ). cluster 1, whose genes were all upregulated in ly6e ∆hsc samples at 5 d p.i. compared to mock infection, is mostly involved in nucleosome organization and chromatin remodelling, together with cluster 3. cluster 2 would also be related to dna packaging complexes, possibly in response to interferon, similarly to liver networks. instead, in spleen ko most genes take part in processes related to the extracellular matrix. in the spleen ko network, four clusters were identified ( figure a5d ). cluster 1 is related to the activation of an immune response, but also, alongside with clusters 2 and 4, to the extracellular matrix, possibly in relation with collagen, highlighting its role in the response to mhv. cluster 3 is implied in protease binding. the dramatic shut down in the ko network of the nuclear activity observed in the spleen wt network, leads to the hypothesis that the chromatin remodeling activity observed could be related to the activation of certain immunoenhancer genes, modulated by gene ly6e. in any case, further experimental validation of these results would provide meaningful insights in the face of potential therapeutic approaches (see appendix a for more details). the exploration of nodes memebership, depending on whether these exclusively belonged to wt or ko networks or, by contrast, were present in both networks, helped to understand the impairment caused by ly6e ∆hsc . in this sense, go enrichment analyses over these three defined categories of the nodes in the liver networks revealed that genes at their intersection are mainly related to cytokine production, leukocyte migration and inflammatory response regulation, in accordance to the phenotype described for mhv-infection [89] . however, a differential response to virus is observed in wt mice compared to ly6e-ablated. the nodes exclusively present at the wt liver network are related to processes like regulation of immune effector process, leukocyte mediated immunity or adaptive immune response. these processes, which are found at a relatively high gene ratio, are not represented by nodes exclusively present in the liver ko network. additionally, genes exclusively present at the wt network and the intersection network are upregulated in case samples with respect to controls ( figure a6a) , which suggests the activation of the previously mentioned biological processes. on the other hand, genes exclusively-present at the liver ko networks, mostly down-regulated, were found to be associated with catabolism. as for the spleen networks, genotype-wise go enrichment results revealed that the previously-mentioned intense nuclear activity involving protein-dna complexes and nucleosome assembly is mostly due to wt-exclusive genes. 
actually, these biological processes could be pinpointing cell replication events. analogously to the liver case, genes that were found exclusively present in the wt network and the intersection network are mostly upregulated, whereas in the case of ko-exclusive genes the upregulation is not that extensive. interestingly, the latter are mostly related to extracellular matrix (ecm) organization, which suggest the relevance of ly6e on these. other lymphocyte antigen-6 (ly-6) superfamily members have been related to ecm remodelling processes such as the urokinase receptor (upar), which participates in the proteolysis of ecm proteins [90] . however and to the best of our knowledge, the implications of ly6e in ecm have not been reported. the results presented are in the main consistent with those by pfaender et al. [54] , who observed a loss of genes associated with the type i ifn response, inflammation, antigen presentation, and b cells in infected ly6e ∆hsc mice. genes stat1 and ifit3, selected in their work for their high variation in absence of ly6e, were identified as hub genes in the networks reconstructed from liver wild type and knockout samples, respectively. it is to be noticed that our approach significantly differs to the one carried out in the original study. in this particular case, we consider that the reconstruction of gcn enables a more comprehensive analysis of the data, potentially finding the key genes involved in the immune response onset and their relationships with other genes. for instance, the transcriptomic differences between liver and spleen upon ly6e ablation become more evident using gcn. altogether, the presented results show the relevance of gene ly6e in the immune response against the infection caused by mhv. the disruption of ly6e significantly reduced the immunogenic response, affecting signaling and cell effectors. these results, combining in vivo and in silico approaches, deepen in our understanding of the immune response to viruses at the gene level, which could ultimately assist the development of new therapeutics. for example, basing on these results, prospective studies on ly6e agonist therapies could be inspired, with the purpose of enhancing the gene expression level via gene delivery. given the relevance of ly6e in sars-cov-2 according to previous studies [54, 91] , the overall effects of ly6e ablation in hscs upon sars-cov-2 infection, putting special interest in lung tissue, might show similarities with the deficient immune response observed in the present work. in this work we have presented an application of co-expression gene networks to analyze the global effects of ly6e ablation in the immune response to mhv coronavirus infection. to do so, the progression of the mhv infection on the genetic level was evaluated in two genetic backgrounds: wild type mice (wt, ly6efl/fl) and ly6e knockout mutants (ko, ly6e ∆hsc ) mice. for these, viral progression was assessed in two different organs, liver and spleen. the proposed reconstruction rationale revealed significant differences between mhv-infected wt and ly6e ∆hsc mice for both organs. in addition we observed that mhv infection triggers a progressive genetic response of upregulating nature in both liver and spleen. in addition, the results suggest that the ablation of gene ly6e at hsc caused an impaired genetic response in both organs compared to wt mice. the impact of such ablation is more evident in the liver, consistently with the disease site. 
at the same time, the immune response in the spleen, which seemed to be mediated by intense chromatin activity in the normal situation, is replaced by ecm remodelling in ly6e ∆hsc mice. we infer that the presence of ly6e limits the damage in the above-mentioned target sites. we believe that the characterization of these processes could motivate efforts towards novel antiviral approaches. finally, in the light of previous works, we hypothesize that ly6e ablation might show analogous detrimental effects on immunity upon infection caused by other viruses, including sars-cov, mers-cov and sars-cov-2. in future works, we plan to investigate whether the over-expression of ly6e in wt mice has an enhancing effect on immunity. in this direction, ly6e gene mimicking (agonist) therapies could represent a promising approach in the development of new antivirals. the authors declare no conflict of interest. 
[figure: heatmaps of the top 500 most variable genes across liver samples and across spleen samples (row z-scores), with samples annotated by organ, genotype and sample type.] 
table a1. number of deg used as input to engnet for network reconstruction and their subsequent distribution in the inferred networks. genes that were not assigned to a cluster (or that belonged to minority clusters) were not taken into consideration for enrichment analyses. 
                liver wt   liver ko   spleen wt   spleen ko 
input genes       1133       1153        506         426 
network genes     1118       1300        485         403 
cluster 1          262        284        180         109 
cluster 2          218        379        255         190 
cluster 3          579        624         36          77 
cluster 4           59          -          -           - 
figure a7. enrichment analyses based on node exclusiveness of (a) liver and (b) spleen networks. wt refers to nodes exclusively present in the networks reconstructed from wt samples; ko refers to nodes exclusively present in the networks reconstructed from ly6e ∆hsc samples; both refers to nodes shared between the wt and ko networks. gene ratio is defined as the number of genes used as input for the enrichment analyses associated with a particular go term divided by the total number of input genes. 
figure a9. cpm-normalized expression values of highly deg identified across (a) liver wt samples, (b) liver ko samples, (c) spleen wt samples and (d) spleen ko samples. dashed lines separate samples from the three groups under study: controls, cases at 3 d p.i. and cases at 5 d p.i. note that sample order within the same group is exchangeable. 
the reconstruction method employed in this case study was validated against three other well-known inference methods: aracne [93], wgcna [94] and wto [95]. the output of each reconstruction method, using default parameter values (including engnet), was compared to a gold standard (gs) retrieved from the string database. four different gss were taken into consideration, one for each of the deg sets identified in the comparison of control vs. case samples at 5 d p.i., as shown in section 4.2. these deg were mapped to string database gene identifiers, selecting mus musculus as the model organism (taxid: 10090).
a variable percentage of deg (6-20%) could not be assigned to a string identifier and were thus removed from the analysis. the interactions exclusively involving the remaining deg in each case were retrieved from the string database. these interaction networks served as gss. the remaining deg (i.e. those with a string identifier) also served as input for the four reconstruction methods to be compared. the aracne networks were inferred using the spearman correlation coefficient following the implementation in the minet [96] r package. in this case, mutual information values were normalized and scaled in the range 0-1. on the other hand, the wgcna networks were reconstructed following the original tutorial provided by the authors [97]. the power parameter was set to 5. additionally, the wto networks were built using pearson correlation in accordance with the documentation. absolute values were taken as relationship weights. finally, engnet networks were inferred using the default parameters described in the original article by gómez-vela et al. [33]. for the comparison, the receiver operating characteristic (roc) curve was estimated using the proc [98] r package. roc curves are shown in figure a10. the area under the roc curve (auc) was also computed in each case for the quantitative comparison of the methods, as shown in figure a11a. the auc compares the reconstruction quality of each method against random prediction. an auc ≈ 1 corresponds to a perfect classifier, whereas an auc ≈ 0.5 approximates a random classifier. thus, the higher the auc, the better the predictions. on average, engnet provided the best auc results, whilst maintaining a good discovery rate. in addition, engnet provided relatively sparse networks compared to wgcna, as shown in figure a11b. this is considered of relevance given that sparseness is a main feature of gene networks [7].
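the roc/auc comparison described above amounts to scoring every candidate gene pair with the weight assigned by a reconstruction method and labelling it as positive or negative according to the gold standard. the following is a minimal python sketch of that idea (the original comparison used the proc r package); the gene names, edge weights and gold-standard pairs are illustrative placeholders.

```python
from itertools import combinations
from sklearn.metrics import roc_auc_score

def auc_vs_gold_standard(genes, predicted_weights, gold_edges):
    """AUC of predicted edge weights against a gold-standard set of interactions."""
    y_true, y_score = [], []
    for a, b in combinations(sorted(genes), 2):
        pair = frozenset({a, b})
        y_true.append(1 if pair in gold_edges else 0)
        # pairs not reported by the method get a zero score
        y_score.append(predicted_weights.get(pair, 0.0))
    return roc_auc_score(y_true, y_score)

# toy example with placeholder genes, weights and gold-standard interactions
genes = ["Stat1", "Ifit3", "Irf7", "Col1a1"]
predicted = {frozenset({"Stat1", "Ifit3"}): 0.9,
             frozenset({"Stat1", "Irf7"}): 0.7,
             frozenset({"Col1a1", "Irf7"}): 0.2}
gold = {frozenset({"Stat1", "Ifit3"}), frozenset({"Irf7", "Ifit3"})}
print(auc_vs_gold_standard(genes, predicted, gold))
```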
hosts and sources of endemic human coronaviruses identification and characterization of severe acute respiratory syndrome coronavirus replicase proteins an orally bioavailable broad-spectrum antiviral inhibits sars-cov-2 in human airway epithelial cell cultures and multiple coronaviruses in mice a first course in systems biology computational methods for gene regulatory networks reconstruction and analysis: a review gene network coherence based on prior knowledge using direct and indirect relationships gene regulatory network inference: data integration in dynamic models-a review structure optimization for large gene networks based on greedy strategy comprehensive analysis of the long noncoding rna expression profile and construction of the lncrna-mrna co-expression network in colorectal cancer a new cytoscape app to rate gene networks biological coherence using gene-gene indirect relationships evaluation of gene association methods for coexpression network construction and biological knowledge discovery a comparative study of statistical methods used to identify dependencies between gene expression signals ranking genome-wide correlation measurements improves microarray and rna-seq based global and targeted co-expression networks wisdom of crowds for robust gene network inference comparison of co-expression measures: mutual information, correlation, and model based indices mider: network inference with mutual information distance and entropy reduction bioinformatics analysis and identification of potential genes related to pathogenesis of cervical intraepithelial neoplasia lsd1 activates a lethal prostate cancer gene network independently of its demethylase function diverse type 2 diabetes genetic risk factors functionally converge in a phenotype-focused gene network survivin (birc5) cell cycle computational network in human no-tumor hepatitis/cirrhosis and hepatocellular carcinoma transformation coexpression network analysis in chronic hepatitis b and c hepatic lesions reveals distinct patterns of disease progression to hepatocellular carcinoma reverse genetics approaches for the development of influenza vaccines how viral genetic variants and genotypes influence disease and treatment outcome of chronic hepatitis b. time for an individualised approach? 
accessory proteins 8b and 8ab of severe acute respiratory syndrome coronavirus suppress the interferon signaling pathway by mediating ubiquitindependent rapid degradation of interferon regulatory factor 3 interferon-stimulated genes: a complex web of host defenses distinct lymphocyte antigens 6 (ly6) family members ly6d, ly6e, ly6k and ly6h drive tumorigenesis and clinical outcome emerging role of ly6e in virus-host interactions identification of chicken lymphocyte antigen 6 complex, locus e (ly6e, alias sca2) as a putative marek's disease resistance gene via a virus-host protein interaction screen polymorphisms in ly6 genes in msq1 encoding susceptibility to mouse adenovirus type 1 interferon-inducible ly6e protein promotes hiv-1 infection ly6e mediates an evolutionarily conserved enhancement of virus infection by targeting a late entry step flavivirus internalization is regulated by a size-dependent endocytic pathway ensemble and greedy approach for the reconstruction of large gene co-expression networks identification of candidate mirna biomarkers for pancreatic ductal adenocarcinoma by weighted gene co-expression network analysis a comprehensive analysis on preservation patterns of gene co-expression networks during alzheimer's disease progression gene co-expression network analysis for identifying modules and functionally enriched pathways in type 1 diabetes gene co-expression analysis for functional classification and gene-disease predictions systems analysis reveals complex biological processes during virus infection fate decisions identifying novel biomarkers of the pediatric influenza infection by weighted co-expression network analysis comprehensive innate immune profiling of chikungunya virus infection in pediatric cases linking cell dynamics with gene coexpression networks to characterize key events in chronic virus infections discovering preservation pattern from co-expression modules in progression of hiv-1 disease: an eigengene based approach the effect of inhibition of pp1 and tnfα signaling on pathogenesis of sars coronavirus the regulatory role of microrna-mrna co-expression in hepatitis b virus-associated acute liver failure sars-cov-2 entry factors are highly expressed in nasal epithelial cells together with innate immune genes murine hepatitis virus strain 1 produces a clinically relevant model of severe acute respiratory syndrome in a/j mice the nucleocapsid proteins of mouse hepatitis virus and severe acute respiratory syndrome coronavirus share the same ifn-β antagonizing mechanism: attenuation of pact-mediated rig-i/mda5 activation murine hepatitis virus nsp14 exoribonuclease activity is required for resistance to innate immunity the interferon-stimulated gene ifitm3 restricts west nile virus infection and pathogenesis organization, evolution and functions of the human and mouse ly6/upar family genes interferon-stimulated gene ly6e enhances entry of diverse rna viruses chicken interferome: avian interferon-stimulated genes identified by microarray and rna-seq of primary chick embryo fibroblasts treated with a chicken type i interferon (ifn-α) integrative network biology framework elucidates molecular mechanisms of sars-cov-2 pathogenesis edger: a bioconductor package for differential expression analysis of digital gene expression data geoquery: a bridge between the gene expression omnibus (geo) and bioconductor evaluation of statistical methods for normalization and differential expression in mrna-seq experiments orchestrating high-throughput genomic analysis with 
bioconductor heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences systems approach identifies tga 1 and tga 4 transcription factors as important regulatory components of the nitrate response of a rabidopsis thaliana roots computational inference of gene co-expression networks for the identification of lung carcinoma biomarkers: an ensemble approach step-by-step construction of gene co-expression networks from high-throughput arabidopsis rna sequencing data limma powers differential expression analyses for rna-sequencing and microarray studies precision weights unlock linear model analysis tools for rna-seq read counts false discovery control with p-value weighting the igraph software package for complex network research cytoscape 2.8: new features for data integration and network visualization network biology using cytoscape from within r gene co-opening network deciphers gene functional relationships community structure analysis of biological networks a multi-algorithm clustering plugin for cytoscape topological analysis and interactive visualization of biological networks and protein structures selectivity determinants of gpcr-g-protein binding boxplot-based outlier detection for the location-scale family outlier detection: how to threshold outlier scores? gene ontology consortium: going forward clusterprofiler: an r package for comparing biological themes among gene clusters systematic and integrative analysis of large gene lists using david bioinformatics resources the string database in 2017: quality-controlled protein-protein association networks, made broadly accessible massive-scale gene co-expression network construction and robustness testing using random matrix theory uncovering biological network function via graphlet degree signatures. 
cancer inform viral pathogenesis structure-guided mutagenesis alters deubiquitinating activity and attenuates pathogenesis of a murine coronavirus crosstalk of liver immune cells and cell death mechanisms in different murine models of liver injury and its clinical relevance a disparate subset of double-negative t cells contributes to the outcome of murine fulminant viral hepatitis via effector molecule fibrinogen-like protein 2 structure and function of the immune system in the spleen progranulin and a five transmembrane domain-containing receptor-like gene are the key components in receptor activator of nuclear factor κb (rank)-dependent formation of multinucleated osteoclasts rank is essential for osteoclast and lymph node development autologous intramuscular transplantation of engineered satellite cells induces exosome-mediated systemic expression of fukutin-related protein and rescues disease phenotype in a murine model of limb-girdle muscular dystrophy type 2i the intriguing role of soluble urokinase receptor in inflammatory diseases ly6e restricts the entry of human coronaviruses, including the currently pandemic sars-cov-2 aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context wgcna: an r package for weighted correlation network analysis wto: an r package for computing weighted topological overlap and a consensus network with integrated visualization tool ar/bioconductor package for inferring large transcriptional networks using mutual information a general framework for weighted gene co-expression network analysis proc: an open-source package for r and s+ to analyze and compare roc curves this article is an open access article distributed under the terms and conditions of the creative commons attribution (cc by) license key: cord-230294-bjy2ixcj authors: stella, massimo; restocchi, valerio; deyne, simon de title: #lockdown: network-enhanced emotional profiling at the times of covid-19 date: 2020-05-09 journal: nan doi: nan sha: doc_id: 230294 cord_uid: bjy2ixcj the covid-19 pandemic forced countries all over the world to take unprecedented measures like nationwide lockdowns. to adequately understand the emotional and social repercussions, a large-scale reconstruction of how people perceived these unexpected events is necessary but currently missing. we address this gap through social media by introducing mercurial (multi-layer co-occurrence networks for emotional profiling), a framework which exploits linguistic networks of words and hashtags to reconstruct social discourse describing real-world events. we use mercurial to analyse 101,767 tweets from italy, the first country to react to the covid-19 threat with a nationwide lockdown. the data were collected between 11th and 17th march, immediately after the announcement of the italian lockdown and the who declaring covid-19 a pandemic. our analysis provides unique insights into the psychological burden of this crisis, focussing on: (i) the italian official campaign for self-quarantine (#iorestoacasa}), (ii) national lockdown (#italylockdown), and (iii) social denounce (#sciacalli). our exploration unveils evidence for the emergence of complex emotional profiles, where anger and fear (towards political debates and socio-economic repercussions) coexisted with trust, solidarity, and hope (related to the institutions and local communities). we discuss our findings in relation to mental well-being issues and coping mechanisms, like instigation to violence, grieving, and solidarity. 
we argue that our framework represents an innovative thermometer of emotional status, a powerful tool for policy makers to quickly gauge feelings in massive audiences and devise appropriate responses based on cognitive data. the stunningly quick spread of the covid-19 pandemic catalysed the attention of worldwide audiences, overwhelming individuals with a deluge of often contrasting content about the severity of the disease, the uncertainty of its transmission mechanisms, and the asperity of the measures taken by most countries to fight it [1, 2, 3, 4]. although these policies have been seen as necessary, they had a tremendous impact on the mental well-being of large populations [5] for a number of reasons. due to lockdowns, many are facing financial uncertainty, having lost or being on the verge of losing their source of income. moreover, there is much concern about the disease itself, and most people fear for their own health and that of their loved ones [6], further fueled by infodemics [2, 3, 1]. finally, additional distress is caused by the inability to maintain a normal life [7]. the extent of the impact of these factors is such that, in countries greatly struck by covid-19 such as china, the population started to develop symptoms of post-traumatic stress disorder [8]. during this time more than ever, people have shared their emotions on social media. these platforms provide an excellent emotional thermometer of the population, and have been widely explored in previous studies investigating how online social dynamics promote or hamper content diffusion [2, 9, 1, 10, 11] and the adoption of specific positive/negative attitudes and behaviours [9, 12, 13]. building on the above evidence, our goal is to draw a comprehensive quantitative picture of people's emotional profiles emerging during the covid-19 crisis, through a cognitive analysis of online social discourse. we achieve this by introducing mercurial (multi-layer co-occurrence networks for emotional profiling), a framework that combines cognitive network science [14, 15, 16] with computational social sciences [9, 12, 2, 17, 18]. before outlining the methods and main contributions of our approach, we briefly review existing research on understanding emotions in social media. much of the research on emotions in social media has been consolidated into two themes. on the one hand, there is the data science approach, which mostly focused on large-scale positive/negative sentiment detection [9] and recently identified the relevance of tracing more complex affect patterns for understanding social dynamics [19, 20, 21, 16]. on the other hand, cognitive science research makes use of small-scale analysis tools, but explores the observed phenomena in much more detail in the light of its theoretical foundations [22, 23, 24]. specifically, in cognitive science the massive spread of semantic and emotional information through verbal communication represents two long-studied phenomena, known as cognitive contagion [24] and emotional contagion [24, 25, 23], respectively. this research suggests that ideas are composed of a cognitive component and an emotional content, much like viruses containing the genomic information necessary for their replication [1]. both these types of contagion happen when an individual is affected in their behaviour by an idea. emotions elicited by ideas can influence users' behaviour without their awareness, resulting in the emergence of specific behavioural patterns such as implicit biases [23].
unlike pathogen transmission, no direct contact is necessary for cognitive and emotional contagion to take place, since both are driven by information processing and diffusion, as happens through social media [26, 27]. in particular, during large-scale events, ripples of emotions can rapidly spread across information systems [27] and have dramatic effects, as has recently been demonstrated in elections and social movements [28, 12, 25]. at the intersection of data- and cognitive science is emotional profiling, a set of techniques which enables the reconstruction of how concepts are emotionally perceived and assembled in user-generated content [17, 18, 9, 19, 15, 29]. emotional profiling conveys information about basic affective dimensions such as how positive/negative or how arousing a message is, and also includes the analysis of more fine-grained emotions such as fear or trust that might be associated with the lockdown and people's hopes for the future [9, 30, 22]. recently, an important emerging line of research has shown that reconstructing the knowledge embedded in messages through social and information network models [31, 14, 32] successfully highlights important phenomena in a number of contexts, ranging from the diffusion of hate speech during massive voting events [12] to reconstructing personality traits from social media [17]. importantly, to reconstruct knowledge embedded in tweets, recent work has successfully merged data science and cognitive science, introducing linguistic networks of co-occurrence relationships between words in sentences [16, 33, 21] and between hashtags in tweets [12]. however, an important shortfall of these works is that these two types of networked knowledge representations were not merged together, thus missing out on the important information revealed by studying their interdependence. we identify three important contributions that distinguish our paper from previous literature, and make a further step towards consolidating cognitive network science [14] as a paradigm suitable to analyse people's emotions. first, we introduce a new framework exploiting the interdependence between hashtags and words, addressing the gap previously discussed. this framework, multi-layer co-occurrence networks for emotional profiling (mercurial), combines both the semantic structure encoded through the co-occurrence of hashtags and the textual message to construct a multi-layer lexical network [34]. this multi-layer network structure allows us to contextualise hashtags and, therefore, improve the analysis of their meaning. importantly, these networks can be used to identify which concepts or words contribute to different emotions and how central they are. second, in contrast to previous work, which largely revolved around english tweets [4, 20], the current study focusses on italian twitter messages. there are several reasons why the emotional response of italians is particularly interesting. specifically, i) italy was the first western country to experience a vast number of covid-19 clusters; ii) the italian government was the first to declare a national lockdown; iii) the italian lockdown was announced on 10th march, one day before the world health organization (who) declared the pandemic status of covid-19. this enables us to address the urgent need to measure the emotional perceptions of and reactions to social distancing, lockdown, and, more generally, the covid-19 pandemic.
third, thanks to mercurial, we obtain richer and more complex emotional profiles that we analyse through the lens of established psychological theories of emotion. this is a fundamental step in going beyond positive/neutral/negative sentiment and in providing accurate insights on the mental well-being of a population. to this end, we take into account three of the most trending hashtags, #iorestoacasa (english: "i stay at home"), #sciacalli (english: "jackals"), and #italylockdown, as representative of positive, negative, and neutral social discourse, respectively. we use these hashtags as a starting point to build multi-layer networks of word and hashtag co-occurrence, from which we derive our profiles. our results depict a complex map of emotions, suggesting that there is co-existence and polarisation of conflicting emotional states, most importantly fear and trust towards the lockdown and social distancing. the combination of these emotions, further explored through semantic network analysis, indicates mournful submission and acceptance towards the lockdown, perceived as a measure for preventing contagion but with negative implications for the economy. as further evidence of the complexity of the emotional response to the crisis, we also find strong signals of hope and social bonding, mainly in relation to social flash mobs, interpreted here as psychological responses to deal with the distress caused by the threat of the pandemic. the paper is organised as follows. in the methods section we describe the data we used to perform our analysis, and describe mercurial in detail. in the results section we present the emotional profiles obtained from our data, which are then discussed in more detail in the discussion section. finally, the last section highlights the psychological implications of our exploratory investigation and its potential for follow-up monitoring of covid-19 perceptions in synergy with other datasets/approaches. we argue that our findings represent an important first step towards monitoring both mental well-being and emotional responses in real time, offering policy-makers a framework to make timely data-informed decisions. in this section we describe the methodology employed to collect our data and perform the emotional profiling analysis. first, we describe the dataset and how it was retrieved. then, we introduce co-occurrence networks, and specifically our novel method that combines hashtag co-occurrence with word co-occurrence on multi-layer networks. finally, we describe the cognitive science framework we used to perform the emotional profiling analysis on the so-obtained networks. we gathered 101,767 tweets in italian to monitor how online users perceived the covid-19 pandemic and its repercussions in italy. these tweets were gathered by crawling messages containing three trending hashtags of relevance for the covid-19 outbreak in italy and expressing three different sentiment polarities: 
• #iorestoacasa (english: "i stay at home"), a positive-sentiment hashtag introduced by the italian government in order to promote a responsible attitude during the lockdown; 
• #sciacalli (english: "jackals"), a negative-sentiment hashtag used by online users to denounce unfair behaviour arising during the health emergency; 
• #italylockdown, a neutral-sentiment hashtag indicating the application of lockdown measures all over italy. 
we refer to #iorestoacasa, #sciacalli and #italylockdown as focal hashtags to distinguish them from other hashtags.
we collected the tweets through complex science consulting (@complexconsult), which was authorised by twitter, and used the serviceconnect crawler implemented in mathematica 11.3. the collection comprises 39,943 tweets for #iorestoacasa, 26,999 for #sciacalli and 34,825 for #italylockdown. retweets of the same text message were not considered. for each tweet, the language was detected. pictures, links, and non-italian content were discarded, and stopwords (i.e. words without intrinsic meaning, such as "di" (english: "of") and "ma" (english: "but")) were removed. other interesting datasets with tweets about covid-19 are available in [35, 20]. word co-occurrence networks have been successfully used to characterise a wide variety of phenomena related to language acquisition and processing [16, 36, 37]. recently, researchers have also used hashtags to investigate various aspects of social discourse. for instance, stella et al. [12] showed that hashtag co-occurrence networks were able to characterise important differences in the social discourses promoted by opposing social groups during the catalan referendum. in this work we introduce mercurial (multi-layer co-occurrence networks for emotional profiling), a framework combining: 
• hashtag co-occurrence networks (or hashtag networks) [12]. nodes represent hashtags and links indicate the co-occurrence of any two nodes in the same tweet. 
• word co-occurrence networks (or word networks) [16]. nodes represent words and links represent the co-occurrence of any two words one after the other in a tweet from which stop-words have been removed. 
we combine these two types of networks in a multi-layer network to exploit the interdependence between hashtags and words. this new, resulting network enables us to contextualise hashtags and capture their real meaning through context, thereby enhancing the accuracy of the emerging emotional profile. to build the multi-layer network, we first build the single hashtag and word layers. for the sake of simplicity, word networks are unweighted and undirected. note that the hashtag network was kept at a distinct level from word networks, e.g. common words were not explicitly linked with hashtags. as reported in figure 1, each co-occurrence link between any two hashtags a and b (#coronavirus and #restiamoacasa in the figure) is relative to a word network, including all words co-occurring in all tweets featuring hashtags a and b. the hashtag and word networks capture the co-occurrence of lexical entities within the structured online social discourse. words possess meaning in language [31] and their network assembly is evidently a linguistic network. similar to words in natural language, hashtags possess linguistic features that express a specific meaning and convey rich affect patterns [12]. the resulting networks capture the meaning of a collection of tweets by identifying which words/hashtags co-occurred together. this knowledge embedded in hashtag networks was used to identify the most relevant or central terms associated with a given collection of thematic tweets. rather than using frequency to indicate centrality, which makes it difficult to compare hashtags that do not co-occur in the same message, the current work relies on distance-based measures to detect how central a hashtag is in the network; a minimal sketch of the hashtag-layer construction is given below.
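as an illustration of the hashtag layer described above, the following minimal python sketch links every pair of hashtags co-occurring in the same tweet; the example tweets are invented and this is not the mathematica pipeline actually used in the study.

```python
import re
from itertools import combinations
import networkx as nx

def hashtag_network(tweets):
    """Link every pair of hashtags co-occurring in the same tweet."""
    g = nx.Graph()
    for text in tweets:
        tags = sorted(set(re.findall(r"#\w+", text.lower())))
        g.add_nodes_from(tags)
        g.add_edges_from(combinations(tags, 2))
    return g

# invented example tweets
tweets = ["#iorestoacasa #andratuttobene insieme ce la faremo",
          "#italylockdown #coronavirus misure necessarie",
          "#iorestoacasa #coronavirus restiamo a casa"]
g = hashtag_network(tweets)
print(g.number_of_nodes(), g.number_of_edges())
```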
the first measure that implements this notion is closeness centrality. closeness c(i) quantifies how close node i is, on average, to all the other nodes it is connected to, and is formalised as follows: $c(i) = 1/\sum_{j \neq i} d_{ij}$, where $d_{ij}$ is the network distance between i and j, i.e. the smallest number of links connecting nodes i and j. in co-occurrence networks, nodes (i.e. hashtags and words) with a higher closeness tend to co-occur more often with each other or with other relevant nodes at short network distance. we expect that rankings of closeness centrality will reveal the most central hashtags in the networks for #iorestoacasa, #sciacalli and #italylockdown, in line with previous work in which closeness centrality was used to measure language acquisition and processing [39, 14, 32]. importantly, closeness is a more comprehensive approach compared to simpler frequency analysis. 
figure 1: top: example of co-occurrence networks for different hashtags: #distantimauniti (english: distant but united) in #iorestoacasa on the left, #incapaci (english: inept) in #sciacalli in the middle, and #futuro (english: future) in #italylockdown on the right. clusters of co-occurring hashtags were obtained through spectral clustering [38]. these clusters highlight the co-occurrence of european-focused content, featuring hashtags like #bce (i.e. european central bank), #lagarde and #spread (i.e. the spread between italian and german bonds) together with social distancing practices related to #iorestoacasa. bottom: in mercurial, any link in a co-occurrence network of hashtags (left) corresponds to a collection of tweets whose words co-occur according to a word network (right). larger words have a higher closeness centrality. 
imagine a collection of hashtags a, b, c, d, .... computing the frequency of hashtag a co-occurring with hashtag b is informative about the frequency of the so-called 2-grams "ab" or "ba", but it does not consider how those hashtags co-occur with c, d, etc. in other words, a 2-gram captures the co-occurrence of two specific hashtags within tweets but does not provide the simultaneous structure of co-occurrences of all hashtags across tweets, for which a network of pairwise co-occurrences is required. on such a network, closeness can then highlight hashtags at short distance from all others, i.e. co-occurring in a number of contexts in the featured discourse. in addition to closeness, we also use graph distance entropy to measure centrality. this centrality measure captures which hashtags are uniformly closer to all other hashtags in a connected network. combining closeness with graph distance entropy led to successfully identifying words of relevance in conceptual networks with a few hundred nodes [32]. the main idea behind graph distance entropy is that it provides information about the spread of the distribution of network distances between nodes (i.e. shortest paths), a statistical quantity that cannot be extracted from closeness (which is, conversely, a mean inverse distance). considering the set $d^{(i)} \equiv (d_{i1}, ..., d_{ij}, ..., d_{iN})$ of distances between i and any other node j connected to it ($1 \leq j \leq N$) and $m_i = \max(d^{(i)})$, graph distance entropy is defined as $h(i) = -\frac{1}{\log m_i}\sum_{k=1}^{m_i} p_k \log p_k$, where $p_k$ is the probability of finding a distance equal to k. therefore, h(i) is a (normalised) shannon entropy of distances and it ranges between 0 and 1. in general, the lower the entropy, the more a node resembles a star centre [34] and is at equal distances from all other nodes. thus, nodes with a lower h(i) and a higher closeness are more uniformly close to all other connected nodes in a network.
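both measures reduce to simple operations on the distribution of shortest-path distances from a node. the snippet below is a minimal python sketch, assuming an undirected networkx graph like the one built above; the un-normalised form of closeness and the log(m_i) rescaling of the entropy follow the formulas reconstructed in the text, so treat it as an illustration rather than the authors' exact implementation.

```python
import math
from collections import Counter
import networkx as nx

def closeness_and_entropy(g: nx.Graph, node):
    """Closeness (inverse summed distance) and graph distance entropy of one node."""
    dists = nx.single_source_shortest_path_length(g, node)
    dists.pop(node)                        # drop the zero distance to itself
    values = list(dists.values())
    closeness = 1.0 / sum(values)
    counts = Counter(values)
    n, m_i = len(values), max(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    entropy /= math.log(m_i) if m_i > 1 else 1.0   # rescale to the range [0, 1]
    return closeness, entropy

# toy usage on a star graph: the centre has the highest closeness and zero entropy
g = nx.star_graph(4)                       # node 0 is the centre
for node in g.nodes:
    print(node, closeness_and_entropy(g, node))
```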
words with simultaneously low graph distance entropy and high closeness were found to be prominent words for early word learning [34] and mindset characterisation [32]. in addition to hashtag networks, we also build word networks obtained from a collection of tweets containing any combination of the focal hashtags #iorestoacasa or #sciacalli and #coronavirus. for all tweets containing a given set of hashtags, we performed the following steps (a minimal code sketch of steps 1-3 is given below): 
1. subdivide the tweet into sentences and delete all stop-words from each sentence, preserving the original ordering of words; 
2. stem all the remaining words, i.e. identify the root or stem composing a given word. in a language such as italian, in which there are many ways of adding suffixes to words, word stemming is essential in order to recognise the same word even when it is inflected for gender, number or verb tense. for instance, abbandoneremo (we will abandon) and abbandono (abandon, abandonment) both represent the same stem abband; 
3. draw links between each stemmed word and the subsequent one, and store the resulting edge list of word co-occurrences; 
4. sentences containing a negation (i.e. "not") underwent an additional step parsing their syntactic structure. this was done in order to identify the target of the negation (e.g. in "this is not peace", the negation refers to "peace"). syntactic dependencies were not used for network construction but were instead used for emotional profiling (see below). 
the resulting word network also captures syntactic dependencies between words [16] related by online users to a specific hashtag or combination of hashtags. we used closeness centrality to detect the relevance of words for a given hashtag. text pre-processing such as word stemming and syntactic parsing was performed using mathematica 11.3, which was also used to extract networks and compute network metrics. the presence of hashtags in word networks provided a way of linking words, which express common language, with hashtags, which express content but also summarise the topic of a tweet. consequently, by using this new approach, the meaning attributed by users to hashtags can be inferred not only from hashtag co-occurrence but also from word networks. an example of mercurial, featuring hashtag-hashtag and word-word co-occurrences, is reported in figure 1 (bottom). in this example, hashtags #coronavirus and #restiamoacasa co-occurred together (left) in tweets featuring many co-occurring words (right). the resulting word network shows relevant concepts such as "incoraggiamenti" (english: encouragement) and "problemi" (english: problems), highlighting a positive attitude towards facing problems related to the pandemic. more generally, the attribution and reconstruction of such meaning was explored by considering conceptual relevance and emotional profiling in one or several word networks related to a given region of a hashtag co-occurrence network. as a first data source for emotional profiling, this work also used valence and arousal data from warriner and colleagues [40], whose combination can reconstruct emotional states according to the well-studied circumplex model of affect [41, 30]. in psycholinguistics, word valence expresses how positively/negatively a concept is perceived (equivalent to sentiment in computer science). the second dimension, arousal, indicates the alertness or lethargy inspired by a concept.
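the word-layer construction (steps 1-3 above) can be sketched in a few lines of python; this is a simplified illustration, not the original mathematica pipeline: nltk's italian snowball stemmer stands in for the stemming step, the stop-word list is a tiny placeholder, and the negation parsing of step 4 is omitted.

```python
import re
import networkx as nx
from nltk.stem.snowball import SnowballStemmer

STOPWORDS = {"di", "ma", "e", "a", "la", "il", "dei", "non"}   # placeholder list
stemmer = SnowballStemmer("italian")

def word_network(tweets):
    """Link consecutive stemmed words (stop-words removed) within each tweet."""
    g = nx.Graph()
    for text in tweets:
        text = re.sub(r"#\w+", " ", text.lower())     # hashtags live on their own layer
        tokens = re.findall(r"[a-zàèéìíòóù]+", text)
        stems = [stemmer.stem(t) for t in tokens if t not in STOPWORDS]
        g.add_edges_from(zip(stems, stems[1:]))
    return g

# invented example tweets
tweets = ["Abbandoneremo la paura e i dubbi #iorestoacasa",
          "Incoraggiamenti e problemi di questi giorni #coronavirus"]
g = word_network(tweets)
print(sorted(g.edges()))
```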
having a high arousal and valence indicates excitement and joy, whereas a negative valence combined with a high arousal can result in anxiety and alarm [30] . finally, some studies also include dominance or potency as a measure of the degree of control experienced [40] . however, for reasons of conciseness, we focus on the two primary dimensions of affect: valence and arousal. going beyond the standard positive/negative/neutral sentiment intensity is of utmost importance for characterising the overall online perception of massive events [9] . beyond the primary affective dimension of sentiment, the affect associated with current events [12] can also be described in terms of arousal [42] and of basic emotions such as fear, disgust, anger, trust, joy, surprise, sadness, and anticipation. these emotions represent basic building blocks of many complex emotional states [23] , and they are all self-explanatory except for anticipation, which indicates a projection into future events [18] . whereas fear, disgust, and anger (trust and joy) elicit negative (positive) feedback, surprise, sadness and anticipation have been recently evaluated as neutral emotions, including both positive and negative feedback reactions to events in the external world [43] . to attribute emotions to individual words, we use the nrc lexicon [18] and the circumplex model [30] . these two approaches allow us to quantify the emotional profile of a set of words related to hashtags or combinations of hashtags. the nrc lexicon enlists words eliciting a given emotion. the circumplex model attributes valence and arousal scores to words, which in turn determine their closest emotional states. because datasets of similar size were not available for italian, the data from the nrc lexicon and the warriner norms were translated from english to italian using a forward consensus translation of google translate, microsoft bing and deepl translator, which was successfully used in previous investigations with italian [44] . although the valence of some concepts might change across languages [40] , word stemming related several scores to the same stem, e.g. scores for "studio" (english: "study") and "studiare" (english: "to study") were averaged together and the average attributed to the stem root "stud". in this way, even if non-systematic cross-language valence shifting introduced inaccuracy in the score for one word (e.g. "studiare"), averaging over other words relative to the same stem reduced the influence of such inaccuracy. no statistically significant difference (α = 0.05) was found between the emotional profiles of 200 italian tweets, including 896 different stems, and their automatic translations in english, holding for each dimension separately (z-scores < 1.96). then, we build emotional profiles by considering the distribution of words eliciting a given emotion/valence/arousal and associated to specific hashtags in tweets. assertive tweets with no negation were evaluated directly through a bag of words model, i.e. by directly considering the words composing them. tweets including negations underwent an additional intermediate step where words syntactically linked to the negation were substituted with their antonyms [45] and then evaluated. source-target syntactic dependencies were computed in mathematica 11.3 and all words targeted by a negation word (i.e. no, non and nessuno in italian) underwent the substitution with their antonyms. 
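once each tweet is reduced to a list of stems, lexicon-based emotional profiling amounts to counting, for every emotion, how many stems are flagged in the lexicon, and then comparing those counts against random samples of lexicon entries, anticipating the z-test described in the next paragraph. the sketch below assumes a hypothetical emotion_lexicon dictionary mapping stems to the emotions they elicit (standing in for the stemmed, translated nrc resource); it is an illustration, not the dataset or code used in the study.

```python
import random
from statistics import mean, stdev

EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust"]

# hypothetical stemmed lexicon: stem -> set of elicited emotions
emotion_lexicon = {"paur": {"fear"}, "fiduc": {"trust"},
                   "sper": {"trust", "joy", "anticipation"},
                   "rabb": {"anger"}, "denunc": {"anger", "fear"}, "cas": set()}

def emotion_counts(stems, lexicon):
    """Count how many stems elicit each emotion (emotional richness)."""
    return {e: sum(1 for s in stems if e in lexicon.get(s, set())) for e in EMOTIONS}

def emotion_z_scores(stems, lexicon, n_samples=1000, seed=0):
    """z-scores of the observed counts against equally sized random lexicon samples."""
    rng = random.Random(seed)
    in_lexicon = [s for s in stems if s in lexicon]
    observed = emotion_counts(in_lexicon, lexicon)
    null = {e: [] for e in EMOTIONS}
    for _ in range(n_samples):
        sample = rng.sample(list(lexicon), len(in_lexicon))
        counts = emotion_counts(sample, lexicon)
        for e in EMOTIONS:
            null[e].append(counts[e])
    return {e: (observed[e] - mean(null[e])) / (stdev(null[e]) or 1.0)
            for e in EMOTIONS}

print(emotion_z_scores(["paur", "denunc", "rabb", "fiduc"], emotion_lexicon))
```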
to determine whether the observed emotional intensity r(i) of a given emotion in a set s of words was compatible with random expectation, we perform a statistical test (z-test) using the nrc dataset. remember that emotional intensity here was measured in terms of richness, i.e. the count of words eliciting a given emotion in a given network. as a null model, we use random samples as follows: let us denote by m the number of words stemmed from s that are also in the nrc dataset. then, m words from the nrc lexicon are sampled uniformly at random and their emotional profile is compared against that of the empirical sample. we repeated this random sampling 1000 times for each empirical observed emotional profile $\{r(i)\}_i$. to ensure the resulting profiles are indeed compatible with a gaussian distribution, we performed a kolmogorov-smirnov test (α = 0.05). all the tests we performed gave random distributions of emotional intensities compatible with a gaussian distribution, characterised by a mean random intensity for emotion i, $r^*(i)$, and a standard deviation $\sigma^*(i)$. for each emotion, a z-score was computed: $z(i) = (r(i) - r^*(i))/\sigma^*(i)$. in the remainder of the manuscript, every emotional profile incompatible with random expectation was highlighted in black or marked with a check. since we used a two-tailed z-test (with a significance level of 0.05), this means that an emotional richness can be either higher or lower than random expectation. the investigated corpus of tweets represents a complex multilevel system, where conceptual knowledge and emotional perceptions are entwined on a number of levels. tweets are made of text and include words, which convey meaning [31]. from the analysis of word networks, we can obtain information on the organisation of knowledge proper to social media users, which is embedded in their generated content [16]. however, tweets also convey meaning through the use of hashtags, which can either refer to specific words or point to the overall topic of the whole tweet. both words and hashtags can evoke emotions in different contexts, thus giving rise to complex patterns [17]. similar to words in natural language, the same hashtags can be perceived and used differently by different users, according to the context. the simultaneous presence of word- and hashtag-occurrences in tweets is representative of the knowledge shared by social media users when conveying specific content and ideas. this interconnected representation of knowledge can be exploited by simultaneously considering both hashtag-level and word-level information, since words specify the meaning attributed to hashtags. in this section we use mercurial to analyse the data collected. we do so by characterising the hashtag networks, both in terms of meaning and emotional profiles. precedence is given to hashtags as they not only convey meaning as individual linguistic units but also represent more general-level topics characterising the online discourse. then, we inter-relate hashtag networks with word networks. finally, we perform the emotional profiling of hashtags in specific contexts. the combination of word- and hashtag-networks specifies the perceptions embedded by online users around the same entities, e.g. coronavirus, in social discourses coming from different contexts. the largest connected components of the three hashtag networks included: 1000 hashtags and 8923 links for #italylockdown; 720 hashtags and 5915 links for #sciacalli; 6665 hashtags and 53395 links for #iorestoacasa.
all three networks are found to be highly clustered (mean local clustering coefficient [38] of 0.82) and with an average distance between any two hashtags of 2.1. only 126 hashtags were present in all three networks. table 1 reports the most central hashtags included in each corpus of tweets thematically revolving around #iorestoacasa, #sciacalli and #italylockdown. the ranking relies on closeness centrality, which here quantifies the tendency of hashtags to co-occur with other hashtags expressing analogous concepts and, therefore, to lie at short network distance from each other (see methods). hence, hashtags with a higher closeness centrality represent the prominent concepts in the social discourse. this result is similar to those showing that closeness centrality captures concepts which are relevant for early word acquisition [39] and production [46] in language. additional evidence that closeness can capture semantically central concepts is represented by the closeness ranking, which assigns top-ranked positions to #coronavirus and #covid-19 in all three twitter corpora. this is a consequence of the corpora being about the covid-19 outbreak (and of the network metric being able to capture semantic relevance). in the hashtag network built around #italylockdown, the most central hashtags are related to the coronavirus, including a mix of negative hashtags such as #pandemia (english: "pandemic") and positive ones such as #italystaystrong. similarly, the hashtag network built around #sciacalli highlighted both positive (#facciamorete, english: "let's network") and negative (#irresponsabili, english: "irresponsible") hashtags. however, the social discourse around #sciacalli also featured prominent hashtags from politics, including references to specific italian politicians, to the italian government, and hashtags expressing protest and shame towards the acts of a prominent italian politician. conversely, the social discourse around #iorestoacasa included many positive hashtags, eliciting hope for a better future and the need to act responsibly (e.g. #andratuttobene, english: "everything will be fine", or #restiamoacasa, english: "let's stay at home"). the most prominent hashtags in each network (cf. table 1) indicate the prevalence of a positive social discourse around #iorestoacasa and the percolation of strong political debate in relation to the negative topics conveyed by #sciacalli. however, we want to extend these hashtag-level observations of negative/positive valence to the overall global networks. to achieve this, we use emotional profiling. hashtags can be composed of individual or multiple words. by extracting individual words from the hashtags of a given network, it is possible to reconstruct the emotional profile of the social discourse around the focal hashtags #sciacalli, #italylockdown and #iorestoacasa. we tackle this by using the emotion-based [18] and the dimension-based [30] emotional profiles (see methods). the emotional profiles of hashtags featured in co-occurrence networks are reported in figure 2 (top). the top section of the figure represents perceived valence and arousal as a circumplex model of affect [30]. this 2d space, or disk, is called the emotional circumplex and its coordinates represent emotional states that are well-supported by empirical behavioural data and brain research [30].
as explained also in the figure caption, each word is endowed with an (x, y) table 1 : top-ranked hashtags in co-occurrence networks based on closeness centrality. higher ranked hashtags co-occurred with more topic-related concepts in the same tweet. in all three rankings, the most central hashtag was the one defining the topic (e.g. #italylockdown) and was omitted from the ranking. (-0.6,+0.6) represents a point of strong negative valence and positive arousal, i.e. alarm. figure 2 reports the emotional profiles of all hashtags featured in co-occurrence networks for #italylockdown (left), #sciacalli (middle) and #iorestoacasa (right). to represent the interquartile range of all words for which valence/arousal rating are available, we use a neutrality range. histograms falling outside of the neutrality range indicate specific emotional states expressed by words included within hashtags (e.g. #pandemia contains the word "pandemia" with negative valence and high arousal). in figure 2 (left, top), the peak of the emotional distribution for hashtags associated with #italylockdown falls within the neutrality range. this finding indicates that hashtags co-occurring with #italylockdown, a neutral hashtag by itself, were also mostly emotionally neutral conceptual entities. despite this main trend, the distribution also features deviations from the peak mostly in the areas of calmness and tranquillity (positive valence, lower arousal) and excitement (positive valence, higher arousal). weaker deviations (closer to the neutrality range) were present also in the area of anxiety. this reconstructed emotional profile indicates that the italian social discourse featuring #italylockdown was mostly calm and quiet, perceiving the lockdown as a positive measure for countering responsibly the covid-19 outbreak. not surprisingly, the social discourse around #sciacalli shows a less prominent positive emotional profile, with a higher probability of featuring hashtags eliciting anxiety, negative valence and increased states of arousal, as it can be seen in figure 2 (center, top) . this polarised emotional profile represents quantitative evidence for the coexistence of mildly positive and strongly negative content within the online discourse labelled by #sciacalli. this is further evidence that the negative hashtag #sciacalli was indeed used by italian users to denounce or raise alarm over the negative implications of the lockdown, especially in relation to politics and politicians' actions. however, the polarisation of political content and debate over social media platforms has been encountered in many other studies [21, 13, 12] and cannot be attributed to the covid-19 outbreak only. finally, figure 2 (top right) shows that positive perception was more prominently reflected in the emotional profile of #iorestoacasa, which was the hashtag massively promoted by the italian government for supporting the introduction of the nationwide lockdown in italy. the emotional figure 2 : emotional profiles of all hashtags featured in co-occurrence networks for #italylockdown (left), #sciacalli (middle) and #iorestoacasa (right). top: circumplex emotional profiling. all hashtags representing one or more words were considered. for each word, valence (x-coordinate) and arousal (y-coordinate) scores were attributed (see methods) resulting in a 2d density histogram (yellow overlay) relative to the probability of finding an hashtag in a given location in the circumplex, the higher the probability the stronger the colour. 
regions with the same probabilities are enclosed in grey lines. a neutrality range indicates where 50% of the words in the underlying valence/arousal dataset would fall and thus serves as a reference for detecting abnormal emotional profiles. distributions falling outside of this range indicate deviations from the median behaviour (i.e. the interquartile range, see methods). bottom: nrc-based emotional profiling, detecting how many hashtags inspired a given emotion in a hashtag network. results are normalised over the total number of hashtags in a network. emotions compatible with random expectation were highlighted in gray. 
the emotional profile of the 6000 hashtags co-occurring with #iorestoacasa indicates a considerably positive and calm perception of domestic confinement, seen as a positive tool to stay safe and healthy. the prominence of hopeful hashtags in association with #iorestoacasa, as reported in the previous subsection, indicates that many italian twitter users were serene and hopeful about staying at home at the start of the lockdown. hashtag networks were emotionally profiled not only by using the circumplex model (see above) but also by using basic emotional associations taken from the nrc emotion lexicon (figure 2, bottom). across all hashtag networks, we find a statistically significant peak in trust, analogous to the peaks close to the emotions of calmness and serenity observed in the circumplex models. however, all the hashtag networks also included negative emotions like anger and fear, which are natural human responses to unknown threats and were observed also in the circumplex representations. the intensity of fearful, alarming and angry emotions is stronger in the #sciacalli hashtag network, which was used by social users to denounce, complain and express alertness about the consequences of the lockdown. in addition to the politically-focused jargon highlighted by closeness centrality alone, by combining closeness with graph distance entropy (see methods and [32]) we identify other topics which are uniformly at short distance from all others in the social discourse around #sciacalli, such as: #mascherine (english: "protective masks", which was also ranked high by using closeness only), #amuchina (the most popular brand of, and a synonym for, hand sanitiser) and #supermercati (english: "supermarkets"); a small ranking sketch based on this combined criterion is given below. this result suggests an interesting interpretation of the negative emotions around #sciacalli. besides the inflamed political debate and the fear of the health emergency, in fact, a third element emerges: italian twitter users feared and were angry about the raiding and stockpiling of first-aid items, symptoms of panic-buying in the wake of the lockdown. the above comparisons indicate consistency between dimension-based (i.e. circumplex) and emotion-specific emotional profiling. since the latter also offers a more precise categorisation of words into emotions, we will focus on emotion-specific profiling. importantly, to fully understand the emotional profiles outlined above, it is necessary to identify the language expressed in tweets using a given combination of hashtags (see also figure 1, bottom). as the next step of the mercurial analysis, we gather all tweets featuring the focal hashtags #italylockdown, #sciacalli, or #iorestoacasa and any of their co-occurring hashtags, and build the corresponding word networks, as explained in the methods.
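the combined criterion used above (high closeness together with low graph distance entropy) can be expressed as a simple ranking. the snippet below is a self-contained python sketch on an invented toy hashtag graph, using networkx's built-in (normalised) closeness; the scores are purely illustrative and not those of the real #sciacalli network.

```python
import math
from collections import Counter
import networkx as nx

def distance_entropy(g: nx.Graph, node):
    """Normalised Shannon entropy of the shortest-path distances from `node`."""
    dists = nx.single_source_shortest_path_length(g, node)
    dists.pop(node)
    counts = Counter(dists.values())
    n, m_i = len(dists), max(dists.values())
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(m_i) if m_i > 1 else 0.0

# invented toy graph of co-occurring hashtags
g = nx.Graph([("#sciacalli", "#mascherine"), ("#sciacalli", "#amuchina"),
              ("#mascherine", "#amuchina"), ("#sciacalli", "#supermercati"),
              ("#supermercati", "#governo")])
closeness = nx.closeness_centrality(g)
# rank by high closeness first, using low distance entropy as a tie-breaker
ranking = sorted(g.nodes, key=lambda n: (-closeness[n], distance_entropy(g, n)))
for n in ranking:
    print(n, round(closeness[n], 2), round(distance_entropy(g, n), 2))
```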
closeness centrality over these networks provided the relevance of each single word in the social discourse around the topic identified by a hashtag. only words with closeness higher than the median were reported. figure 3 shows the cloud of words appearing in all tweets that include #sciacalli, displayed according to their nrc emotional profile. similar to the emotional profile extracted from hashtags co-occurring with #sciacalli, the words used in tweets with this hashtag also display a polarised emotional profile with high levels of fear and trust. thanks to the multi-layer analysis, this dichotomy can now be better understood in terms of the individual concepts eliciting it. by using closeness on word networks, we identified concepts such as "competente" (english: "competent"), "continua" (english: "continue", "keep going"), and "comitato" (english: "committee") to be relevant for the trust-sphere. these words convey trust in the expert committees appointed by the italian government to face the pandemic and protect the citizens. we find that other prominent words contributing to make the discourse around #sciacalli trustful are "aiutare" (english: "to help"), "serena" (english: "serene"), "rispetto" (english: "respect") and "verità" (english: "truth"), which further validate a trustful, open-minded and fair perception of the political and emergency debate outlined above. this perception was mixed with negative elements, mainly eliciting fear but also sadness and anger. the jargon of a political debate emerges in the word cloud of fear: "difficoltà" (english: "difficulty"), "criminale" (english: "criminal"), "dannati" (english: "scoundrel"), "crollare" (english: "to break down"), "banda" (english: "gang"), "panico" (english: "panic") and "caos" (english: "chaos"). these words indicate that twitter users felt fear directed at specific targets. a speculative explanation is that fear can be exorcised by finding a scapegoat and then targeting it with anger. the word cloud of this emotion supports the occurrence of such a phenomenon by featuring words like "denuncia" (english: "denouncement"), "colpevoli" (english: "guilty"), "vergogna" (english: "shame"), "combattere" (english: "to fight") and "colpa" (english: "blame"). the above words are also reflected in other emotions like sadness, which also features words like "cadere" (english: "to fall") and "miseria" (english: "misery", "out of grace"). these prominent words in the polarised emotional profile of #sciacalli suggest that twitter users feared criminal behaviour, possibly related to unwise political debates or improper stockpiling of supplies (as shown by the hashtag analysis). our findings also suggest that the reaction to such a fearful state, which also projects sadness about negative economic repercussions, was split into a strong, angry denunciation of criminal behaviour and messages of trust in the order promoted by competent organisations and committees. it is interesting to note that, according to ekman's theory of basic emotions [23], a combination of sadness and fear can be symptomatic of desperation, which is a critical emotional state for people in the midst of a pandemic-induced lockdown. the same analysis is reported in figure 4 for the social discourse of #italylockdown (top) and #iorestoacasa (bottom). in agreement with the circumplex profiling, for both #italylockdown and #iorestoacasa the intensity of fear is considerably lower than trust.
however, when investigated in conjunction with words, the overall emotional profile of #italylockdown appears to be more positive, displaying higher trust and joy and lower sadness than the emotional profile of #iorestoacasa. although the difference is small, this suggests that hashtags alone are not enough to fully characterise the perception of a conceptual unit and should always be analysed together with the natural language associated with them.
figure 3: emotional profile and word cloud of the language used in tweets with #sciacalli. words are organised according to the emotion they evoke. font size is larger for words of higher closeness centrality in the word co-occurrence network relative to the hashtag (see methods). every emotion level incompatible with random expectation is highlighted with a check mark.
the trust around #italylockdown comes from concepts like "consigli" (english: "tips", "advice"), "compagna" (english: "companion", "partner"), "chiara" (english: "clear"), "abbracci" (english: "hugs") and "canta" (english: "sing"). these words and the positive emotions they elicit suggest that italian users reacted to the early stages of the lockdown with a pervasive sense of commonality and companionship, reacting to the pandemic with externalisations of positive outlooks for the future, e.g. by playing music on the balconies. interestingly, this positive perception co-existed with a more complex and nuanced one. despite the overall positive reaction, in fact, the discourse on #italylockdown also shows fear for the difficult times in facing the contagion ("contagi") and the lockdown restrictions ("restrizioni"), and also anger, identifying the current situation as a fierce battle ("battaglia") against the virus. the analysis of anticipation, the emotional state projecting desires and beliefs into the future, shows the emergence of concepts such as "speranza" (english: "hope"), "possibile" (english: "possible") and "domani" (english: "tomorrow"), suggesting a hopeful attitude towards a better future. the social discourse around #iorestoacasa brought to light a similar emotional profile, with a slightly higher fear towards being quarantined at home: "quarantena" (english: "quarantine"), "comando" (english: "command", "order") and "emergenza" (english: "emergency"). both surprise and sadness were elicited by the word "confinamento" (english: "confinement"), which was prominently featured in the network structure arising from the tweets we analysed. in summary, the above emotional profiles of hashtags and words from the 101,767 tweets suggest that italians reacted to the lockdown measure with: 1. a fearful denunciation of criminal acts with political nuances and sadness/desperation about negative economic repercussions (from #sciacalli); 2. positive and trustful externalisations of fraternity and affect, combined with hopeful attitudes towards a better future (from #italylockdown and #iorestoacasa); 3. a mournful concern about the psychological weight of being confined at home, inspiring sadness and disgust towards the health emergency (from #iorestoacasa).
figure 4: emotional profile and word cloud of the language used in tweets with #italylockdown (top) and #iorestoacasa (bottom). words are organised according to the emotion they evoke. font size is larger for words of higher closeness centrality in the word co-occurrence network relative to the hashtag (see methods). every emotion level incompatible with random expectation is highlighted with a check mark.
in the previous section we showed our findings on how italians perceived the early days of lockdown on social media. but what about their perception of the ultimate cause of this lockdown, covid-19? to better reconstruct the perception of #coronavirus, it is necessary to consider the different contexts where this hashtag occurs. figure 5 displays the reconstruction of the emotional profile of words used in tweets with #coronavirus and either #italylockdown, #sciacalli, or #iorestoacasa. our results suggest that the emotional profiles of the language used in these three categories of tweets are different. for example, when considering tweets including #sciacalli, which the previous analysis revealed to be influenced by political and social denunciations of criminal acts, #coronavirus is perceived with a more polarised fear/trust dichotomy. although #coronavirus was perceived as no more trustful than random expectation when co-occurring with #sciacalli (z-score: 1.69 < 1.96), it was perceived with significantly higher trust when appearing in tweets with #iorestoacasa (z-score: 3.05 > 1.96) and #italylockdown (z-score: 3.51 > 1.96). to reinforce this picture, the intensity of fear towards #coronavirus was statistically significantly lower than random expectation in the discourse of #iorestoacasa (z-score: -2.35 < -1.96) and #italylockdown (z-score: -3.01 < -1.96). this difference is prominently reflected in both the circumplex model (figure 5, right) and the nrc emotional profile (figure 5, left), although in the latter both emotional intensities are compatible with random expectation. these quantitative comparisons provide data-driven evidence that twitter users perceived the same conceptual entity, i.e. covid-19, with higher trust when associating it with concrete means for hampering pathogen diffusion, like lockdown and house confinement, and with higher fear when denouncing the politics and economics behind the pandemic. however, social distancing, lockdown and house confinement clearly do not have only positive sides. rather, as suggested by our analysis, they bear complex emotional profiles, where sadness, anger and fear towards the current situation and future developments have been prominently expressed by italians on social media. this study delved into the massive information flow of italian social media users in reaction to the declaration of the pandemic status of covid-19 by the who and the announcement of the nationwide lockdown by the italian government in the first half of march 2020. we explored the emotional profiles of italians during this period by analysing the social discourse around the official lockdown hashtag promoted by the italian government (#iorestoacasa), together with one of the most trending hashtags of social protest (#sciacalli) and a general hashtag about the lockdown (#italylockdown). the fundamental premise of this work is that social media opens a window on the minds of millions of people [17]. monitoring social discourse on online platforms provides unprecedented opportunities for understanding how different categories of people react to real world events [9, 12, 15].
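the z-score comparisons reported above (with the usual ±1.96 thresholds) can be illustrated with one plausible null model: compare the observed count of emotion-eliciting words against counts from random samples of the same size drawn from the lexicon. this is a hedged sketch rather than the paper's exact procedure, which is specified in its methods; the lexicon argument is assumed to map words to sets of emotions as in the earlier sketch.

import random
import statistics

def emotion_zscore(observed_words, lexicon, emotion, n_samples=1000, seed=0):
    """z-score of the observed emotion count against same-size random samples
    drawn from the lexicon vocabulary (sample size must not exceed the vocabulary)."""
    rng = random.Random(seed)
    def count(words):
        return sum(1 for w in words if emotion in lexicon.get(w, ()))
    vocabulary = list(lexicon)
    null_counts = [count(rng.sample(vocabulary, len(observed_words)))
                   for _ in range(n_samples)]
    mu, sigma = statistics.mean(null_counts), statistics.stdev(null_counts)
    return (count(observed_words) - mu) / sigma if sigma else 0.0

# |z| > 1.96 marks an emotion expressed significantly more (or less) than expected at random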
here we introduced a new framework, multi-layer co-occurrence networks for emotional profiling (mercurial), which is based on cognitive network science and allowed us to: (i) quantitatively structure social discourse as a multi-layer network of hashtag-hashtag and word-word co-occurrences in tweets; (ii) identify prominent discourse topics through network metrics backed up by cognitive interpretation [14]; and (iii) reconstruct and cross-validate the emotional profile attributed to each hashtag or topic of conversation through the emotion lexicon and the circumplex model of affect from social psychology and cognitive neuroscience [30]. our interdisciplinary framework provides a first step in combining network and cognitive science principles to quantify sentiment for specific topics. our analysis also included extensive robustness checks (e.g. selecting words based on different centrality measures, statistical testing for emotions), further highlighting the potential of the framework. the analysis of concept network centrality identified hashtags of political denunciation and protest against irrational panic buying (e.g. of face masks and hand sanitiser) around #sciacalli but not in the hashtag networks for #italylockdown and #iorestoacasa. our results also suggest that the social discourse around #sciacalli was further characterised by fear, anger, and trust, whose emotional intensity was significantly stronger than random expectation. we also found that the most prominent concepts eliciting these emotions revolve around social denunciation (anger), concern for the collective well-being (fear), and the measures implemented by expert committees and authorities (trust). this interpretation is also supported by plutchik's wheel of emotions [22], according to which combinations of anger, disgust and anticipation can be symptoms of aggressiveness and contempt. however, within plutchik's wheel, trust and fear are not in direct opposition. the polarisation of positive/negative emotions observed around #sciacalli might be a direct consequence of a polarisation of different social users with heterogeneous beliefs, which is a phenomenon present in many social systems [21] but is also strongly present in social media through the creation of echo chambers enforcing specific narratives and discouraging the discussion of opposing views [13, 2, 47, 11, 10]. emotional polarisation might therefore be a symptom of a severe lack of social consensus across italian users in the early stages of the lockdown induced by covid-19. in social psychology, social consensus is a self-built perception that the beliefs, feelings, and actions of others are analogous to one's own [48]. destabilising this perception can have detrimental effects such as reducing social commitment towards the public good, or it can even lead to a distorted perception of society, favouring self-distrust and even conditions such as social anxiety [48]. instead, acts such as singing together from the balconies can reduce fear and enhance self-trust [42], as well as promote commitment and social bonding [49], which is also an evolutionary response that helps in coping with a threat, in this case a pandemic, through social consensus. when interpreted under the lens of social psychology, the flash mobs documented by traditional media and identified here as relevant by semantic network analysis for #italylockdown and #iorestoacasa become important means of facing the distress induced by confinement [48, 42, 49].
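one of the network metrics used alongside closeness in the analyses above is graph distance entropy. a hedged sketch of one way to compute it is given below: the shannon entropy of a node's shortest-path-distance distribution, normalised to [0, 1], so that a peaked distribution (most other nodes at the same short distance) yields a low value. the exact definition used in the paper follows its methods and the cited reference, which may differ in detail.

import math
from collections import Counter
import networkx as nx

def distance_entropy(g, node):
    """Normalised Shannon entropy of the shortest-path distances from `node`."""
    lengths = nx.single_source_shortest_path_length(g, node)
    dist_counts = Counter(d for other, d in lengths.items() if other != node)
    total = sum(dist_counts.values())
    probs = [c / total for c in dist_counts.values()]
    h = -sum(p * math.log(p) for p in probs)
    distinct = len(dist_counts)
    return h / math.log(distinct) if distinct > 1 else 0.0

g = nx.path_graph(6)  # toy connected graph
print({n: round(distance_entropy(g, n), 2) for n in g})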
anger and fear permeated not only #sciacalli but were also found, to a lesser extent, in association with other hashtags such as #iorestoacasa or #italylockdown. recent studies (cf. [50]) found that anger and fear can drastically reduce an individual's sense of agency, the subjective experience of being in control of one's own actions, linking this behavioural/emotional pattern also to alterations in brain states. in turn, a reduced sense of agency can lead to a loss of control, potentially resulting in violent, irrational acts [50]. consequently, the strong signals of anger and fear detected here represent red flags for a building tension manifested by social media users, which might contribute to the outbreak of violent acts or end up in serious psychological distress due to lowered self-control. one of the most direct implications of the detected strong signals of fear, anger and sadness is increased violent behaviour. in cognitive psychology, the general aggression model (gam) [51] is a well-studied model for predicting and understanding violent behaviour as the outcome of a variety of factors, including personality, situational context and the personal internal state of emotion and knowledge. according to gam, feeling emotions of anger in a situation of confinement can strongly promote violent behaviour. in italy, the emotions of anger and anxiety we detected through social media are well reflected in the dramatic rise in reported cases of domestic violence. for instance, the anti-violence centres of d.i.re (donne in rete contro la violenza) reported an anomalous increase of +74.5% in the number of women looking for help for domestic violence in march 2020 in italy. hence, monitoring social media can be insightful about potential tensions mediated and discussed by large populations, a topic in need of further research and with prominent practical repercussions for fighting covid-19. as discussed, we found the hashtag #coronavirus to be central across all considered hashtag networks. however, our analysis outlined different emotional nuances of #coronavirus across different networks. in psycholinguistics, contextual valence shifting [52] is a well-known phenomenon whereby the very same conceptual unit can be perceived wildly differently by people according to its context. this phenomenon suggests the importance of considering words in a contextual manner, by comparison to each other, as was done in this study, rather than in isolation. indeed, contexts can change the meaning and emotional perception of many words in language. we showed here that the same connotation shifting phenomenon [52] can also happen for hashtags. online users perceived #coronavirus with stronger intensities of trust and lower fear (than random expectation) when using that hashtag in the context of #iorestoacasa and #italylockdown, but not when it was associated with #sciacalli. this shifting underlines the importance of considering the contextual information surrounding a hashtag in order to better interpret its nuanced perception. to this aim, cognitive networks represent a powerful tool, providing quantitative metrics (such as graph distance entropy) that would otherwise not be applicable with mainstream frequency-based approaches in psycholinguistics. mercurial facilitates a quantitative characterisation of the emotions attributed to hashtags and discourses. nonetheless, it is important to bear in mind that the analysis we conducted relies on some assumptions and limitations.
for instance, following previous work [12], we built unweighted and undirected networks, neglecting information on how many times hashtags co-occurred. including these weights would be important for detecting communities of hashtags, beyond network centrality. notice that including weights would come at the cost of not being able to use graph distance entropy, which is defined over unweighted networks and was successfully used here for exposing the denunciation of panic buying in #sciacalli. another limitation relates to the emotional profiling performed with the nrc lexicon, in which the same word can elicit multiple emotions. since we measured emotional intensity by counting words eliciting a given emotion (plus the negations, see methods), a consequence was the repetition of the same words across the sectors of the above word clouds. building or exploiting additional data about the predominance of a word in a given emotion would enable us to identify words which are peripheral to a given emotion, reduce repetitions and offer even more detailed emotional profiles. recently, forma mentis networks [29, 32] have been introduced as a method to detect the organisation of positive/negative words in the mindsets of different individuals. a similar approach might be followed for emotions in future research. acting upon specific emotions rather than using the circumplex model would also solve another problem, in that the attribution of arousal to individual words is prone to more noise, even in mega-studies, compared to detecting word valence [53]. another limitation is that emotional profiles might fluctuate over time. the insightful results outlined and discussed here were aggregated over a short time window, thus reducing the impact of aggregation itself. future analyses on longer time windows should adopt time-series approaches for investigating emotional patterns, addressing key issues like non-stationary tweeting patterns over time and statistical scarcity due to tweet crawling (see also [12]). the current analysis has focused on aggregated tweets, but previous studies have shown both stable individual and intercultural differences in affect [54], especially for dimensions such as arousal. similarly, some emotions are harder to measure than others, which might affect reliability and thus underestimate their contribution. the current approach estimates emotional profiles on the basis of a large set of words, which will reduce some language-specific differences. the collection of currently missing large-scale italian normative datasets for lexical sentiment could further improve the accuracy of the findings. this study approaches the relation between emotions and mental distress mostly from the perspective that the attitudes and emotions of the author are conveyed in the linguistic content. however, the emotion profile might have implications for readers as well, as recent research suggests that even just reading words of strong valence/arousal can have deep somatic and visceral effects, e.g. raising heart rate or promoting involuntary muscle tension [55]. furthermore, authors and readers participate in an information network, and quantifying which tweets are liked or retweeted depending on the structure of the social network can provide further insight into their potential impact [12, 21, 10, 4, 56], which calls for future approaches merging social networks, cognitive networks and emotional profiling.
finally, understanding the impact of nuanced emotional appraisals would also benefit from investigating how these are related to behavioural and societal outcomes, including contagion figures (e.g. hospitalisations, death rate, etc.) and compliance with physical distancing [57]. given the massive attention devoted to the covid-19 pandemic by social media, monitoring online discourse can offer an insightful thermometer of how individuals discussed and perceived the pandemic and the subsequent lockdown. our mercurial framework offered quantitative readings of the emotional profiles among italian twitter users during the early covid-19 diffusion. the detected emotional signals of political and social denunciation, the trust in local authorities, the fear and anger towards the health and economic repercussions, and the positive initiatives of fraternity all outline a rich picture of emotional reactions from italians. importantly, the psychological interpretation of mercurial's results identified early signals of mental health distress and antisocial behaviour, both linked to violence and relevant for explaining increments in domestic abuse. future research will further explore and consolidate the behavioural implications of online cognitive and emotional profiles, relying on the promising significance of our current results. our cognitive network science approach offers decision-makers the prospect of being able to successfully detect global issues and design timely, data-informed policies. especially under a crisis, when time constraints and pressure prevent even the richest and most organised governments from fully understanding the implications of their choices, an ethical and accurate monitoring of online discourses and emotional profiles constitutes an incredibly powerful support for facing global threats. m.s. acknowledges daniele quercia, nicola perra and andrea baronchelli for stimulating discussion.
how to fight an infodemic
the covid-19 social media infodemic
assessing the risks of "infodemics" in response to covid-19 epidemics
covid-19 infodemic: more retweets for science-based information on coronavirus than for false information
immediate psychological responses and associated factors during the initial stage of the 2019 coronavirus disease (covid-19) epidemic among the general population in china
world health organization. mental health during covid-19 outbreak
the immediate mental health impacts of the covid-19 pandemic among people with or without quarantine managements
a longitudinal study on the mental health of general population during the covid-19 epidemic in china
quantifying the effect of sentiment on information diffusion in social media
phase transitions in information spreading on structured populations
beating the news using social media: the case study of american idol
bots increase exposure to negative and inflammatory content in online social systems
exposure to opposing views on social media can increase political polarization
cognitive network science: a review of research on cognition through the lens of network representations, processes, and dynamics
text-mining forma mentis networks reconstruct public perception of the stem gender gap in social media
probing the topological properties of complex networks modeling short written texts
our twitter profiles, our selves: predicting personality with twitter
emotions evoked by common words and phrases: using mechanical turk to create an emotion lexicon
semeval-2018 task 1: affect in tweets
measuring emotions in the covid-19 real world worry dataset
a complex network approach to political analysis: application to the brazilian chamber of deputies
the emotions. university press of america
the nature of emotion: fundamental questions
emotional contagion
the ripple effect: emotional contagion and its influence on group behavior
experimental evidence of massive-scale emotional contagion through social networks
the rippling dynamics of valenced messages in naturalistic youth chat
emotions and social movements: twenty years of theory and research
forma mentis networks quantify crucial differences in stem perception between students and experts
the circumplex model of affect: an integrative approach to affective neuroscience, cognitive development, and psychopathology
large-scale network representations of semantics in the mental lexicon
forma mentis networks map how nursing and engineering students enhance their mindsets about innovation and health during professional growth
from topic networks to distributed cognitive maps: zipfian topic universes in the area of volunteered geographic information
distance entropy cartography characterises centrality in complex networks
covid-19
labelled network subgraphs reveal stylistic subtleties in written texts
predicting lexical norms: a comparison between a word association model and text-based word co-occurrence models
modelling early word acquisition through multiplex lexical networks and machine learning
norms of valence, arousal, and dominance for 13,915 english lemmas
a circumplex model of affect
the effects of group singing on mood
affect regulation, mentalization and the development of the self
forma mentis networks reconstruct how italian high schoolers and international stem experts perceive teachers, students, scientists, and school
wordnet: an electronic lexical database
the multiplex structure of the mental lexicon influences picture naming in people with aphasia
recursive patterns in online echo chambers
on the perception of social consensus
the ice-breaker effect: singing mediates fast social bonding
i just lost it! fear and anger reduce the sense of agency: a study using intentional binding
the general aggression model: theoretical extensions to violence
contextual valence shifters
obtaining reliable human ratings of valence, arousal, and dominance for 20,000 english words
the relation between valence and arousal in subjective experience varies with personality and culture
somatic and visceral effects of word valence, arousal and concreteness in a continuum lexical space
retweeting for covid-19: consensus building, information sharing, dissent, and lockdown life
the ids of the tweets analysed in this study are available on the open science foundation repository: https://osf.io/jy5kz/.
key: cord-273941-gu6nnv9d authors: chandran, uma; mehendale, neelay; patil, saniya; chaguturu, rathnam; patwardhan, bhushan title: chapter 5 network pharmacology date: 2017-12-31 journal: innovative approaches in drug discovery doi: 10.1016/b978-0-12-801814-9.00005-2 sha: doc_id: 273941 cord_uid: gu6nnv9d abstract the one-drug/one-target/one-disease approach to drug discovery is presently facing many challenges of safety, efficacy, and sustainability. network biology and polypharmacology approaches gained appreciation recently as methods for omics data integration and multitarget drug development, respectively. the combination of these two approaches created a novel paradigm called network pharmacology (np) that looks at the effect of drugs on both the interactome and the diseasome level. ayurveda, the traditional system of indian medicine, uses intelligent formulations containing multiple ingredients and multiple bioactive compounds; however, the scientific rationale and mechanisms remain largely unexplored. np approaches can serve as a valuable tool for evidence-based ayurveda to understand the medicines' putative actions, indications, and mechanisms. this chapter discusses np and its potential to explore traditional medicine systems to overcome the drug discovery impasse. drug discovery, the process by which new candidate medications are discovered, initially began with the random searching of therapeutic agents from plants, animals, and naturally occurring minerals (burger, 1964). for this, early practitioners depended on the materia medica established by the medicine men and priests of that era. this was followed by the origin of classical pharmacology, in which the desirable therapeutic effects of small molecules were tested on intact cells or whole organisms. later, the advent of human genome sequencing revolutionized the drug discovery process, which developed into target-based drug discovery, also known as reverse pharmacology. this relies on the hypothesis that the modulation of the activity of a specific protein will have therapeutic effects. the protein that the drug binds to or interacts with is also referred to as a "target." in this reductionist approach, small molecules from a chemical library are screened for their effect on the target's known or predicted function (hacker et al., 2009). once a small molecule is selected for a particular target, further modifications are carried out at the atomic level to refine the lock-and-key interactions. this one-drug/one-target/one-therapeutic approach was followed for the last several decades. the information technology revolution at the end of the 20th century transformed the drug discovery process as well (clark and pickett, 2000).
advancements in omics technologies during this time were used to develop strategies for different phases of drug research (buriani et al., 2012). computational power was applied in the discovery process for predicting the drug-likeness of newly designed or discovered compounds and for ligand-protein docking to predict the binding affinity of a small molecule for a protein three-dimensional structure. in silico tools were developed to predict other pharmacological properties of drug molecules such as absorption, distribution, metabolism, excretion, and toxicity, abbreviated together as admet (van de waterbeemd and gifford, 2003; clark and grootenhuis, 2002). these technological advancements directed discovery efforts towards ever more specific magic bullets, which was completely against the holistic approach of traditional medicine. this magic bullet approach is currently in a decline phase. the major limitations of this drug discovery approach are side effects and the inability to tackle multifactorial diseases, mainly due to the linearity of the approach. during the peak historical period of drug discovery and development, natural product-based drugs played a significant role due to their superior chemical diversity and safety over synthetic compound libraries (zimmermann et al., 2007). currently, it is estimated that more than one hundred new natural product-based leads are in clinical development (harvey, 2008). many active compounds (bioactives) from traditional medicine sources could serve as good starting compounds and scaffolds for rational drug design. natural products normally act through the modulation of multiple targets rather than a single, highly specific target. but in drug discovery and development, technology was used to synthesize highly specific mono-targeted molecules that mimic the bioactives from natural compounds, rather than to understand the rationale behind their synergistic action and develop methods to isolate the bioactives from natural resources. researchers understand that most diseases are due to the dysfunction of multiple proteins. thus, it is important to address multiple targets emanating from a syndrome-related metabolic cascade, so that holistic management can be effectively achieved. it is therefore necessary to shift the strategy from one that focuses on a single-target new chemical entity to one of a multiple-target, synergistic, formulation-discovery approach. this tempted the research world to go back and extensively explore natural sources, where modern pharmacology had begun. this renewed research focus indicates the need to rediscover the drug discovery process by integrating traditional knowledge with state-of-the-art technologies (patwardhan, 2014a). a new discipline called network pharmacology (np) has emerged which attempts to understand drug actions and interactions with multiple targets (hopkins, 2007). it uses computational power to systematically catalogue the molecular interactions of a drug molecule in a living cell. np has appeared as an important tool for understanding the underlying complex relationships between a botanical formula and the whole body (berger and iyengar, 2009). it also attempts to discover new drug leads and targets and to repurpose existing drug molecules for different therapeutic conditions by allowing an unbiased investigation of potential target spaces (kibble et al., 2015).
however, these efforts require some guidance for selecting the right type of targets and new scaffolds of drug molecules. traditional knowledge can play a vital role in this process of formulation discovery and in repurposing existing drugs. by combining advances in systems biology and np, it might be possible to rationally design the next generation of promiscuous drugs (cho et al., 2012; hopkins, 2008; ellingson et al., 2014). np analysis not only opens up new therapeutic options, but it also aims to improve the safety and efficacy of existing medications. the postgenomic era witnessed a rapid development of computational biology techniques to analyze and explore existing biological data. the key aim of postgenomic biomedical research was to systematically catalogue all molecules and their interactions within a living cell. it is essential to understand how these molecules and the interactions among them determine the function of this immensely complex machinery, both in isolation and when surrounded by other cells. this led to the emergence and advancement of network biology, which indicates that cellular networks are governed by universal laws and offers a new conceptual framework that could potentially revolutionize our view of biology and disease pathologies in the 21st century (barabási and oltvai, 2004). during the first decade of the 21st century, several approaches for biological network construction were put forward that used computational methods, especially literature mining, to understand the relation between disease phenotypes and genotypes. as a consequence, lmma (literature mining and microarray analysis), a novel approach to reconstructing gene networks by combining literature mining and microarray analysis, was proposed (li et al., 2006; huang and li, 2010). with this approach, a global network is first derived using a literature-based co-occurrence method and then refined using microarray data. the lmma biological network approach enables researchers to keep themselves up to date with the relevant literature on specialized biological topics and to make sense of relevant large-scale microarray datasets. lmma also serves as a useful tool for constructing specific biological networks and for experimental design. lmma-like representations enable a systemic recognition of specific diseases in the context of complex gene interactions and are helpful for studying the regulation of various complex biological, physiological, and pathological systems. the significance of integrating the accumulated data was appreciated by pharmacologists, who began to look beyond the classic lock-and-key concept as a far more intricate picture of drug action became clear in the postgenomic era. the global mapping of pharmacological space uncovered promiscuity, the specific binding of a chemical to more than one target (paolini et al., 2006). just as there can be multiple keys for a single lock, a single key can fit into multiple locks: a ligand might interact with many targets, and a target may accommodate different types of ligands. this is referred to as "polypharmacology." the concept of network biology was used to integrate data from drugbank (re and valentini, 2013) and omim (hamosh et al., 2005), an online catalog of human genes and genetic disorders, to understand industry trends and the properties of drug targets, and to study how drug targets are related to disease-gene products.
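the two-step logic of the lmma approach described above (a literature-derived co-occurrence network subsequently refined by expression data) can be made concrete with a crude sketch; the gene sets below are hypothetical, and this is only a stand-in for the published algorithm.

import itertools
import networkx as nx

# hypothetical literature records: genes co-mentioned in the same abstract
abstracts_genes = [{"TNF", "IL6", "NFKB1"}, {"TNF", "PTGS2"}, {"IL6", "STAT3"}]
# hypothetical microarray result: genes called differentially expressed
differentially_expressed = {"TNF", "IL6", "PTGS2"}

# step 1: literature-based co-occurrence network
lit_net = nx.Graph()
for genes in abstracts_genes:
    lit_net.add_edges_from(itertools.combinations(sorted(genes), 2))

# step 2: keep only edges whose endpoints are both supported by the expression
# data (a rough stand-in for the lmma refinement step)
refined = lit_net.edge_subgraph(
    (u, v) for u, v in lit_net.edges()
    if u in differentially_expressed and v in differentially_expressed
).copy()
print(sorted(refined.edges()))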
when the first drug-target network was constructed from these data, isolated and bipartite nodes were expected based on the existing one-drug/one-target/one-disease approach. instead, the authors observed a rich network of polypharmacology interactions between drugs and their targets (yildirim et al., 2007). an overabundance of "follow-on" drugs, that is, drugs that target already targeted proteins, was observed. this suggested a need to upgrade the single-target, single-drug paradigm, as single-protein, single-function relations are too limited to accurately describe the reality of cellular processes. advances in systems biology led to the realization that complex diseases cannot be effectively treated by intervention at single proteins. this made drug researchers accept the concept of polypharmacology, which they had previously regarded as an undesirable property that needed to be removed or reduced in order to produce clean drugs acting on single targets. according to network biology, the simultaneous modulation of multiple targets is required for modifying phenotypes. developing methods to aid polypharmacology can help to improve efficacy and predict unwanted off-target effects. hopkins (2007, 2008) observed that network biology and polypharmacology can illuminate the understanding of drug action, and introduced the term "network pharmacology." this distinctive new approach to drug discovery can enable the paradigm shift from highly specific magic-bullet-based drug discovery to multitargeted drug discovery. np has the potential to provide new treatments for multigenic complex diseases and can lead to the development of e-therapeutics, where the ligand formulation can be customized for each complex indication under every disease type. this can be expanded in the future and lead to customized and personalized therapeutics. the integration of network biology and polypharmacology can tackle two major sources of attrition in drug development, namely efficacy and toxicity. this integration also holds the promise of expanding the current opportunity space for druggable targets. hopkins proposed np as the next paradigm in drug discovery. polypharmacology expands the space of the drug discovery approach. hopkins suggested three strategies to the designers of multitarget therapies: the first was to prescribe multiple individual medications as a multidrug combination cocktail, with patient compliance and the danger of drug-drug interactions being the expected drawbacks of this method. the second proposition was the development of multicomponent drug formulations; changes in the metabolism, bioavailability, and pharmacokinetics of the formulation, as well as safety, would be the major concerns of this approach. the third strategy was to design a single compound with selective polypharmacology. according to hopkins, the third method is advantageous, as it would ease dosing studies; also, the regulatory barriers for a single compound are fewer compared to those for a formulation. an excellent example of this is metformin, the first-line drug for type ii diabetes, which has been found to have cancer-inhibiting properties (leung et al., 2013). the following years witnessed applied research on np integrating network biology and polypharmacology. a computational framework, based on a regression model that integrates human protein-protein interactions, disease phenotype similarities, and known gene-phenotype associations to capture the complex relationships between phenotypes and genotypes, has been proposed.
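the drug-target networks and the notion of promiscuity discussed earlier in this section can be illustrated with a small sketch; the drug-target pairs below are illustrative placeholders rather than a curated extract from drugbank or omim.

import networkx as nx

# illustrative drug-target pairs (hypothetical extract, not curated data)
drug_target_pairs = [
    ("metformin", "PRKAA1"), ("metformin", "PRKAA2"),
    ("aspirin", "PTGS1"), ("aspirin", "PTGS2"),
    ("imatinib", "ABL1"), ("imatinib", "KIT"), ("imatinib", "PDGFRA"),
]

g = nx.Graph()
for drug, target in drug_target_pairs:
    g.add_node(drug, kind="drug")
    g.add_node(target, kind="target")
    g.add_edge(drug, target)

# promiscuity: number of targets per drug (degree of the drug nodes)
promiscuity = {n: g.degree(n) for n, data in g.nodes(data=True) if data["kind"] == "drug"}
print(promiscuity)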
this framework was based on the assumption that phenotypically similar diseases are caused by functionally related genes. a tool named cipher (correlating protein interaction network and phenotype network to predict disease genes) has been developed that predicts and prioritizes disease-causing genes (wu et al., 2008). cipher helps to uncover known disease genes and predict novel susceptibility candidates. another application of this study is to predict a human disease landscape, which can be exploited to study related genes for related phenotypes that cluster together in a molecular interaction network. this will facilitate the discovery of disease genes and help to analyze the cooperativity among genes. later, cipher-hit, a hitting-time-based method to measure global closeness between two nodes of a heterogeneous network, was developed (yao et al., 2011). a phenotype-genotype network can be explored using this method to detect the genes related to a particular phenotype. network-based gene clustering and extension were used to identify responsive gene modules in a condition-specific gene network, aiming to provide useful resources to understand physiological responses (gu et al., 2010). np was also used to develop mirna-based biomarkers (lu et al., 2011). for this, a network of mirnas and their targets was constructed and further refined to study the data for specific diseases. this process, integrated with literature mining, was useful for developing potent mirna markers for diseases. np was also used to develop a drug-gene-disease comodule (zhao and li, 2012). initially, a drug-disease network was constructed from information gathered from databases, followed by the integration of gene data. gene closeness was studied by developing a mathematical model. this network inferred the association of multiple genes with most of the diseases and the sharing of targets between drugs and diseases. these kinds of networks give insight into new drug-disease associations and their molecular connections. during this period of progress in network biology, natural products were gaining importance in the chemical space of drug discovery, as they have been economically designed and synthesized by nature over the course of evolution (wetzel et al., 2011). researchers began analyzing the logic behind traditional medicine systems and devised computational ways to ease the analysis. a comprehensive herbal medicine information system was developed that integrates information on more than 200 anticancer herbal recipes that have been used for the treatment of different types of cancer in the clinic, 900 individual ingredients, and 8500 small organic molecules isolated from herbal medicines (fang et al., 2005). this system, which was developed using an oracle database and internet technology, facilitates and promotes scientific research in herbal medicine. this was followed by the development of many databases that serve as sources of botanical information and as powerful tools that provide a bridge between traditional medicines and modern molecular biology. these kinds of databases and tools led researchers to conceive the idea of the np of botanicals and their formulations to understand the underlying mechanisms of traditional medicines. we refer to such networks as "ethnopharmacological networks" and to the technique as "network ethnopharmacology (nep)" (patwardhan and chandran, 2015).
shao li pioneered this endeavor and proposed such a network as a tool to explain zheng (the syndrome concept of traditional chinese medicine (tcm)) and the multiple-target mechanisms of tcm (li, 2007). li et al. tried to provide a molecular basis for the 1000-year-old concept of zheng using a neuro-endocrine-immune (nei) network. zheng is the basic unit and key concept in tcm theory, and it is also used as a guideline for disease classification in tcm. hot (re zheng in mandarin) and cold (han zheng) are the two statuses of zheng, which therapeutically direct the use of herbs in tcm: cooling herbs are used to remedy hot zheng and warming herbs are used to remedy cold zheng. according to the authors, hormones may be related to hot zheng, immune factors may be related to cold zheng, and they may be interconnected by neurotransmitters. this study provides a methodical approach to understanding tcm within the framework of modern science. later, they reconstructed the nei network by adding multilayer information, including data available in the kegg database related to signal transduction, metabolic pathways, protein-protein interactions, transcription factors, and microrna regulation. they also connected drugs and diseases through multilayered interactions. the study of cold zheng emphasized its relation to energy metabolism, which is tightly correlated with the genes of neurotransmitters, hormones, and cytokines in the nei interaction network (ma et al., 2010). another database, tcmgenedit, provides information about tcms, genes, diseases, tcm effects, and tcm ingredients mined from a vast amount of biomedical literature. this facilitates clinical research and helps elucidate the possible therapeutic mechanisms of tcms and gene regulation (fang et al., 2008). to study the combination rules of tcm formulae, an herb network was created using 3865 collaterals-related formulae. the authors developed a distance-based mutual information model (dmim) to uncover the combination rules. dmim uses mutual information entropy and "between-herb distance" to measure the tendency of two herbs to form an herb pair. they experimentally evaluated the combination of a few herbs for angiogenesis. understanding the combination rules of herbs in formulae will help the modernization of traditional medicine and also help in developing new formulae based on current requirements. a network-target-based paradigm was proposed for the first time to understand synergistic combinations, and an algorithm termed "nims" (network-target-based identification of multicomponent synergy) was also developed. this was a step that facilitated the development of multicomponent therapeutics using traditional wisdom. an innovative way to study the molecular mechanisms of tcm was proposed during this time by integrating tcm experimental data with microarray gene expression data (wen et al., 2011). as a demonstrative example, the si-wu-tang formula was studied. rather than uncovering the molecular mechanism of action, this method would help to identify new health benefits of tcms. the initial years of the second decade of the 21st century witnessed the network ethnopharmacological exploration of tcm formulations. the scope of this new area attracted scientists, who hoped nep could provide insight into multicompound drug discovery that could help overcome the current impasse in drug discovery (patwardhan, 2014b). nep was used to study the antiinflammatory mechanism of qingfei xiaoyan, a tcm formulation.
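the dmim idea of scoring the tendency of two herbs to appear together can be approximated with a much simpler statistic, pointwise mutual information over a set of formulae; the sketch below uses hypothetical herbs and is only a stand-in for the published dmim score, which combines a mutual-information measure with a between-herb distance.

import itertools
import math
from collections import Counter

# hypothetical formulae, each reduced to its set of herbs
formulae = [
    {"ginseng", "licorice", "ginger"},
    {"ginseng", "licorice"},
    {"licorice", "ginger", "rhubarb"},
    {"ginseng", "rhubarb"},
]

n = len(formulae)
herb_counts = Counter(h for f in formulae for h in f)
pair_counts = Counter(p for f in formulae for p in itertools.combinations(sorted(f), 2))

def pmi(pair):
    """Pointwise mutual information: co-occurrence tendency of an herb pair."""
    a, b = pair
    return math.log2((pair_counts[pair] / n) / ((herb_counts[a] / n) * (herb_counts[b] / n)))

for pair in sorted(pair_counts):
    print(pair, round(pmi(pair), 2))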
for qingfei xiaoyan, the predicted results were used to design experiments and analyze the data. experimental confirmation of the predicted results provides an effective strategy for the study of traditional medicines. the potential of tcm formulations as multicompound drug candidates has been studied using tcm-formulation-based np. tcm formulations studied in this way are listed in table 5.1. the construction of a database containing 197,201 natural product structures, followed by their docking to 332 target proteins of fda-approved drugs, shows the amount of chemical space shared between natural products and fda drugs (gu et al., 2013a). molecular docking plays a major role in np: the interaction of bioactives with molecular targets can be analyzed by this technique. molecular-docking-based nep can be a useful tool to computationally elucidate the combinatorial effects of traditional medicines in intervening in disease networks (gu et al., 2013c). an approach that combines np and pharmacokinetics has been proposed to study the material basis of tcm formulations (pei et al., 2013). this can be extrapolated to study other traditional medicine formulations as well. in cancer research, numerous natural products have been demonstrated to have anticancer potential. natural products are gaining attraction in anticancer research, as they show a favorable profile in terms of absorption and metabolism in the body with low toxicity. in one study, all of the known bioactives were docked to assess their ability to interact with 104 cancer targets (luo et al., 2014). it was inferred that many bioactives target multiple protein targets and thus are linked to many types of cancers.
table 5.1 (examples): ejiao slurry regulates cancer cell differentiation, growth, proliferation, and apoptosis, and shows an adjuvant therapeutic effect that enriches the blood and increases immunity (xu et al., 2014b). xiao-chaihu decoction (xchd) and da-chaihu decoction (dchd): xchd treats diseases accompanied by symptoms of alternating fever and chills, no desire for food or drink, and dry pharynx, while dchd treats those with symptoms of fullness, pain in the abdomen, and constipation. dragon's blood is used in colitis and acts through interaction with 26 putative targets (xu et al., 2014a).
np coupled to sophisticated spectroscopic analysis, such as ultra-performance liquid chromatography-electrospray ionization-tandem mass spectrometry (uplc-esi-ms/ms), is a useful approach to study the absolute molecular mechanism of action of botanical formulations based on their constituent bioactives (xu et al., 2014a). bioactive-target analysis has shown that some botanical formulations are more effective than the corresponding marketed drug-target interactions. this indicates the potential of np to better understand the power of botanical formulations and to develop efficient and economical treatment options. the holistic approach of botanical formulations can be better explained by np. a study has reported this property by exemplifying a tcm formulation against viral infectious disease. not only does the formulation target the proteins in the viral infection cycle, but it also regulates the proteins of the host defense system; thus, it acts in a very distinctive manner. this unique property of formulations is highly efficient for strengthening broad and nonspecific antipathogenic actions.
thus, network-based multitarget drugs can be developed by testing the efficacy of a formulation, identifying and isolating its major bioactives, and redeveloping a multicomponent therapeutic using the major bioactives based on synergism (leung et al., 2013). np also serves to document and analyze the clinical prescriptions of traditional medicine practitioners. a traditional medicine network that links bioactives to clinical symptoms through targets and diseases is a novel way to explore the basic principles of traditional medicines (luo et al., 2015). network-based approaches provide a systematic platform for the study of multicomponent traditional medicines and have applications for their beneficial modernization. this platform not only recovers traditional knowledge, but also provides new findings that can be used for resolving current problems in the drug industry. this section explains a handful of ethnopharmacological networks that were developed to understand the scientific rationale of traditional medicine. dragon's blood (db) tablets, which are made of resins from dracaena spp., daemonorops spp., croton spp., and pterocarpus spp., are an effective tcm for the treatment of colitis. in one study, an np-based approach was adopted to provide new insights relating to the active constituents and molecular mechanisms underlying the effects of db (xu et al., 2014a). the constituent chemicals of the formulation were identified using an ultra-performance liquid chromatography-electrospray ionization-tandem mass spectrometry method. the known targets of the 48 identified compounds were mined from the literature, and putative targets were predicted with the help of computational tools. the compounds were further screened for bioavailability, followed by the systematic analysis of the known and putative targets for colitis. the network evaluation revealed the mechanism of action of db bioactives in colitis through the modulation of proteins of the nod-like receptor signaling pathway (fig. 5.1). the antioxidant mechanism of zhi-zi-da-huang decoction as an approach to treat alcoholic liver disease was elucidated using np (an and feng, 2015). an endothelial cell proliferation assay was performed for an antiangiogenic alkaloid, sinomenine, to validate the network-target-based identification of multicomponent synergy (nims) predictions. the study was aimed at evaluating the synergistic relationship between different pairs of therapeutics, and sinomenine was found to have the maximum inhibition rate with matrine, both in the network and in in vitro studies. the discovery of bioactives and the elucidation of the mechanisms of action of the herbal formulae qing-luo-yin and the liu-wei-di-huang pill using np have given insight for designing validation experiments that accelerated the process of drug discovery. validation experiments based on the network findings regarding cold zheng and hot zheng in a rat model of collagen-induced arthritis showed that the cold zheng-oriented herbs tend to affect the hub nodes in the cold zheng network, and the hot zheng-oriented herbs tend to affect the hub nodes in the hot zheng network. np was also used to explain the addition and subtraction theory of tcm: two decoctions, xiao chaihu and da chaihu, were studied using an np approach to investigate this theory.
according to the addition and subtraction theory, the addition or removal of one or more ingredients from a traditional formulation results in a modified formula, which plays a vital role in individualized medicine. compounds from additive herbs were observed to be more efficient on disease-associated targets (fig. 5.2). these additive compounds were found to act on 93 diseases through 65 drug targets (li et al., 2014a). experimental verification of the antithrombotic network of fufang xueshuantong (fxst) capsule was done through in vivo studies on a lipopolysaccharide-induced disseminated intravascular coagulation (dic) rat model. it was successfully shown that fxst significantly improves the activation of the coagulation system through 41 targets from four herbs (sheng et al., 2014). np analysis of the bushenhuoxue formula showed that six components, including rhein, tanshinone iia, curcumin, quercetin and calycosin, acted through 62 targets for the treatment of chronic kidney disease. these predictions were validated using unilateral ureteral obstruction models, and it was observed that even though the individual botanicals showed a significant decrease in creatinine levels, the combination showed lower blood creatinine and urea nitrogen levels (shi et al., 2014). the antidiabetic effects of ge-gen-qin-lian decoction were investigated using an insulin secretion assay and an insulin-resistance model; 13 of the 19 ingredients identified through the np studies showed antidiabetic activity (li et al., 2014b). to confirm the predictions of the network of the liu-wei-di-huang pill, four proteins (pparg, rara, ccr2, and esr1) that denote different functions and are targeted by different groups of ingredients were chosen. the interactions between various bioactives and their effects on the expression of these proteins showed that the np approach can accurately predict such interactions, giving hints regarding the mechanisms of action of the compounds (liang et al., 2014). experimental results confirmed that the 30 core ingredients in modified simiaowan, obtained through network analysis, significantly increased huvec viability and attenuated the expression of icam-1, and proved to be effective in gout treatment (zhao et al., 2015). the role of anthraquinones and flavanols (catechin and epicatechin) in the therapeutic potential of rhubarb in renal interstitial fibrosis was examined using network analysis and conventional assessment involving serum biochemistry, histopathological, and immunohistochemical assays (xiang et al., 2015). in silico analysis and experimental validation demonstrated that compounds 11/12 of fructus schisandrae chinensis target gba3/shbg. np is a valuable method to study the synergistic effects of the bioactives of a traditional medicine formulation. this was experimentally shown for the sendeng-4 formulation for rheumatoid arthritis (fig. 5.3). data and network analysis have shown that the formulation acts synergistically through nine categories of targets (zi and yu, 2015). another network that studied three botanicals, salviae miltiorrhizae, ligusticum chuanxiong, and panax notoginseng, for coronary artery disease (cad) displayed their mode of action through 67 targets, of which 13 are common among the botanicals (fig. 5.4). these common targets are associated with thrombosis, dyslipidemia, vasoconstriction, and inflammation. this gives insight into how these botanicals manage cad.
another approach using np is the construction of networks based on experimental data followed by literature mining. this method is very effective for large-scale data analysis, which helps to derive the mechanism of action of a formulation. a network of the qishenyiqi formulation, which has cardioprotective effects, constructed from microarray data and the published literature, showed that 9 main compounds act through 16 pathways, of which 9 are immune- and inflammation-related (li et al., 2014c). the mechanism of action of the bushen zhuanggu formulation was proposed based on lc-ms/ms standardization, pharmacokinetic analysis, and np (pei et al., 2013). the efficacy of shenmai injection was evaluated using a rat model of myocardial infarction and a genome-wide transcriptomic experiment, followed by an np analysis. the overall trends in the ejection fraction and fractional shortening were consistent with the network-recovery index (nri) from the network. in order to develop an ethnopharmacological network, exploring the existing databases to gather information regarding bioactives and targets is the first step. further information, such as target-related diseases, tissue distribution and pathways, is also to be collected depending on the type of study that is going to be undertaken. the universal natural products database (unpd) (gu et al., 2013a) is one of the major databases that provide bioactive information. other databases that provide information regarding bioactives include cvdhd (gu et al., 2013b), tcmsp (ru et al., 2014), tcm@taiwan (sanderson, 2011), supernatural (banerjee et al., 2015), and dr. duke's phytochemical and ethnobotanical database (duke and beckstrom-sternberg, 1994). the molecular structures of bioactives are usually stored as "sd" files and chemical information as smiles and inchikeys in these databases. any of these file formats can be used as input to identify targets in protein information databases. the binding database, or "bindingdb" (liu et al., 2007), and chembl (bento et al., 2014) are databases for predicting target proteins. bindingdb searches for the exact or similar compounds in the database and retrieves the target information for those compounds. the similarity search returns structurally similar compounds, with the degree of similarity to the queried structure given as a score. information regarding both annotated and predicted targets can be collected in this way. this database is connected to numerous other databases, and these connections can be used to extract further information regarding the targets. the important databases linked to bindingdb are uniprot (bairoch et al., 2005), which gives information related to proteins and genes; reactome, a curated pathway database (croft et al., 2011); and the kyoto encyclopedia of genes and genomes (kegg), a knowledge base for the systematic analysis of gene functions and pathways (ogata et al., 1999). the therapeutic targets database (ttd) (zhu et al., 2012) gives fully referenced information on the diseases targeted by proteins, their pathway information, and the corresponding drugs directed at each target. disease and gene annotation (dga), a database that provides a comprehensive and integrative annotation of human genes in disease networks, is useful in identifying the disease type that each indication belongs to (peng et al., 2013).
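to make the data-gathering step concrete, the sketch below shows how structures given as smiles can be converted into an sd file and inchikeys with rdkit for querying resources such as those listed above; the compound list is a hypothetical example, and rdkit is simply one convenient open-source toolkit for this task, not one prescribed by the chapter.

from rdkit import Chem

# hypothetical bioactives of a formulation, given as smiles strings
bioactives = {
    "gallic acid": "OC(=O)c1cc(O)c(O)c(O)c1",
    "quercetin": "O=C1c2c(O)cc(O)cc2OC(c3ccc(O)c(O)c3)=C1O",
}

writer = Chem.SDWriter("bioactives.sdf")  # sd file for batch uploads to target-prediction services
for name, smiles in bioactives.items():
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        continue  # skip unparsable structures
    mol.SetProp("_Name", name)
    writer.write(mol)
    print(name, Chem.MolToInchiKey(mol))  # inchikey for exact-match database lookups
writer.close()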
the human protein atlas (hpa) database (pontén et al., 2011) is an open database showing the spatial distribution of proteins in 44 different normal human tissues. information on the distribution of proteins in tissues can be gathered from hpa, and the database also gives information regarding subcellular localization and protein class. an overall review of the methods to implement np for herbs and herbal formulations is also available, including a systematic review of the databases that one could use for the same purpose (kibble et al., 2015; lagunin et al., 2014). integration of knowledge bases helps data gathering for network pharmacological studies; the inter-relationships among these databases are shown in fig. 5.5. the counts of entities, such as bioactives, targets, and diseases, can vary based on the knowledge bases that are relied on for data collection, and an integration of knowledge bases can overcome this limitation. another factor that affects the counts of these entities is the time frame of data collection; this variation occurs due to the ongoing, periodic updates of the databases. a network is the schematic representation of the interactions among various entities called nodes. in pharmacological networks, the nodes include bioactives, targets, tissues, tissue types, diseases, disease types, and pathways. these nodes are connected by lines termed edges, which represent the relationships between them (morris et al., 2012). building a network involves two opposite approaches: a bottom-up approach on the basis of established biological knowledge and a top-down approach starting with the statistical analysis of available data. at a more detailed level, there are several ways to build and illustrate a biological network. perhaps the most versatile and general way is the de novo assembly of a network from direct experimental or computational interactions, e.g., chemical/gene/protein screens. networks encompassing biologically relevant nodes (genes, proteins, metabolites), their connections (biochemical and regulatory), and modules (pathways and functional units) give an authentic idea of the real biological phenomena (xu and qu, 2011). cytoscape, a java-based open source software platform (shannon et al., 2003), is a useful tool for visualizing molecular interaction networks and integrating them with any type of attribute data. in addition to the basic set of features for data integration, analysis, and visualization, additional features are available in the form of apps, including network and molecular profiling analysis and links with other databases. in addition to cytoscape, a number of visualization tools are available. visual network pharmacology (vnp), which is specially designed to visualize the complex relationships among diseases, targets, and drugs, mainly contains three functional modules: drug-centric, target-centric, and disease-centric vnp. this disease-target-drug database documents known connections among diseases, targets, and usfda-approved drugs. users can search the database using disease, target, or drug name strings; chemical structures and substructures; or protein sequence similarity, and then obtain an online interactive network view of the retrieved records. in the obtained network view, each node is a disease, target, or drug, and each edge is a known connection between two of them. the connectivity map, or cmap tool, allows the user to compare gene-expression profiles.
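a minimal sketch of assembling such a node-and-edge network programmatically is shown below (here with networkx, exporting to a format that cytoscape can import). the edge list is a small invented excerpt, not the actual triphala network.

```python
# build a bipartite bioactive-target network, rank hub bioactives by degree,
# and export it for visualization in cytoscape. edges are illustrative only.
import networkx as nx

edges = [
    ("quercetin", "bace1"), ("quercetin", "serpine1"), ("quercetin", "f2"),
    ("ellagic acid", "rad51"), ("ellagic acid", "f2"),
    ("kaempferol", "e6 (hpv16)"),
]

g = nx.Graph()
for bioactive, target in edges:
    g.add_node(bioactive, kind="bioactive")
    g.add_node(target, kind="target")
    g.add_edge(bioactive, target)

# degree of a bioactive node = number of distinct targets it modulates
hubs = sorted(
    (n for n, d in g.nodes(data=True) if d["kind"] == "bioactive"),
    key=g.degree, reverse=True,
)
for node in hubs:
    print(node, g.degree(node))

# graphml can be loaded directly by cytoscape for visual analysis
nx.write_graphml(g, "triphala_subnetwork.graphml")
```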
the similarities or differences between the signature transcriptional expression profile and the small molecule transcriptional response profile may lead to the discovery of the mode of action of the small molecule. the response profile is also compared to the response profiles of drugs in the cmap database with respect to the similarity of transcriptional responses. a network is constructed, and the drugs that appear closest to the small molecule are selected to gain better insight into the mode of action. other software, such as gephi, an exploration platform for networks and complex systems, and cell illustrator, a java-based tool specialized in biological processes and systems, can also be used for building networks. ayurveda, the indian traditional medicine, offers many sophisticated formulations that have been used for hundreds of years. the traditional knowledge digital library (tkdl, http://www.tkdl.res.in) contains more than 36,000 classical ayurveda formulations. approximately 100 of these are popularly used at the community level and also as over-the-counter products. some of these drugs continue to be used as home remedies for preventive and primary health care in india. until recently, no research was carried out to explore ayurvedic wisdom using np, despite ayurveda holding a rich knowledge of traditional medicine equal to or greater than tcm. our group examined the use of np to study ayurvedic formulations with the well-known ayurvedic formulation triphala as a demonstrable example (chandran et al., 2015a, b). in this chapter, we demonstrate the application of np in understanding and exploring this traditional wisdom with triphala as a model. triphala is one of the most popular and widely used ayurvedic formulations. triphala contains fruits of three myrobalans: emblica officinalis (eo; amalaki), also known as phyllanthus emblica; terminalia bellerica (tb; vibhitaka); and terminalia chebula (tc; haritaki). triphala is the drug of choice for the treatment of several diseases, especially metabolic, dental, and skin conditions, and for the treatment of cancer (baliga, 2010). it has a very good effect on the health of the heart, skin, and eyes, and helps to delay degenerative changes such as cataracts (gupta et al., 2010). triphala can be used as an inexpensive and nontoxic natural product for the prevention and treatment of diseases where vascular endothelial growth factor a-induced angiogenesis is involved. the presence of numerous polyphenolic compounds empowers it with a broad antimicrobial spectrum (sharma, 2015). triphala is a constituent of about 1500 ayurveda formulations and it can be used for several diseases. triphala combats degenerative and metabolic disorders possibly through lipid peroxide inhibition and free radical scavenging (sabu and kuttan, 2002). in a phase i clinical trial on healthy volunteers, immunostimulatory effects of triphala on cytotoxic t cells and natural killer cells have been reported (phetkate et al., 2012). triphala has been shown to induce apoptosis in tumor cells of the human pancreas, in both in vitro and in vivo models (shi et al., 2008). although the anticancer properties of triphala have been studied, the exact mechanism of action is still not known. the beneficial role of triphala in the disease management of proliferative vitreoretinopathy has also been reported (sivasankar et al., 2015). one of the key ingredients of triphala is amalaki.
some studies have already shown the beneficial effect of amalaki rasayana in suppressing neurodegeneration in fly models of huntington's and alzheimer's diseases (dwivedi et al., 2012, 2013). triphala is an effective medicine to balance all three doshas. it is considered a good rejuvenating rasayana, which facilitates nourishment of all tissues (dhatu). here we demonstrate the multidimensional properties of triphala using human proteome, diseasome, and microbial proteome targeting networks. the botanicals of triphala (eo, tb, and tc) contain 114, 25, and 63 bioactives, respectively, according to unpd data collected during june 2015. of these, a few bioactives are common among the three botanicals; thus, the triphala formulation as a whole contains 177 bioactives. out of these, 36 bioactives were score-1, based on a binding db search carried out during june 2015. eo, tb, and tc contain 20, 4, and 20 score-1 bioactives, respectively (fig. 5.6). the score-1 bioactives that are common among the three plants are chebulanin, ellagic acid, gallussaeure, 1,6-digalloyl-beta-d-glucopiranoside, methyl gallate, and tannic acid. this bioactive information is the basic step toward constructing human proteome and microbial proteome targeting networks. the 36 score-1 bioactives of triphala are shown to interact with 60 human protein targets in 112 combinations (fig. 5.7). quercetin, ellagic acid, 1,2,3,4,6-pentagalloylglucose, and 1,2,3,6-tetrakis-(o-galloyl)-beta-d-glucose are the four bioactives that interact with the largest numbers of targets (21, 16, and 7, respectively). the other major bioactives that have multitargeting properties include catechin, epicatechin, gallocatechin, kaempferol, and trans-3,3',4',5,7-pentahydroxylflavane. the major protein targets of triphala include alkaline phosphatase (alpl); carbonic anhydrase 7 (ca7); coagulation factor x (f10); dna repair protein rad51 homolog 1 (rad51); gstm1 protein (gstm1); beta-secretase 1 (bace1); plasminogen activator inhibitor 1 (serpine1); prothrombin (f2); regulators of g-protein signaling (rgs) 4, 7, and 8; tissue-type plasminogen activator (plat); and tyrosine-protein phosphatase nonreceptor type 2 (ptpn2). the 60 targets of triphala are associated with 24 disease types, which include 130 disease indications (fig. 5.8). the major disease types with which triphala targets are associated include cancers, cardiovascular diseases, nervous system diseases, and metabolic diseases. analysis of existing data indicates that targets of triphala bioactives are involved in 40 different types of cancers, making cancer the largest group of diseases involving triphala targets. this linkage is through the interaction of 25 bioactives and 27 target proteins in 46 different bioactive-target combinations. the types of cancers networked by triphala include pancreatic, prostate, breast, lung, colorectal, and gastric cancers, tumors, and more. quercetin, ellagic acid, prodelphinidin a1, and 1,2,3-benzenetriol are the important bioactives, and rad51, bace1, f2, mmp2, igf1r, and egfr are the important targets that play a role in cancer. triphala shows links to 18 indications of cardiovascular diseases through 12 bioactives and 11 targets. the cardiovascular diseases covered in the triphala network include atherosclerosis, myocardial ischemia, infarction, cerebral vasospasm, thrombosis, and hypertension.
the bioactives playing a major role in cardiovascular diseases are quercetin, 1,2,3,4,6-pentagalloylglucose, 1,2,3,6-tetrakis-(o-galloyl)-beta-d-glucose, bellericagenin a1, and prodelphinidin a1, whereas the targets playing an important role are serpine1, f10, f2, and fabp4. triphala's network to nervous system disorders contains 13 diseases, of which the significant ones are alzheimer's disease, parkinson's disease, diabetic neuropathy, and retinopathy. in this subnetwork, 14 bioactives interact with 11 targets through 21 different interactions. quercetin, 1,2,3,4,6-pentagalloylglucose, 1,2,3,6-tetrakis-(o-galloyl)-beta-d-glucose, and epigallocatechin-3-gallate are the most networked bioactives, whereas the most networked targets are bace1, serpine1, plat, aldr, and ca2. the association of triphala with metabolic disorders is determined by six bioactives that interact with seven targets. the major metabolic diseases in this link are obesity, diabetic complications, non-insulin-dependent diabetes, hypercholesterolemia, hyperlipidemia, and more. the bioactives having more interactions with targets are ellagic acid, quercetin, and bellericagenin a1, whereas the highly networked targets are igf1r, fabp5, aldr, and akr1b1. triphala bioactives are also linked to targets of other diseases comprising autoimmune diseases, ulcerative colitis, mccune-albright syndrome, psoriasis, gout, osteoarthritis, endometriosis, lung fibrosis, glomerulonephritis, and more. the proteome-targeting network of triphala thus shows its ability to synergistically modulate 60 targets that are associated with 130 disease indications. these data were generated from the available information, which covered only about one-fifth of the total number of bioactives. further logical analysis and experimental studies based on the network results are needed to explore the in-depth mechanism of action of triphala. for researchers in this area, these kinds of networks can give an immense amount of information that can be developed further to reveal the real mystery behind the actions of traditional medicine. triphala is also referred to as a "tridoshic rasayana," as it balances the three constitutional elements of life. it tonifies the gastrointestinal tract, improves digestion, and is known to exhibit antiviral, antibacterial, antifungal, and antiallergic properties (sharma, 2015; amala and jeyaraj, 2014; sumathi and parvathi, 2010). triphala mashi (mashi: black ash) was found to have nonspecific antimicrobial activity, as it showed a dose-dependent inhibition of gram-positive and gram-negative bacteria (biradar et al., 2008). hydroalcoholic, aqueous, and ether extracts of the three fruits of triphala were reported to show antibacterial activity against uropathogens, with the maximum efficacy recorded for the alcoholic extract (bag et al., 2013; prasad et al., 2009). the methanolic extract of triphala showed the presence of 10 active compounds by gc-ms and also showed potent antibacterial and antifungal activity (amala and jeyaraj, 2014). triphala has been well studied for its antimicrobial activity against gram-positive bacteria, gram-negative bacteria, fungal species, and different strains of salmonella typhi (amala and jeyaraj, 2014; sumathi and parvathi, 2010; gautam et al., 2012; srikumar et al., 2007).
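the disease-type tallies reported above can be reproduced from a simple record table, as in the short sketch below; the three example records are invented placeholders, not rows from the actual triphala data set.

```python
# count how many distinct bioactives and targets fall under each disease type,
# mirroring the per-class summaries described above. records are illustrative only.
from collections import defaultdict

records = [
    # (bioactive, target, disease indication, disease type)
    ("quercetin",         "bace1",    "alzheimer's disease", "nervous system diseases"),
    ("ellagic acid",      "igf1r",    "obesity",             "metabolic diseases"),
    ("bellericagenin a1", "serpine1", "thrombosis",          "cardiovascular diseases"),
]

bioactives_by_type = defaultdict(set)
targets_by_type = defaultdict(set)
for bioactive, target, indication, disease_type in records:
    bioactives_by_type[disease_type].add(bioactive)
    targets_by_type[disease_type].add(target)

for disease_type in sorted(bioactives_by_type):
    print(disease_type, "->",
          len(bioactives_by_type[disease_type]), "bioactives,",
          len(targets_by_type[disease_type]), "targets")
```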
triphala showed significant antimicrobial activity against enterococcus faecalis and streptococcus mutans grown on tooth substrate, thereby making it a suitable agent for the prevention of dental plaque (prabhakar et al., 2010, 2014). the application of triphala in commercial antimicrobial agents has been explored. a significant reduction in the colony-forming units of oral streptococci was observed after 6% triphala was incorporated in a mouthwash formulation (srinagesh et al., 2012). an ointment prepared from triphala (10% (w/w)) showed significant antibacterial and wound healing activity in rats infected with staphylococcus aureus, pseudomonas aeruginosa, and streptococcus pyogenes (kumar et al., 2008). the antiinfective network of triphala sheds light on the efficacy of the formulation in the simultaneous targeting of multiple microorganisms. this network also provides information regarding some novel bioactive-target combinations that can be explored to combat the problem of multidrug resistance. among the bioactives of triphala, 24 score-1 bioactives target microbial proteins of 22 microorganisms. the botanicals of triphala (eo, tb, and tc) contain 19, 3, and 8 such score-1 bioactives, respectively, which showed interactions with microbial proteins. they act through modulation of 35 targets, which are associated with diseases such as leishmaniasis, malaria, tuberculosis, hepatitis c, acquired immunodeficiency syndrome (aids), cervical cancer, candidiasis, luminous vibriosis, yersiniosis, skin and respiratory infections, severe acute respiratory syndrome (sars), avian viral infection, bacteremia, sleeping sickness, and anthrax (fig. 5.9). in fig. 5.9, the microbial proteome-targeting network of triphala, dark green nodes are the botanicals of triphala, oval green nodes are the score-1 bioactives, and targets, diseases, and microorganisms are represented by blue diamond nodes, red triangle nodes, and pink octagon nodes, respectively. the microorganisms captured in the triphala antiinfective network include candida albicans, hepatitis c virus, human immunodeficiency virus 1, human papillomavirus type 16, human sars coronavirus, leishmania amazonensis, mycobacterium tuberculosis, staphylococcus aureus, plasmodium falciparum, and yersinia enterocolitica. in mycobacterium tuberculosis, dtdp-4-dehydrorhamnose 3,5-epimerase rmlc is one of the four enzymes involved in the synthesis of dtdp-l-rhamnose, a precursor of l-rhamnose (giraud et al., 2000). the network shows that triphala has the potential to modulate this protein through four bioactives: punicalins, terflavin b, 4-o-(s)-flavogallonyl-6-o-galloyl-beta-d-glucopyranose, and 4,6-o-(s,s)-gallagyl-alpha/beta-d-glucopyranose. research on new therapeutics that target the mycobacterial cell wall is in progress. rhamnosyl residues play a structural role in the mycobacterial cell wall by acting as a linker connecting the arabinogalactan polymer to peptidoglycan, and they are not found in humans, which gives them a degree of therapeutic potential (ma et al., 2001). triphala can be considered in this line to develop novel antimycobacterial drugs. the network shows the potential of gallussaeure and 3-galloylgallic acid to modulate human immunodeficiency virus type 1 reverse transcriptase. inhibition of human immunodeficiency virus at the initial stage itself is crucial, and thus targeting human immunodeficiency virus type 1 reverse transcriptase at the preinitiation stage is considered to be an effective therapy. protein e6 of human papillomavirus 16 (hpv16) prevents apoptosis of
infected cells by binding to fadd and caspase 8, and it is hence being targeted for the development of antiviral drugs (yuan et al., 2012). kaempferol of triphala is found to target protein e6 of hpv16, which is a potential mechanism to control the replication of the virus. the network also shows triphala's potential to act on plasmodium falciparum. enoyl-acyl carrier protein reductase (enr) has been investigated as an attractive target due to its important role in membrane construction and energy production in plasmodium falciparum (nicola et al., 2007), while the parasite interacts with human erythrocyte spectrin and other membrane proteins through the m18 aspartyl aminopeptidase protein (lauterbach and coetzer, 2008). trans-3,3',4',5,7-pentahydroxylflavane, epigallocatechin, and epicatechin can modulate both, while epigallocatechin 3-gallate can regulate enoyl-acyl carrier protein reductase, and quercetin and vanillic acid can act on m18 aspartyl aminopeptidase. epigallocatechin 3-gallate can also target 3-oxoacyl-(acyl-carrier protein) reductase, which is a potent therapeutic target because of its role in the type ii fatty acid synthase pathway of plasmodium falciparum (karmodiya and surolia, 2006). epigallocatechin 3-gallate and quercetin are the bioactives that have shown the maximum number of interactions with antimicrobial targets. while epigallocatechin 3-gallate shows interactions with 3-oxoacyl-(acyl-carrier protein) reductase, cpg dna methylase, enoyl-acyl carrier protein reductase, glucose-6-phosphate 1-dehydrogenase, hepatitis c virus serine protease ns3/ns4a, and yoph of plasmodium falciparum, saccharomyces cerevisiae, and spiroplasma monobiae, quercetin acts on 3c-like proteinase (3cl-pro), arginase, beta-lactamase ampc, glutathione reductase, m18 aspartyl aminopeptidase, malate dehydrogenase, and tyrosine-protein kinase transforming protein fps of escherichia coli, fujinami sarcoma virus, human sars coronavirus (sars-cov), leishmania amazonensis, plasmodium falciparum, saccharomyces cerevisiae, and thermus thermophilus. np has gained impetus as a novel paradigm for drug discovery. this approach using in silico data is fast becoming popular due to its cost efficiency and comparably good predictability. thus, network analysis has various applications and promising future prospects with regard to the process of drug discovery and development. np has proven to be a boon for drug research and helps in the revival of traditional knowledge. albeit, there are a few limitations of using np for studying traditional medicine that would hopefully get resolved in the future. the major limitations and possible solutions are listed: 1. nep currently relies on various databases for literature and bioactive mining. databases, though curated, may show discrepancies due to the numerous sources of information and the mix of theoretical and experimental data. moreover, botanicals that undergo certain preparatory procedures during the formulation of the medicine may have constituents that have chemically changed due to these procedures, like boiling, acid/alkali reactions, interactions between the bioactives, etc. a way to navigate around this problem is to make use of modern, high-throughput chemical identification techniques like ultra-performance liquid chromatography-electrospray ionization-tandem mass spectrometry (uplc-esi-ms/ms). this technique will help to identify the exact bioactives or chemical constituents of the formulation, and will enrich the subsequent nep studies. this is because the bioactives form the foundation of any traditional medicine network. 2. absorption, distribution, metabolism, excretion, and toxic effects (admet) parameters associated with the bioactives/formulation when they are administered in the form of the medicine need to be considered in order to extrapolate in silico and cheminformatics data to in vitro and in vivo models. in silico tools that offer the prediction of these parameters can be relied on for this. but traditional medicines are generally accompanied by a vehicle for delivery of the medicine. these vehicles, normally various solvents (water, milk, lemon juice, butter, ghee (clarified butter), honey) that alter the solubility of the bioactives, play a role in regulating admet parameters. experimental validation studies are required to evaluate this principle of traditional medicine. 3.
target identification usually relies on a single or a few databases due to the limited availability of databases with free access. this can occasionally give incomplete results. also, there may be novel targets waiting to be discovered that could be a part of the mechanism of action of the bioactives. to deal with this discrepancy in the network, multiple databases should be considered for target identification. integration of databases serving similar functions can also be a solution to this problem. in addition, experimental validation of the target molecules using protein-protein interaction studies or gene expression studies will provide concrete support for the network predictions. 4. a number of traditional medicines act through multiple bioactives and targets. synergy in botanical drugs helps to balance out the extreme pharmacological effects that individual bioactives may have. the interactions of bioactives with various target proteins, their absorption into the body after possible enzyme degradation, their transport, and finally their physiological effect are a crucial part of traditional medicine (gilbert and alves, 2003). however, in vitro assays and in silico tools are unable to give a clear idea of the complete and exact interactions in a living organism. np is only the cardinal step toward understanding the mechanism of bioactives/formulations, but it gives an overview of the action of traditional medicine that can be used to design in vivo experiments and clinical trials. this saves time and cost in research and invention. 5. it is observed that formulations work by simultaneous modulation of multiple targets. this modulation includes activation of some targets and inhibition of others. in order to understand this complex synergistic activity of a formulation, investigative studies regarding the interactions of ligands with targets are to be carried out. this can be achieved by implementing high-throughput omics studies based on the network data. network pharmacological analysis presents an immense scope for exploring traditional knowledge to find solutions for the current problems challenging the drug discovery industry. nep can also play a key role in new drug discovery, drug repurposing, and rational formulation discovery. many of the bioactive-target combinations have been experimentally studied. the data synthesis using np provides information regarding the mode of action of traditional medicine formulations, based on their constituent bioactives. this is a kind of reverse approach to deduce the molecular mechanism of action of formulations using modern, integrated technologies. the current network analysis is based on the studies that have been conducted and the literature that is available. hence, the data are inconclusive, as a number of studies are still underway and novel data are being generated continuously. despite its limitations, this is still a favorable approach, as it gives insight into the hidden knowledge of our ancient traditional medicine wisdom. np aids the logical analysis of this wisdom, which can be utilized to understand the knowledge as well as to invent novel solutions for current pharmacological problems.
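returning to point 3 above, the following is a minimal sketch of consolidating target predictions from several sources and flagging consensus hits; the database names and predicted target sets are placeholders, not real query results.

```python
# merge target predictions from multiple (hypothetical) sources and keep consensus hits,
# a simple mitigation for the single-database bias discussed in point 3.
from collections import defaultdict

predictions = {
    "binding_db": {"bace1", "serpine1", "f2"},
    "chembl":     {"bace1", "f2", "igf1r"},
    "ttd":        {"f2", "alpl"},
}

support = defaultdict(set)
for source, targets in predictions.items():
    for target in targets:
        support[target].add(source)

# simple consensus rule: keep targets reported by at least two independent sources
consensus = {t for t, sources in support.items() if len(sources) >= 2}
print("all predicted targets:", sorted(support))
print("consensus targets (>= 2 sources):", sorted(consensus))
```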
determination of antibacterial, antifungal, bioactive constituents of triphala by ft-ir and gc-ms analysis antibacterial potential of hydroalcoholic extracts of triphala components against multidrug-resistant uropathogenic bacteria--a preliminary report triphala, ayurvedic formulation for treating and preventing cancer: a review super natural ii--a database of natural products network biology: understanding the cell's functional organization the chembl bioactivity database: an update network analyses in systems pharmacology exploring of antimicrobial activity of triphala mashi-an ayurvedic formulation. evidence-based complement approaches to drug discovery omic techniques in systems biology approaches to traditional chinese medicine research: present and future network pharmacology: an emerging technique for natural product drug discovery and scientific research on ayurveda network pharmacology of ayurveda formulation triphala with special reference to anti-cancer property molecular mechanism research on simultaneous therapy of brain and heart based on data mining and network analysis anti-inflammatory mechanism of qingfei xiaoyan wan studied with network pharmacology chapter 5: network biology approach to complex diseases progress in computational methods for the prediction of admet properties reactome: a database of reactions, pathways and biological processes mechanism study on preventive and curative effects of buyang huanwu decoction in qi deficiency and blood stasis diseases based on network analysis an analysis of chemical ingredients network of chinese herbal formulae for the treatment of coronary heart disease in vivo effects of traditional ayurvedic formulations in drosophila melanogaster model relate with therapeutic applications ayurvedic amalaki rasayana and rasa-sindoor suppress neurodegeneration in fly models of huntington's and alzheimer's diseases tc;mgenedit: a database for associated traditional chinese medicine, gene and disease information using text mining antifungal potential of triphala churna ingredients against aspergillus species associated with them during storage rmlc, the third enzyme of dtdp-l-rhamnose pathway, is a new class of epimerase identification of responsive gene modules by networkbased gene clustering and extending: application to inflammation and angiogenesis use of natural products as chemical library for drug discovery and network pharmacology cvdhd: a cardiovascular disease herbal database for drug discovery and network pharmacology understanding traditional chinese medicine antiinflammatory herbal formulae by simulating their regulatory functions in the human arachidonic acid metabolic network evaluation of anticataract potential of triphala in selenite-induced cataract: in vitro and in vivo studies pharmacology: principles and practice online mendelian inheritance in man (omim), a knowledgebase of human genes and genetic disorders network pharmacology: the next paradigm in drug discovery vnp: interactive visual network pharmacology of diseases, targets, and drugs detection of characteristic sub pathway network for angiogenesis based on the comprehensive pathway network analyses of co-operative transitions in plasmodium falciparum beta-ketoacyl acyl carrier protein reductase upon co-factor and acyl carrier protein binding network pharmacology applications to map the unexplored target space and therapeutic potential of natural products triphala promotes healing of infected full-thickness dermal wound chemo-and bioinformatics resources for in 
silico drug discovery from medicinal plants beyond their traditional use: a critical review network-based drug discovery by integrating systems biology and computational technologies systems pharmacologybased approach for dissecting the addition and subtraction theory of traditional chinese medicine: an example using xiao-chaihu-decoction and da-chaihu-decoction a network pharmacology approach to determine active compounds and action mechanisms of ge-gen-qin-lian decoction for treatment of type 2 diabetes traditional chinese medicinebased network pharmacology could lead to new multicompound drug discovery framework and practice of network-based studies for chinese herbal formula traditional chinese medicine network pharmacology: theory, methodology and application. chin constructing biological networks through combined literature mining and microarray analysis: a lmma approach understanding zheng in traditional chinese medicine in the context of neuro-endocrine-immune network herb network construction and comodule analysis for uncovering the combination rule of traditional chinese herbal formulae network target for screening synergistic drug combinations with application to traditional chinese medicine analysis on correlation between general efficacy and chemical constituents of danggui-chuanxiong herb pair based on artificial neural network network pharmacology study on major active compounds of fufang danshen formula a network pharmacology study of chinese medicine qishenyiqi to reveal its underlying multi-compound, multitarget, multi-pathway mode of action herb network analysis for a famous tcm doctor' s prescriptions on treatment of rheumatoid arthritis. evidence-based complement a novel network pharmacology approach to analyse traditional herbal formulae: the liu-wei-di-huang pill as a case study bindingdb: a web-accessible database of experimentally determined protein-ligand binding affinities network pharmacology study on major active compounds of siwu decoction analogous formulae for treating primary dysmenorrhea of gynecology blood stasis syndrome computational pharmacological comparison of salvia miltiorrhiza and panax notoginseng used in the therapy of cardiovascular diseases triphala and its active constituent chebulinic acid are natural inhibitors of vascular endothelial growth factor-a mediated angiogenesis computational identification of potential microrna network biomarkers for the progression stages of gastric cancer systems pharmacology strategies for anticancer drug discovery based on natural products multiscale modeling of druginduced effects of reduning injection on human disease: from drug molecules to clinical symptoms of disease drug targeting mycobacterium tuberculosis cell wall synthesis: genetics of dtdp-rhamnose synthetic enzymes and development of a microtiter plate-based screen for inhibitors of conversion of dtdp-glucose to dtdp-rhamnose bridging the gap between traditional chinese medicine and systems biology: the connection of cold syndrome and nei network analysis and visualization of biological networks with cytoscape discovery of novel inhibitors targeting enoyl-acyl carrier protein reductase in plasmodium falciparum by structure-based virtual screening global mapping of pharmacological space rediscovering drug discovery network ethnopharmacology approaches for formulation discovery integrative approaches for health: biomedical research, ayurveda and yoga material basis of chinese herbal formulas explored by combining pharmacokinetics with network 
pharmacology the disease and gene annotations (dga): an annotation resource for human disease significant increase in cytotoxic t lymphocytes and natural killer cells by triphala: a clinical phase i study. evid. based complement alternat the human protein atlas as a proteomic resource for biomarker discovery evaluation of antimicrobial efficacy of herbal alternatives (triphala and green tea polyphenols), mtad, and 5% sodium hypochlorite against enterococcus faecalis biofilm formed on tooth substrate: an in vitro study evaluation of antimicrobial efficacy of triphala (an indian ayurvedic herbal formulation) and 0.2% chlorhexidine against streptococcus mutans biofilm formed on tooth substrate: an in vitro study potent growth suppressive activity of curcumin in human breast cancer cells: modulation of wnt/beta-catenin signaling tcmsp: a database of systems pharmacology for drug discovery from herbal medicines anti-diabetic activity of medicinal plants and its relationship with their antioxidant property databases aim to bridge the east-west divide of drug discovery cytoscape: a software environment for integrated models of biomolecular interaction networks network pharmacology analyses of the antithrombotic pharmacological mechanism of fufang xueshuantong capsule with experimental support using disseminated intravascular coagulation rats a network pharmacology approach to understanding the mechanisms of action of traditional medicine: bushenhuoxue formula for treatment of chronic kidney disease triphala inhibits both in vitro and in vivo xenograft growth of pancreatic tumor cells by inducing apoptosis aqueous and alcoholic extracts of triphala and their active compounds chebulagic acid and chebulinic acid prevented epithelial to mesenchymal transition in retinal pigment epithelial cells, by inhibiting smad-3 phosphorylation evaluation of the growth inhibitory activities of triphala against common bacterial isolates from hiv infected patients antibacterial efficacy of triphala against oral streptococci: an in vivo study antibacterial potential of the three medicinal fruits used in triphala: an ayurvedic formulation dissection of mechanisms of chinese medicinal formula realgar-indigo naturalis as an effective treatment for promyelocytic leukemia a network study of chinese medicine xuesaitong injection to elucidate a complex mode of action with multicompound, multitarget, and multipathway phytochemical and pharmacological review of da chuanxiong formula: a famous herb pair composed of chuanxiong rhizoma and gastrodiae rhizoma for headache in silico analysis and experimental validation of active compounds from fructus schisandrae chinensis in protection from hepatic injury discovery of molecular mechanisms of traditional chinese medicinal formula si-wu-tang using gene expression microarray and connectivity map biology-oriented synthesis a network pharmacology approach to evaluating the efficacy of chinese medicine using genome-wide transcriptional expression data identifying roles of "jun-chen-zuo-shi" component herbs of qishenyiqi formula in treating acute myocardial ischemia by network pharmacology network-based global inference of human disease genes the study on the material basis and the mechanism for anti-renal interstitial fibrosis efficacy of rhubarb through integration of metabonomics and network pharmacology a systems biology-based approach to uncovering the molecular mechanisms underlying the effects of dragon's blood tablet in colitis, involving the integration of chemical analysis, 
adme prediction, and network pharmacology study on action mechanism of adjuvant therapeutic effect compound ejiao slurry in treating cancers based on network pharmacology alternative medicine. intech network pharmacological research of volatile oil from zhike chuanbei pipa dropping pills in treatment of airway inflammation navigating traditional chinese medicine network pharmacology and computational tools modularity-based credible prediction of disease genes and detection of disease subtypes on the phenotype-gene heterogeneous network small molecule inhibitors of the hpv16-e6 interaction with caspase 8 an integrative platform of tcm network pharmacology and its application on a herbal formula network understanding of herb medicine via rapid identification of ingredient-target interactions dbnei2.0: building multilayer network for drug-neidisease systems pharmacology dissection of the anti-inflammatory mechanism for the medicinal herb folium eriobotryae network pharmacology study on the mechanism of traditional chinese medicine for upper respiratory tract infection a network pharmacology approach to determine active ingredients and rationality of herb combinations of modifiedsimiaowan for treatment of gout a co-module approach for elucidating drug-disease associations and revealing their molecular basis deciphering the underlying mechanisms of diesun miaofang in traumatic injury from a systems pharmacology perspective network pharmacology-based prediction of the multi-target capabilities of the compounds in taohong siwu decoction, and their application in osteoarthritis a network-based analysis of the types of coronary artery disease from traditional chinese medicine perspective: potential for therapeutics and drug discovery therapeutic target database update 2012: a resource for facilitating target-oriented drug discovery multi-target therapeutics: when the whole is greater than the sum of the parts

key: cord-346606-bsvlr3fk authors: siriwardhana, yushan; gür, gürkan; ylianttila, mika; liyanage, madhusanka title: the role of 5g for digital healthcare against covid-19 pandemic: opportunities and challenges date: 2020-11-04 journal: nan doi: 10.1016/j.icte.2020.10.002 sha: doc_id: 346606 cord_uid: bsvlr3fk

the covid-19 pandemic caused a massive impact on healthcare, social life, and economies on a global scale. apparently, technology has a vital role to enable ubiquitous and accessible digital health services in pandemic conditions as well as against a "re-emergence" of the covid-19 disease in a post-pandemic era.
accordingly, 5g systems and 5g-enabled e-health solutions are paramount. this paper highlights methodologies to effectively utilize 5g for e-health use cases and its role in enabling relevant digital services. it also provides a comprehensive discussion of the implementation issues, possible remedies, and future research directions for 5g to alleviate the health challenges related to covid-19. the recent spread of coronavirus disease (covid-19) due to severe acute respiratory syndrome coronavirus 2 (sars-cov-2) [1] has caused substantial changes in the lifestyle of communities all over the world. by the end of june 2020, at the time of this writing, over eleven million positive cases of covid-19 were recorded, causing over 500,000 deaths. countries have been facing a number of healthcare, financial, and societal challenges due to the covid-19 pandemic. healthcare facilities overwhelmed by the rapid growth in new covid-19 patients are experiencing interruptions in the provision of regular health services. moreover, healthcare personnel are also becoming vulnerable to covid-19, and this is taxing the healthcare resources even more. to stop the wide spread of the virus, governments imposed strict restrictions and controls on travel within and between countries, negatively affecting the economies. while remote work was adopted as an alternative with limitations, certain jobs became obsolete. the increased unemployment is a burgeoning problem even for strong economies. apart from that, government expenditure on the unemployed workforce and the loss of income from sectors associated with tourism, such as airlines, hotels, local transport, and entertainment, were major challenges for the economies. governments had to introduce new guidelines on social distancing to prevent the spread of the virus. this resulted in closing schools, isolating cities, and even restricting public interactions, affecting the regular lifestyle of people. such disruption could lead to unprecedented consequences such as the loss of physical and mental well-being. maintaining societal well-being during the covid-19 era is therefore a daunting task. technological advancement is one of the key strengths of the current era for overcoming the challenging circumstances of the covid-19 outbreak. the timely application of relevant technologies will be imperative not only to safeguard people and economies, but also to manage the post-covid-19 world. novel ict technologies such as the internet of things (iot) [2], artificial intelligence (ai) [3], big data, 5g communications, cloud computing, and blockchain [4] can play a vital role in facilitating an environment that fosters the protection and improvement of people and economies. the capabilities they provide for pervasive and accessible health services are crucial to alleviate the pandemic related problems. 5g communications present a paradigm shift from the present mobile networks to provide universal high-rate connectivity and a seamless user experience [5]. 5g networks target delivering 1000x higher mobile data volume per area, 100x higher number of connected devices, 100x higher user data rate, 10x longer battery life for low power massive machine communications, and 5x reduced end-to-end (e2e) latency [6].
these objectives will be realized by key technologies such as mmwaves, small cell networks, massive multiple input multiple output (mimo), and beamforming [7]. by utilizing these technologies, 5g will mainly support three service classes, i.e., enhanced mobile broadband (embb), ultra reliable and low latency communication (urllc), and massive machine type communication (mmtc). the novel 5g networks will be built alongside fundamental technologies such as software defined networking (sdn), network function virtualization (nfv), multi-access edge computing (mec), and network slicing (ns). sdn and nfv enable programmable 5g networks to support the fast deployment and flexible management of 5g services. mec extends intelligence to the edge of the radio network along with higher processing and storage capabilities. ns creates logical networks on a common infrastructure to enable different types of services with 5g networks. these 5g technologies will enable ubiquitous digital health services combating covid-19, described in the following section as 5g based healthcare use cases. however, there are also implementation challenges which need to be mitigated for efficient and high-performance solutions with wide availability and user acceptance, as discussed in section 3. in this work, we elaborate on these aspects and provide an analysis of 5g for healthcare to fight against the covid-19 pandemic and its consequences. the capabilities of 5g technologies can be effectively utilized to address the challenges associated with covid-19 presently and in the post-covid-19 era. existing healthcare services should be tailored to fit the needs of the covid-19 era while developing novel solutions to address the specific issues that originated with the pandemic. in this section, the paper discusses several use cases where 5g is envisaged to play a significant role. these use cases are depicted in figure 1, and the technical requirements of the use cases are outlined in table 1. telehealth is the provision of healthcare services in a remote manner with the use of telecommunication technologies [8]. these services include remote clinical healthcare, health related education, public health, and health administration, defining a broader scope of services. telemedicine [9] refers to remote clinical services such as healthcare delivery, diagnosis, consultation, and treatment, where a healthcare professional utilizes communication infrastructure to deliver care to a patient at a remote site. telenursing refers to the use of telecommunication technologies to deliver nursing care and conduct nursing practice. telepharmacy is defined as a service which delivers remote pharmaceutical care via telecommunications to patients who do not have direct contact with a pharmacist (e.g., remote delivery of prescription drugs). telesurgery [10] allows surgeons to perform surgical procedures over a remote distance. all these healthcare related teleservices are highly encouraged in the post-covid-19 period due to multiple reasons. the lack of resources (i.e., hospital capacity, human resources, protective equipment) in healthcare facilities due to existing covid-19 patients, the social distancing guidelines imposed by authorities, the requirement to maintain regular healthcare services while adhering to the new guidelines imposed by healthcare administrations, and the need to minimize the risk of healthcare professionals being exposed to covid-19 are factors motivating healthcare-related teleservices.
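as a rough illustration of how such use-case requirements map onto the three service classes, the toy python mapping below can be used; the numeric values and the selection rule are indicative assumptions for demonstration, not figures taken from the paper's table 1.

```python
# toy mapping of e-health use cases to 5g service classes. values are assumptions.
USE_CASE_REQUIREMENTS = {
    # use case:                  (max e2e latency ms, downlink mbps, listed service class)
    "telemedicine 4k video":     (100, 25,  "embb"),
    "telesurgery":               (20,  50,  "urllc"),
    "uav control link":          (20,  1,   "urllc"),
    "miot patient monitoring":   (500, 0.1, "mmtc"),
}

def pick_service_class(latency_ms, rate_mbps):
    """simplified selector reflecting the embb/urllc/mmtc split (ignores device density)."""
    if latency_ms <= 20:
        return "urllc"
    if rate_mbps >= 25:
        return "embb"
    return "mmtc"

for use_case, (latency, rate, listed) in USE_CASE_REQUIREMENTS.items():
    print(f"{use_case}: suggested = {pick_service_class(latency, rate)} (listed as {listed})")
```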
these teleservices sometimes have strict requirements and call for sophisticated underlying technologies for proper functionality. as an example, a telemedicine follow-up visit between the patient and the doctor would require 4k/8k video streaming with low latency and low jitter. telehealth based remote health education programs should be accessible to students from anywhere via an internet connection with adequate bandwidth. monitoring patients via telenursing also requires an uninterrupted hd/4k video stream between the patient and the nurse. remote delivery of drugs is possible via unmanned aerial vehicles (uavs), which requires assured connectivity with the base station to send/receive control instructions without delays. extreme use cases like telesurgery require ultra-low latency communication (less than 20 ms e2e latency) between the surgeon and the patient, and connectivity between a number of devices such as cameras, sensors, robots, augmented reality (ar) devices, wearables, and haptic feedback devices [11]. the future 5g networks will use the mmwave spectrum, which leads to the deployment of ultra-dense small cell networks, including network connectivity for indoor environments. technologies like massive mimo combined with beamforming will contribute to providing extremely high data rates for a large number of intended users. these technologies together provide better localization for indoor environments [12]. these 5g technologies realize the embb service class, which facilitates the transmission of 4k/8k videos between the healthcare professional and the patient, irrespective of the location of access. the new radio access technology developed for 5g networks, also known as 5g new radio (nr), supports urllc. the urllc service class helps to realize the ultra-low latency requirements of telesurgery applications. a local 5g operator (l5go) that has its core and access network deployed locally on premises can serve the healthcare facility with multiple base stations deployed both outdoors and indoors to provide connectivity for case specific needs. this deployment is beneficial for the telesurgery use case, given that there is a requirement for the surgeon and patient to be in separate rooms due to the pandemic situation. mec servers deployed at the 5g base stations can be utilized to deploy the control functions for uavs for proper payload deliveries. the fundamental design changes in 5g networks will enable the communication of a large number of iot devices, which usually transfer less data compared to human activities such as streaming. these mmtc services provide support for 5g enabled medical iot (miot) devices that can be used to monitor and treat remote patients. mmtc will connect and enable communication between heterogeneous devices in the 5g network so that they can operate in synchronicity. a sensor in a wearable device of the patient can immediately send a signal to the remote nurse via the 5g network so that the nurse can activate special equipment in the patient's room using a mobile device. the use of 5g technologies in a hospital environment for telehealth use cases is illustrated in figure 2. the spread of the covid-19 disease demands the rapid launching of new healthcare services/applications, changes in the way present healthcare services are provided [13], and the integration of modern tools such as ai and machine learning (ml) in the data analysis process [14].
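the sub-20 ms telesurgery budget discussed above can be checked with a simple latency-budget sketch; the per-segment delay values are assumptions chosen only to illustrate why a local 5g deployment with mec helps keep the budget compared to a remote cloud path.

```python
# compare two deployment options against the telesurgery e2e latency budget.
# all per-segment delays (ms) are illustrative assumptions, not measured values.
E2E_BUDGET_MS = 20.0

def e2e_latency(segments):
    """sum per-segment one-way delays (ms) for a given deployment option."""
    return sum(segments.values())

local_5g_with_mec = {
    "radio access (5g nr)": 1.0,
    "on-premises core / mec processing": 3.0,
    "haptic/video codec": 8.0,
    "robot actuation": 4.0,
}
remote_cloud_path = {
    "radio access (5g nr)": 1.0,
    "transport to remote cloud": 18.0,
    "cloud processing": 4.0,
    "haptic/video codec": 8.0,
    "robot actuation": 4.0,
}

for name, segments in [("local 5g + mec", local_5g_with_mec),
                       ("remote cloud", remote_cloud_path)]:
    total = e2e_latency(segments)
    verdict = "meets" if total <= E2E_BUDGET_MS else "exceeds"
    print(f"{name}: {total:.1f} ms -> {verdict} the {E2E_BUDGET_MS} ms budget")
```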
a new application can collect the data of covid-19 patients from different healthcare centers, upload the data to a cloud server, and make the information available to the public so that others can rely on the information for different purposes. live video conferencing based interactive applications, which enable healthcare professionals to consult patients and help them, are another example [15]. other applications would perform regular health monitoring of patients, such as follow-up visits, provide instructions on medical services, and spread knowledge on the present covid-19 situation and up-to-date precautions. the difficulty during the pandemic was that there was a need to automate most of the regular work to minimize interaction between people, and new application development needs also arose suddenly. this calls for a flexible network infrastructure which supports the development of such applications within a short period of time. in contrast to the present 4g networks, 5g supports the creation of new network services as softwarized network functions (nfs) by utilizing sdn and nfv technologies. these nfs can be hosted at cloud servers, on operator premises, or at the edge of the network based on the application demands. mec servers, which are equipped with storage and computing power and reside at the edge of the radio network, will be a suitable platform to host these applications. the deployment of such applications will be more flexible in 5g networks because of sdn and nfv. bringing the nfs towards the edge eliminates the dependency on the infrastructure beyond the edge, making the applications more reliable. increasing the capacity of the 5g network is also much easier because the network itself is programmable. 5g networks are capable of deploying network slices which create logical networks to cater for services with similar types of requirements, such as an iot slice and a low latency slice, thereby serving applications with guaranteed service levels. a surge in demand for personal protective equipment (ppe), ventilators, and certain drugs was observed at the beginning of the covid-19 spread, causing an imbalance in the regular supply chains [16]. manufacturing plants were unable to maintain regular production due to the shortage of raw materials and labor, and therefore they were not capable of responding to the increased demand for goods. the supplies of finished products were also delayed due to transport restrictions, and there were no proper alternative distribution mechanisms to ensure that the people who really needed them would receive them. n95 masks, hand sanitizers, and regular medicine are some of the goods where this imbalance of supply was often seen. those who reacted quickly could stock items in surplus, while others in need did not receive them. donations to the victims were not always distributed in a fair manner because of the absence of centralized management systems. delivery of the items to the final consumer was a concern due to the risk of covid-19 spread and the restrictions imposed by the authorities to limit physical contact. it is a challenge for governments, healthcare authorities, and distributors to implement proper mechanisms to manage the supply chains of healthcare items in the covid-19 period.
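a hedged sketch of how such a rapidly launched e-health service could be described as a slice/nf deployment request handed to an nfv/sdn orchestrator is shown below; the field names and the orchestrator interface are illustrative assumptions rather than a real 3gpp/etsi template.

```python
# illustrative slice/nf request for a covid-19 telehealth service. field names are
# assumptions; a real orchestrator would consume standardized nst/nsd descriptors.
import json

slice_request = {
    "name": "covid19-followup-video",
    "service_class": "embb",
    "qos": {"max_latency_ms": 100, "min_downlink_mbps": 25, "max_jitter_ms": 10},
    "placement": "mec",                      # host the nfs at the edge for reliability
    "network_functions": [
        {"nf": "video-conferencing-server", "vcpu": 4, "ram_gb": 8},
        {"nf": "patient-data-cache", "vcpu": 2, "ram_gb": 4},
    ],
    "scaling": {"min_instances": 1, "max_instances": 5, "cpu_scale_out_threshold": 0.7},
}

# payload a hypothetical orchestration api would accept for rapid service launch
print(json.dumps(slice_request, indent=2))
```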
to address the issues in healthcare-related supply chains, industries can adopt smart manufacturing techniques equipped with iot sensor networks, automated production lines which dynamically adapt to variations in demand, and sophisticated monitoring systems. iot based supply chains could be used to properly track products from the manufacturing plant to the end consumer, i.e., connected goods. uav based automated delivery mechanisms are especially suited to the covid-19 situation, delivering medicine, vaccines, and masks to the end consumer while minimizing physical contact. 5g supports direct connectivity for iot and mmtc between iot devices. this will fuel the possibility of using large numbers of iot devices to increase the efficiency of supply chains. deployment of l5gos to serve the needs of industries is a better way to integrate iot sensors, actuators, and robots directly into the 5g network, enabling a 5g based smart manufacturing system. proper network connectivity for the sensors, actuators, and robots in the manufacturing plants will be enabled by mmwave 5g small cells deployed indoors. massive mimo will provide connectivity for a large number of devices, and beamforming ensures a better quality of network connection. the direct connectivity of goods to the 5g system makes supply chains more transparent. mec integrated with 5g can be used to process the data locally to improve the scalability of the systems as well as the security and privacy of collected data. moreover, mec integrated with 5g can easily be used to implement decentralized solutions via blockchain [17], [18]. the delivery of items to the final destination can be performed via beyond line-of-sight (blos) uavs guided by the 5g network. this could minimize unnecessary interactions in the covid-19 period and reduce human effort. real-time data is available to authorized users for monitoring and tracking, which increases the transparency of the operation. covid-19 positive patients with mild conditions are usually advised to self-isolate to prevent further spread. while self-isolation is a better alternative for managing the capacity of healthcare facilities, the self-isolating individuals should be properly monitored to make sure that they follow the self-isolation guidelines. the challenge is to track every movement of the patient, which is currently impossible. in the event of a violation of self-isolation guidelines, control instructions should be sent. mobile device based self-isolation monitoring is possible via an application which sends gps data from the patient's mobile phone at random intervals to a cloud server. wearable devices attached to the patient's body use their sensors to measure the conditions of the patient and upload the data via the mobile phone. uav based solutions can monitor the conditions of the patients from a distance. uavs can monitor body temperature via infrared thermography and identify the person via face recognition algorithms. moreover, contact tracing of identified positive cases is extremely important [19]. however, present contact tracing mechanisms involve significant human engagement and consist of a lot of manual work. this hinders the effectiveness of contact tracing, since manual tracing does not guarantee that all possible close contacts are identified.
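the tamper-evident tracking idea mentioned above (decentralized solutions via blockchain at the mec) can be sketched as a hash-chained custody log; this is a simplification for illustration, not a full blockchain protocol, and the event fields are assumptions.

```python
# hash-chained custody log for a shipment: each event references the hash of the
# previous one, so later edits become detectable. a toy stand-in for a blockchain ledger.
import hashlib, json, time

def add_event(chain, shipment_id, location, status):
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    event = {
        "shipment_id": shipment_id,
        "location": location,
        "status": status,
        "timestamp": time.time(),
        "prev_hash": prev_hash,
    }
    event["hash"] = hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()
    chain.append(event)
    return event

def verify(chain):
    for i, event in enumerate(chain):
        body = {k: v for k, v in event.items() if k != "hash"}
        if event["hash"] != hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest():
            return False
        if i > 0 and event["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

ledger = []
add_event(ledger, "ppe-batch-001", "manufacturing plant", "packed")
add_event(ledger, "ppe-batch-001", "regional warehouse", "in transit")
add_event(ledger, "ppe-batch-001", "hospital pharmacy", "delivered by uav")
print("ledger intact:", verify(ledger))
```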
bluetooth low energy (ble) based contact tracing applications use ble wearable devices, which advertise their ids periodically so that other compatible devices in close proximity can capture the id and store it along with important details such as the timestamp and, optionally, gps location data. once an infected covid-19 patient is detected, the ble solution provides the ids of the close contacts over a defined period. ble based solutions identify contacts within a range of a few meters, whereas pure gps based solutions do not have that accuracy. regarding the role of 5g, mmtc is responsible for the massive connectivity of heterogeneous iot devices such as sensors, wearables, and robots. the small cell networks equipped with mimo and beamforming in 5g will ensure better connectivity and positioning, including in indoor environments. hence, iot devices directly connected to the 5g network can be effectively used to monitor compliance with self-isolation. instead of using general mobile device data, patients can be equipped with low-power wearable devices which transfer data via ble technology. these sensor data can be uploaded to the cloud via the 5g network, and authorized parties can monitor the behavior of the patient. a similar concept can be applied to contact tracing, where the wearable ble devices collect data of nearby devices and upload them to the cloud via the 5g network. once a patient tests positive, all the close contact details are already in the cloud, and the contacts can be notified to take proper safety measures such as self-isolation. mec servers deployed at the 5g base stations are useful to increase the scalability of the operation as the resource demand increases. allocating a separate network slice for contact tracing data transfer is a better approach to assure the quality of service (qos) and strengthen the privacy and security of the data. alongside the use cases for 5g concerning healthcare and the fight against covid-19, there are also imminent challenges ranging from technical ones such as scalability to socio-economic ones including technology acceptance. the impact of pertinent deployment challenges on each use case is depicted in table 2. a video recording of a telemedicine session may contain personal information which the patient would like to disclose only to the doctor. in addition, automated contact tracing applications aggregate sensitive location data without the owners' knowledge. sharing such sensitive user data with unauthorized parties such as third-party advertisers is a serious privacy violation [27]. in addition, privacy protection is a legal requirement, which is posed by various legal frameworks such as gdpr [28] and the health insurance portability and accountability act (hipaa) [29]. to address the privacy challenge, solutions like privacy-by-design [30] and software defined privacy [31] have to be deployed with 5g health applications already at the design phase. privacy-by-design relies on the notion that data controllers and processors should be proactive in addressing the privacy implications of any new or upgraded system, procedure, policy, or data-sharing initiative, not at the later stages of its life-cycle, but starting from its planning phase [32]. the e-health solutions developed for 5g should consider the entire life-cycle of health data when protecting privacy. to protect privacy, access control methods managing how different parties access information are necessary.
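tying the ble contact-tracing flow above to the privacy discussion, the following sketch pseudonymizes device ids with a salted hash before upload and then finds the close contacts of a positive case from the pseudonymous log; the field names, salt handling, and the 14-day look-back are assumptions for illustration.

```python
# pseudonymized contact matching: ids are salted-hashed on device, and close contacts
# of a positive case are resolved from the uploaded log without exposing raw ids.
import hashlib
from datetime import datetime, timedelta

SALT = b"per-deployment-secret"   # would be managed by the health authority

def pseudonym(device_id: str) -> str:
    return hashlib.sha256(SALT + device_id.encode()).hexdigest()[:16]

# uploaded log entries: (observer, observed, timestamp), already pseudonymized on-device
contact_log = [
    (pseudonym("dev-A"), pseudonym("dev-B"), datetime(2020, 7, 1, 10, 0)),
    (pseudonym("dev-A"), pseudonym("dev-C"), datetime(2020, 7, 1, 10, 5)),
    (pseudonym("dev-D"), pseudonym("dev-B"), datetime(2020, 7, 2, 9, 30)),
]

def close_contacts(positive_device_id, log, now=datetime(2020, 7, 3), lookback_days=14):
    cutoff = now - timedelta(days=lookback_days)   # fixed "now" only for the example
    pid = pseudonym(positive_device_id)
    contacts = set()
    for observer, observed, ts in log:
        if ts < cutoff:
            continue
        if observer == pid:
            contacts.add(observed)
        elif observed == pid:
            contacts.add(observer)
    return contacts

print(close_contacts("dev-A", contact_log))   # pseudonyms of devices to be notified
```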
edge computing is beneficial to minimize data transmissions through different network elements and enable local processing, improving privacy aspects [33] . furthermore, users of e-health technology should be made fully aware of what they are consenting to regarding data sharing and processing when they are using such digital solutions. similarly, transparency in the form of informing users of potential privacy risks are effective to improve the adoption of e-health solutions [34] . attempts by adversaries to attack the databases containing sensitive information pose security risks. the importance of e-health systems exacerbates the impact of attacks on the availability requirement. the integration of miot increases security risks of healthcare systems. such low-end devices are comparably easy to hack and vulnerable to denial-of-service (dos) attacks. massive amount of connected devices increases the number of entry points for attackers to perform unauthorized operations, i.e. increases the attack surface, on the healthcare system [35] . lightweight and scalable security mechanisms must be designed to secure miots. adequate security mechanisms are crucial to address the limited capabilities of constrained sensors, as well as the additional vulnerabilities if part of the security functions are offloaded to the cloud. for the digital health services, widespread automation, data analytics and smart control requires ml and ai techniques in 5g systems. encrypted data transmission and distributed security solutions such as blockchain can prevent attackers gain access to the network and protect the collected user data of different premises. the employed security mechanisms and algorithms should support continuous updates with minimal effort to adapt to discovered vulnerabilities and emerging security threats. regime a rapid deployment of new healthcare applications will add extra traffic as well as increase the number of 5g users who access such services. this will lead to increased network congestion. as an example, ar based applications used in telemedicine require high bandwidth and low latency. however, a congested network fails to satisfy the service levels for such applications. moreover, it is challenging to manage billions of miots. when a large number of iot devices generate ad-hoc data transfers, the network should be scalable to cope with the increased number of traffic events. the small data characteristics and intermittent connectivity of iot encumber the medium access and physical layers of access networks serving ehealth applications. ns in 5g with dynamic scalability is a possible solution to address this problem. the slices serve similar type of services and they can be made adaptive based on the various parameters such as priority of the service, present network traffic, available network resources, qos requirement, number of iot devices presently connected [36] . deployment of virtual nf based on demand at the mec servers will provide a solution to the congestion due to sudden increase of localized demands. for improving scalability, edge computing systems and distributed clouds can perform visual processing on large computational capabilities like gpus and transmit the audiovisual outputs enriched with analytics results to mobile e-health devices. in this way, the impact from device limitations is elastically minimized while congestion towards core network is also mitigated. 
regarding the physical layer, phy techniques such as full beamforming using a large number of antenna elements increase scalability, frequency utilization efficiency, and communication speed. network operators need to deploy these 5g-based solutions as soon as possible. the limited deployment of 5g networks and the limited availability of 5g devices will be an immediate problem for many countries. undoubtedly, 5g proliferation is expected to be gradual in terms of network connectivity and capacity. the complexity and implementation issues of 5g devices, including the power consumption due to high-frequency transmissions as well as multi-band support of upper and lower frequency bands, add to device cost and production challenges. governments and network operators should push forward their deployment plans. moreover, small-scale 5g deployments such as l5go networks [37] should be encouraged in hospitals and manufacturing plants [38]. purpose-built iot devices with smaller but targeted capabilities for e-health use-cases can alleviate the complexity and cost issues regarding the deployment and commissioning of 5g systems. from the business perspective, offering a discount to mobile operators bidding in spectrum auctions in exchange for an improved coverage commitment can expedite 5g deployment. for improving coverage in poorly served areas, some spectrum bands can be shared by different network providers. from the cost minimization perspective, ran sharing allows multiple operators to use the same radio access infrastructure and enables an easier coverage expansion for 5g.

incidents such as the destruction of cellular base stations [39, 40] due to conspiracy theories linking new 5g mobile networks and the covid-19 pandemic [41] disrupt connectivity and affect the applications. however, network connectivity and service continuity are critical for connected e-health solutions. 5g solutions may require the user to possess a sophisticated level of technical literacy, which many people lack. ease of use is an important factor that supports or inhibits the implementation of e-health systems: health personnel are deterred from, or resistant to, using new systems that add complexity to their workflows or require additional effort and time [42]. furthermore, 5g devices are significantly more expensive, leading to a cost burden on users. experts and the media have a responsibility to dispel these inaccurate social beliefs with the support of civil society and governments. the applications can be made easier to use and able to run on average hardware and devices, so that everyone can afford and use the services. for e-health solutions supporting physician-patient interaction, an effective clinical decision support system must minimise the effort required by clinicians to receive and act on system recommendations. this requirement extends to ease of use for patients, their family members, other service users, and even health professionals besides clinicians, such as nurses [42]. solutions for remote monitoring and contact tracing will result in legal issues unless the sensitive personal data is properly handled; examples are contact tracing after the patient has recovered from covid-19, and collecting and storing unnecessary data from personal devices.
since access to healthcare is a right, if technical solutions prevent people from obtaining timely healthcare or cause a wrong diagnosis or treatment, that is an issue concerning fundamental rights. 5g-enabled smart devices for e-health will have a far-reaching impact on manufacturers, service companies, insurers and consumers, and such a situation could also lead to legal issues. adhering to the policies defined by standardization bodies, such as the eu statement on contact tracing [43], prevents legal issues. standardisation and regulation must cover the whole healthcare technology chain, from medical device technologies to software technologies, including sensors. obtaining legal advice before the deployment of different applications would also prevent future legal issues. traditional product liability, limited to tangible personal property, should be extended to the correct functioning of networks and services in e-health solutions. this is more challenging due to the complex environment of 5g; therefore, root-cause analysis techniques and pervasive monitoring functions are important [35].

healthcare sectors were the first to be affected by the spread of the covid-19 disease, facing numerous challenges. as countries now have control mechanisms in place to minimize the spread of covid-19, they are reopening their economies so that the public can resume their regular lifestyle. to prevent any "re-emergence" of the disease, the healthcare sector of each country must be equipped with novel solutions to address any emerging challenges effectively. to this end, 5g technologies are crucial. 5g utilizes mmwave frequencies of the radio spectrum with small cell base stations, which will provide better connectivity, including in indoor environments, via its nr. massive mimo combined with beamforming will serve a large number of 5g devices/users with guaranteed data rates. these technologies deliver the embb, urllc and mmtc service classes, which enable the development of different types of services using 5g networks such as ar, uav communication, and collaborative robots. together with 5g, mec and ns will improve flexibility, scalability, guaranteed service levels and security for the applications. hence, solutions developed using 5g technologies serve various health-related use cases such as telehealth, supply chain management, self-isolation and contact tracing, and rapid health service deployments. however, a wide range of implementation challenges such as privacy/security, scalability, and societal issues should be addressed before deploying such applications with full functionality.

references:
- severe acute respiratory syndrome coronavirus 2 (sars-cov-2) and corona virus disease-2019 (covid-19): the epidemic and the challenges
- smart home-based iot for real-time and secure remote health monitoring of triage and priority system using body sensors: multidriven systematic review
- role of biological data mining and machine learning techniques in detecting and diagnosing the novel coronavirus (covid-19): a systematic review
- a proposed solution and future direction for blockchain-based heterogeneous medicare data in cloud environment
- five disruptive technology directions for 5g
- scenarios for 5g mobile and wireless communications: the vision of the metis project
- what will 5g be?
- how about actively using telemedicine during the covid-19 pandemic?
- m-health solutions using 5g networks and m2m communications
- transformation in healthcare by wearable devices for diagnostics and guidance of treatment
- single- and multiple-access point indoor localization for millimeter-wave networks
- realtime smart patient monitoring and assessment amid covid19 pandemic - an alternative approach to remote monitoring
- ai-driven tools for coronavirus outbreak: need of active learning and cross-population train/test models on multitudinal/multimodal data
- design and develop a video conferencing framework for realtime telemedicine applications using secure group-based communication architecture
- lessons from operations management to combat the covid-19 pandemic
- the role of blockchain in 6g: challenges, opportunities and research directions, in: 2020 2nd 6g wireless summit (6g summit)
- how can blockchain help people in the event of pandemics such as the covid-19?
- a flood of coronavirus apps are tracking us. now it's time to keep track of them
- tactile-internet-based telesurgery system for healthcare 4.0: an architecture, research challenges, and future directions
- 5g mobile and wireless communications technology
- survey on multi-access edge computing for internet of things realization
- 2017 ieee 85th vehicular technology conference
- the efficacy of contact tracing for the containment of the 2019 novel coronavirus (covid-19)
- 5g technology for augmented and virtual reality in education
- telepharmacy services: present status and future perspectives: a review
- for telehealth to succeed, privacy and security risks must be identified and addressed
- eu data protection rules
- department of health & human services, health insurance portability and accountability act of 1996 (hipaa)
- a systematic literature review on privacy by design in the healthcare sector
- 2016 ieee international conference on cloud engineering workshop (ic2ew)
- privacy by design: informed consent and internet of things for smart health
- privacy techniques for edge computing systems
- first, design for data sharing
- inspire-5gplus: intelligent security and pervasive trust for 5g and beyond networks
- dynamic network slicing for multitenant heterogeneous cloud radio access networks
- micro operators to boost local service delivery in 5g
- micro-operator driven local 5g network architecture for industrial internet
- mast fire probe amid 5g coronavirus claims
- at least 20 uk phone masts vandalised over false 5g coronavirus claims
- covid19 and the 5g conspiracy theory: social network analysis of twitter data
- 5g-ppp white paper on ehealth vertical sector
- statement on essential principles and practices for covid-19 contact tracing applications

acknowledgment: this work is partly supported by the european union in response 5g (grant no: 789658) and the academy of finland in 6genesis (grant no. 318927).

key: cord-327401-om4f42os authors: bombelli, alessandro title: integrators' global networks: a topology analysis with insights into the effect of the covid-19 pandemic date: 2020-08-11 journal: j transp geogr doi: 10.1016/j.jtrangeo.2020.102815 sha: doc_id: 327401 cord_uid: om4f42os

in this paper we propose, to the best of our knowledge, the first analysis of the global networks of integrators fedex, ups, and dhl using network science. while noticing that all three networks rely on a "hub-and-spoke" structure, the network configuration of dhl leans towards a multi-"hub-and-spoke" structure that reflects the different business strategy of the integrator.
we also analyzed the robustness of the networks, identified the most critical airports per integrator, and assessed that the network of dhl is the most robust according to our definition of robustness. finally, given the unprecedented historical time that the airline industry is facing at the moment of writing, we provided some insights into how the covid-19 pandemic affected the global capacity of integrators and other cargo airlines. our results suggest that full-cargo airlines and, much more dramatically, combination airlines were impacted by the pandemic. on the other hand, apart from fluctuations in offered capacity due to travel bans that were quickly recovered thanks to the resilience of their networks, integrators seem to have escaped the early months of the pandemic unscathed. air cargo transportation plays a role of paramount importance in the global economy, especially when time and safety are crucial factors. in fact, while roughly 1% of the overall cargo volume worldwide is transported via air, the percentage spikes to 35% if value is used as a measure (iata website, 2020). as example, transport of high-value, perishable, or emergency-related products is generally carried out via air, because it is the only mode that guarantees shipping times consistent with the user's requirements and needs. at the time of writing, this factor is even more crucial because of the covid-19 pandemic. transport of lifesaving medical devices (e.g., ventilators) and masks to help people worldwide contrast the disease has been possible only with air transport (aviation business website, 2020). air cargo transport can be carried out in two ways: (i) in the belly space of passenger aircraft, and (ii) using dedicated full freighter aircraft. the first option offers more flexibility in terms of frequencies and destinations, but a limited cargo capacity per aircraft. this cargo capacity can also suffer from unexpected variations, because it depends on how much luggage passengers check in for a specific flight (morrell and klein, 2018; delgado et al., 2020) . given the aforementioned two transport strategies, three different air cargo service providers can be identified: (i) passenger airlines offering cargo services (also known as combination airlines), (ii) full-cargo airlines, and (iii) integrators. differently from (i) and (ii), that only offer air transport services between airports and rely on freight forwarders and ground handlers for the landside logistics, integrators offer a door-to-door service to customers. the american fedex and ups, and the european dhl are the "threeheaded" kings of the integrator business worldwide. tnt, another important integrator in the past, was acquired by fedex in 2016. on the other hand, amazon has recently invested into the creation of its own aircraft fleet, i.e., amazon air, hence paving the way to become de facto a fourth major integrator. while, until now, all its fleet was leased from other cargo airlines, by 2021 amazon air will own more than 70 full freighters (the motley fool website, 2020) . similarly to other transportation modes, network science can be used to study characteristics, similarities and differences between the networks of the different players in the air cargo world, and specifically of integrators. this approach can provide useful insights into expansion opportunities or network re-structure strategies in such a competitive business. 
as example, the analysis of which airport connection, if added to the current network structure, might be more beneficial for the overall connectivity, can be of interest for stakeholders, as well as the effect of a temporary (or permanent) closure of an airport. while the literature is relatively rich of works addressing the network structure of airlines using a passenger perspective (guimera et al., 2005; malighetti et al., 2008; paleari et al., 2010; lordan et al., 2014) , the cargo counterpart is still a fairly unexplored territory, especially when it comes to integrators. to the best of our knowledge, academic papers only focused on basic properties of integrators' networks (bowen, 2012; bombelli et al., 2020) , or were spatially and temporally limited to subsets of the global networks (malighetti et al., 2019a; malighetti et al., 2019b) . this factor is consistent with the difficulty to find reliable and complete data on air cargo operations, where confidentiality and competition are crucial factors. this applies both to passenger airlines offering cargo services and, even more strongly, to integrators (malighetti et al., 2019a; lakew, 2014) . our first contribution fills in this gap. using publicly available data from global aviation data services over a period of eight months, we built global networks for integrators fedex, ups, and dhl, and provided a thorough analysis and comparison of such networks. the second contribution contextualizes the peculiar historical period that coincides with the preparation of this paper. while the covid-19 pandemic inflicted an unprecedented blow on passenger airlines (city am website, 2020), the effect on the cargo industry was evident (accenture website, 2020), yet not so dramatic. as mentioned before, global transport of goods was needed ever more during the pandemic, and lockdown flight restrictions and bans on passengers did not apply with the same severity to cargo schedules. given that the dataset we collected refers to a time-span that covers a pre-and a pandemic period, we analyzed how network characteristics and connectivity evolved with time for the three integrators and, to have a more thorough analysis, for three other airlines relevant from a cargo perspective. the rest of the paper is organized as follows. in section 2, a literature review on network science applied to the cargo industry, and specifically to integrators is provided. section 3 describes the characteristics, assumptions, and limitations of the collected dataset. in section 4, a network analysis in terms of topology and robustness is shown for integrators fedex, ups, and dhl. section 5 describes the effect of the covid-19 pandemic on the network characteristics of different cargo carriers, while section 6 states conclusions and recommendations for future work. the existing literature pertaining integrators mainly addresses two aspects: (i) their business and cost models, and (ii) their network configuration and characteristics. although our work belongs to the second category, we argue that the two categories are strongly intertwined, and hence provide a comprehensive literature review addressing both. as it concerns business and cost models, in (kiesling and hansen, 1993) the cost structure of fedex between the late 80s and early 90s was analyzed, and considerable economies of densities were highlighted. 
it should also be noted that each of the integrators considered in this work has undergone massive changes, acquisitions, and network re-designs in the last thirty years, due to the soaring of the internet and the e-commerce, as highlighted in (morrell and klein, 2018) and (lakew, 2014) . in addition, in (lakew, 2014) the cost structure of fedex and ups was assessed using quarterly data on domestic operations and costs for the years 2003-2011. it was shown that (i) accounting for carrier-specific differences in cost structure and network size, fedex is more cost efficient than ups, and (ii) both integrators display economies of size. the latter result was also confirmed by (onghena et al., 2014) . before analyzing the relevant literature on the network configuration and characteristics of integrators, we provide a general framing of complex network theory. the term complex network refers to those networks whose topological characteristics are non-trivial, with patterns and relationships between nodes that would generally not occur in a randomly generated network (barabási and albert, 1999 , barabási, 2009 , strogatz, 2001 . many systems where hierarchical and community-like structures between elements are present can be modeled as complex networks. systems of this kind are, as example, the internet (cohen et al., 2000) , epidemic spreading models (stegehuis et al., 2016) , and transport networks such as air transport networks (guimera et al., 2005) . our work belongs to this last category. focusing on papers addressing the network configuration of integrators from a quantitative perspective, (kuby and gray, 1993) analyzed the fedex network. in contrast to the general research on huband-spoke systems, where it is assumed every node has a direct connection to the hub, it was shown that in the fedex network most routes to the main hub make one or more stopovers. the paper explores the trade-offs and savings involved with stopovers and feeders, and evaluates the optimality of the fedex network using a mixed integer linear programming formulation. more recently, (bowen, 2012) provided a comparison of the network structures of fedex and ups with the network structure of american airlines and southwest using complex network theory indicators. although this work is the first one, to the best of our knowledge, to provide such analysis, we believe the comparison between passenger airlines and integrators' networks might not be totally appropriate. in fact, while in the former demand is generally symmetric, in the latter there is a high imbalance in demand and many routes are unidirectional (and sometimes dubbed as "triangular"). as such, modeling integrators' networks with undirected edges and, as a consequence, using network indicators that do not consider directionality of connections (as the γ index in the paper) might lead to biased results. this issue was addressed in (malighetti et al., 2019a; malighetti et al., 2019b) and (bombelli et al., 2020) , where the analysis of integrators' networks was carried out considering the directionality of connections. in (malighetti et al., 2019a) and (malighetti et al., 2019b) the authors focused, respectively, on the european and asian network structure of fedex, ups, dhl, and tnt (the analysis covers a time-period prior to the fedex acquisition), which are based on a limited temporal dataset of one week. 
in both works, the different strategies of the main integrators were highlighted, with dhl focusing more on efficiency and exhibiting the highest centralization and the lowest density, and fedex and ups showing a higher density and transitivity and a lower centralization. in (bombelli et al., 2020) , the authors relied on a temporally larger dataset and focused on the global networks of fedex, ups, and dhl. after some preliminary considerations on the different networks, that confirmed the different characteristics of the networks at the global scale, most of the complex network analysis addressed the global air cargo transport network that also includes passenger airlines. since all the different networks were merged, integrator-specific insights were not traceable any longer. hence, in comparison to the existing literature on integrators, the contribution of this paper is twofold as already anticipated in section 1. first, we provide a topological analysis of integrators' networks that is both spatially and temporally more complete. second, we add a robustness analysis that, after addressing the general structure of the networks, focuses on time-dependent variations due to the covid-19 pandemic. in this section, we provide a thorough description of the dataset used in the paper. as highlighted in other works (malighetti et al., 2019a; malighetti et al., 2019b) , data on integrators' schedules is scarce and difficult to retrieve as a stand-alone product. to circumvent this issue, we have been collecting integrator-specific data from public sources for a time period of eight months. in particular, we retrieved data from global aviation data services flightaware 1 and flightradar24, 2 which report for all airports in their database departures of the previous 14 days and 7 days, respectively. we used flightaware as the main data source, and used data from flightradar24 to enrich our dataset by adding flights that might not have been included in the flightaware database. since the process was carried out either via the dedicated api, with a limited set of requests (flightaware), or manually (both data services), we pre-selected a set of 336 airports deemed relevant from a cargo perspective (i.e., with an annual cargo throughput in 2014 greater or equal to 5000 t (meijs, 2017) ) to limit the data retrieval process effort. all the main and second-tier hubs of the three integrators were considered, as well as other airports with a non-negligible yearly cargo throughput. as will be highlighted later in this work, this airport filtering step can have the undesired effect of ruling out some lower-tier airports used by the integrators. notwithstanding, we believe our approach to provide a good trade-off between computational effort and faithful network representation. we also acknowledge that, while the dataset could be more extensive, cargo operations generally rely on a smaller set of involved airports when compared to the passenger counterpart. because of the extensive freight ground transportation network, the catchment area for cargo transport can increase up to 10 times with respect to the catchment area for passengers (boonekamp and burghouwt, 2017) . the list of the considered airports is available as part of our online dataset (https://data.4tu.nl/repository/ uuid:2e9b04dd-70fe-4f16-abd4-873be4b2c4b1), while their geographical location is depicted in fig. 1 . 
in the rest of the paper, we will generally refer to specific airports using their iata code unless, for sake of clarity, the full name is preferable. we collected data on dates november 20th 2019, december 3rd 2019, december 16th 2019, december 30th 2019, january 13th 2020, january 27th 2020, february 14th 2020, march 2nd 2020, april 6th 2020, april 27th 2020, may 11th 2020, may 26th 2020, june 18th 2020 and will name each of these thirteen data retrieval blocks an observation for the rest of the paper. we additionally retrieved data from flightradar24 seven days prior to each observation, so that two consecutive data retrieval processes from this data source would match the time-span of each observation from flightaware. given our notation, each observation's retrieval date refers to the end of the time-period that observation covers. as example, an observation with date april 27th, 2020 refers to the time-period april 13th-april 27th, 2020. the first six observations cover consecutive 14-day periods. apart from negligible temporal holes, due to the fact we did not always start the data retrieval process at the same time, this means that roughly half of our dataset covers 84 consecutive days. this is important to ensure that triangular routes that were flown infrequently are considered. since the process was carried out manually and each data retrieval is generally lengthy, the other seven observations were retrieved with time-intervals ranging from 17 to 25 days. nothwithstanding the presence of more pronounced temporal holes in this case, the addition of these observations is important to account, at least partially, for seasonality effects. in fact, our dataset contains the peak season (november and december) and another above average month (march), below average months (january, february, and april), and average months (may and june). while we acknowledge that seasonality effects can only be fully accounted for with a complete year of data, we believe our dataset to be sufficient, especially when compared to the existing literature, to provide useful insights into integrators' global networks. all thirteen observations were used both to build the networks described in section 4, and to create time-series in section 5 to analyze the effect of covid-19 on cargo capacity and network indices. overall, our dataset covers 182 days between the second half of 2019 and the first half of 2020. for each observation, we created 336 distinct (airport,date) tuples, each containing departures from a specific airport in the 14-day period culminating in the date specified in the tuple. an extract from the data associated with the (hong kong international airport (hkg), april 27th, 2020) tuple is reported in table 1 . each entry is characterized by a flight code, an aircraft code, a destination airport, a departure time, and an estimated time of arrival (eta). each observation was then split into specific origin-destination (od) airport pairs. having 336 airports and thirteen observations, a maximum number of 13 ⋅ 336 ⋅ 335 = 1,463,280 distinct tuples characterized by a unique (od airport pair,date) was generated. for each tuple, flights specific to fedex, ups, and dhl were searched using a list of integrator-specific airlines and aircraft types. we focused on the flight code column of each tuple to identify integrator-specific flights, and looked for airline codes as follows: the aforementioned list highlights a different business strategy between dhl and the other two integrators. 
in fact, while fedex and ups mainly rely on their own fleet, dhl relies on a vast set of partner airlines that are owned/co-owned and generally fly under the dhl livery, as shown in fig. 2 . while most of the listed airlines operate solely for dhl, other airlines might be offering part of their capacity to other freight forwarders. as example, while dhl severely relies on polar air cargo services, especially for u.s.-asian routes, this airline might offer part of its cargo capacity to other forwarding companies such as db schenker or kuehne nagel. as a consequence, we might overestimate the overall capacity for the dhl network. we tried to mitigate this effect as much as we could, for example choosing only aircraft type, tail number (when available), and od airport pairs combinations we knew had a high chance to be uniquely flown for dhl. fedex and ups rely on other airlines as well, but to a much lesser extent or for contingency reasons that cannot be easily traced and recognized in our dataset. fedex relies on a fleet of asl airlines ireland atr 42-300f and atr 72-200f turboprobs for local cargo transport, as example within canada. ups has an agreement with western global airlines to sub-contract five mcdonnell douglas md-11f aircraft for up to 30 days a year for temporary volume spikes. for each integrator, we focused on the aircraft types listed as either belonging to their own fleet, or to subsidiary airlines. we considered all narrow-body and wide-body full freighters, and did not consider turboprops, because they contribute less significantly to the overall capacity and generally only operate in a point-to-point manner between local airports. in appendix a, the full list of aircraft is provided, with model, aircraft code, and maximum transportable payload (in tonnes). the maximum payload, as section 4 will reveal, is a crucial measure in this work, because it is used to compute the theoretical maximum cargo capacity between od airport pairs. we also want to highlight two characteristics of the air cargo network that are strongly related to our modeling choice: 1. full freighters seldom flight at full weight capacity. lacking data on average load factors per od airport pair, we believe that using the maximum theoretical weight capacity is a good indicator of the relevance of an od airport pair connection 2. some flights might be volume-bounded (malighetti et al., 2019b) rather than weight-bounded, which means that their maximum volume capacity is reached before their weight capacity (i.e., they are "low density" flights in jargon). this might be especially true for hightech commodities such as tvs. notwithstanding, we decided to focus on weight capacity because we believe it to be easier to quantify. we conclude the dataset analysis by describing how the maximum transportable tonnage per od airport pair connection and integrator was computed. given, for each od airport pair, the subset of (od airport pair,date) tuples containing recorded flights for that connection and a specific integrator, we summed the transportable payloads of all aircraft involved to determine the maximum weight capacity. as example, if the analysis of the thirteen (a-b,date) tuples provided as cumulative outcome for a specific integrator 50 boeing b747-400f and 75 boeing b747-800f, then the weight capacity of the a-b connection for that integrator was computed as 50 times the payload of a b747-400f plus 75 times the payload of a b747-800f. 
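the capacity aggregation just described can be sketched in a few lines of code: for every origin-destination pair, the maximum payloads of all recorded flights are summed to obtain the available freight tonnes (aft). the aircraft labels and payload values below are illustrative assumptions only; the paper uses the full list of aircraft and payloads reported in appendix a.

```python
from collections import defaultdict

# assumed, illustrative payloads in tonnes (the paper's appendix a has the real table)
MAX_PAYLOAD_T = {"B747-400F": 112.0, "B747-8F": 134.0, "B767-300F": 52.0}

# (origin, destination, aircraft) records, e.g. parsed from the observations
flights = [
    ("HKG", "ANC", "B747-8F"),
    ("HKG", "ANC", "B747-400F"),
    ("ANC", "CVG", "B767-300F"),
]

# sum the maximum payload over all flights of each directed od pair
aft = defaultdict(float)
for origin, destination, aircraft in flights:
    aft[(origin, destination)] += MAX_PAYLOAD_T[aircraft]

for (o, d), tonnes in sorted(aft.items()):
    print(f"{o}->{d}: {tonnes:.1f} t")
```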
for sake of clarity, we want to highlight that in the paper we will be using interchangeably the terms od airport pair and route to represent the direct connection between two airports. as such, and not having data on preferred cargo itineraries between airport pairs, weight capacity represents the maximum estimated payload that can be transported along the direct route o-d, where o might not be the initial origin and d might not be the final destination for some cargo.

we begin this section with an overview of the methods that will be used to model and compare integrators' networks in section 4.1. then, in section 4.2 we provide a thorough overview of the topological properties of the fedex, ups, and dhl networks, and conclude with a robustness analysis in section 4.3.

we modeled each integrator's network as a directed graph $G = (\mathcal{N}, \mathcal{E})$, where $\mathcal{N}$ is the set of nodes (airports) and $\mathcal{E}$ is the set of directed edges (connections between od airport pairs). in a graph, edges can be unweighted or weighted. in the first case, they are all assigned a unitary value; in the second case, the weight of each edge should be representative of the relevance of such edge within the graph. in this work the weight of each edge is the maximum transportable tonnage capacity, as underlined in section 3. an unweighted directed graph can be represented in compact form with an adjacency matrix $A \in \{0,1\}^{N \times N}$, where $a_{ij} = 1$ if a directed edge connects nodes $i$ and $j$. all the nodes $j$ that are directly reachable from node $i$, i.e., $a_{ij} = 1$, are its neighbors. if the graph is weighted, $A$ is replaced with the weight matrix $W$, where $w_{ij}$ represents the weight of the directed edge connecting $i$ and $j$. the indegree of node $i$ is $k_i^{in} = \sum_j a_{ji}$ and represents the number of nodes directly connected to node $i$ (i.e., it is the inflow of node $i$). the outdegree of node $i$ is $k_i^{out} = \sum_j a_{ij}$ and represents the number of nodes reachable from node $i$ (i.e., it is the outflow of node $i$). the summation of the two defines the degree of node $i$, i.e., $k_i = k_i^{in} + k_i^{out}$. for an airport, the degree represents the number of od airport pairs for which the current airport is either the destination (indegree) or the origin (outdegree). if we define $n(k)$ as the number of nodes in the network with a degree equal to $k$, the cumulative degree distribution $P(k) = \frac{1}{N}\sum_{k' \geq k} n(k')$ expresses the ratio of nodes in the network with a degree greater than or equal to $k$. the strength of node $i$ is defined as $s_i = \sum_j (w_{ij} + w_{ji})$, i.e., it is the summation of the weights of all edges departing from/arriving to node $i$. using available capacity as the weight, the strength of an airport provides an estimate of its potential cargo throughput. the normalized local clustering coefficient of node $i$ in a directed graph (fagiolo, 2007) is defined as $c_i = \frac{[(A + A^{T})^{3}]_{ii}}{2\,[k_i (k_i - 1) - 2 k_i^{\leftrightarrow}]}$, where $k_i^{\leftrightarrow} = \sum_j a_{ij} a_{ji}$ counts the bilateral edges of node $i$. the normalized betweenness centrality (freeman, 1977) is $g_i = \frac{1}{(N-1)(N-2)} \sum_{j \neq i \neq k} \frac{\sigma_{jk}(i)}{\sigma_{jk}}$ (3), where $\sigma_{jk}$ is the number of shortest paths between nodes $j$ and $k$, and $\sigma_{jk}(i)$ is the number of those paths passing through node $i$. the normalization term in eq. (3) accounts for the fact that paths starting or ending in $i$ are not considered. this index correlates the relevance of a node in a network with the frequency with which the node appears in shortest paths that neither start nor end in $i$. if an airport is characterized by a high betweenness centrality, it is an important transshipment hub for cargo. focusing on network-specific indices, some of them are averages of node-specific indices.
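before moving to the network-level indices, the node-level indices defined above can be computed as in the following sketch, which uses networkx on a small toy digraph whose edge weights play the role of aft; the airports and weights are illustrative, and networkx's directed clustering coefficient follows fagiolo's definition used here.

```python
import networkx as nx

# toy weighted directed graph; weights stand in for aft
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("MEM", "CDG", 500.0), ("CDG", "MEM", 450.0),
    ("MEM", "ANC", 300.0), ("ANC", "HKG", 280.0), ("HKG", "ANC", 320.0),
])

for i in G.nodes:
    k = G.in_degree(i) + G.out_degree(i)  # total degree k_i
    # strength s_i: sum of weights on incoming and outgoing edges
    s = G.in_degree(i, weight="weight") + G.out_degree(i, weight="weight")
    print(i, "k =", k, "s =", s)

# local clustering coefficient for directed graphs (fagiolo, 2007)
print(nx.clustering(G))
# normalized betweenness centrality on the unweighted digraph
print(nx.betweenness_centrality(G, normalized=True))
```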
as example, the average degree is $\langle k \rangle = \frac{1}{N} \sum_i k_i$, while the average clustering coefficient, also known as the global clustering coefficient (watts and strogatz, 1998; fagiolo, 2007; opsahl and panzarasa, 2009), is $\langle c \rangle = \frac{1}{N} \sum_i c_i$. note that we use the notation $\langle a \rangle$ to indicate the arithmetic mean of the vector $a$. we define the characteristic path length of a network as $\langle l \rangle = \frac{1}{N(N-1)} \sum_{i \neq j} d_{ij}$ (4), where $d_{ij}$ is the number of steps from node $i$ to node $j$. note that, given eq. (4), we assume an unweighted formulation. we define the diameter $d = \max_{i,j} (d_{ij})$ as the longest shortest path in the graph. a component of a graph is a subset of the graph where there exists a path between each pair of nodes belonging to the component. the giant component $G_c$ is the component with the highest number of nodes. if a path exists between every node pair in the graph, the graph itself is the giant component.

we begin our topology analysis by highlighting the basic characteristics of the three networks, as summarized in table 2. for each integrator, the number of nodes is equivalent to the cardinality of the subset of the 336 airports that appeared in at least one (od airport pair, date) tuple. the same concept was applied to determine the number of edges. similarly to (malighetti et al., 2019a), we defined the maximum transportable tonnage per edge available freight tonnes (aft), and the product of each aft and the geodesic distance of the associated od airport pair available freight tonnes kilometer (aftk). the three networks are graphically visualized in fig. 3. note that, for sake of visual clarity, we plotted edges as undirected; as such, the thickness of each edge is proportional to the cumulative aft characterizing the od airport pair connection in both directions.

the dhl network is the most developed in terms of number of nodes and edges, probably due to the set of auxiliary airlines that operate under its livery. this more extensive network does not translate into a higher cargo capacity, as the overall aft and aftk testify. in terms of overall capacity, fedex outperforms both ups and dhl. dhl performs better than ups in terms of aftk because of its considerably higher number of connections. analyzing the density, i.e., the ratio between the edges and the potential number of edges of the network (that is, $N(N-1)$ for a directed graph), fedex and ups are characterized by a comparable value, with dhl being a sparser and more concentrated network. all three values of reciprocity, i.e., the ratio between the number of node pairs connected in both directions and the number of node pairs connected in at least one direction, are low. this justifies the modeling assumption of relying on a directed graph to model demand and flow imbalances in the cargo network. the global clustering coefficient of fedex is higher than the ones characterizing ups and dhl, meaning that airports in the fedex network are clustered more closely. the three global clustering coefficients, paired with the small values of $\langle l \rangle$, ensure that the three networks are small-world networks. this means that most airports are not neighbors of one another, but at least a subset of the neighbors of any given airport are likely to be neighbors of each other. in addition, most airports can be reached from every other airport by a small number of hops, generally using hubs as pivot nodes. in terms of diameter, fedex still emerges as a more compact network. note that both $\langle l \rangle$ and $d$ were computed with respect to the giant component of each network, to avoid numerical errors.
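a table-2-style summary of these network-level indices can be computed as in the sketch below. whether the giant component is taken as strongly or weakly connected is an assumption made here (the strongly connected variant matches the "path between each pair of nodes" definition given above), and the toy graph and function name are illustrative only.

```python
import networkx as nx

def network_summary(G: nx.DiGraph) -> dict:
    # giant component taken as the largest strongly connected component (assumption)
    giant_nodes = max(nx.strongly_connected_components(G), key=len)
    Gc = G.subgraph(giant_nodes).copy()
    return {
        "nodes": G.number_of_nodes(),
        "edges": G.number_of_edges(),
        "density": nx.density(G),
        # networkx's reciprocity is edge-based, close to the pair-based ratio described above
        "reciprocity": nx.reciprocity(G),
        "global_clustering": nx.average_clustering(G),
        "char_path_length": nx.average_shortest_path_length(Gc),  # unweighted <l>
        "diameter": nx.diameter(Gc),
        "giant_component_size": Gc.number_of_nodes(),
    }

G = nx.DiGraph()
G.add_edges_from([("MEM", "CDG"), ("CDG", "MEM"), ("MEM", "ANC"),
                  ("ANC", "MEM"), ("ANC", "HKG"), ("HKG", "ANC")])
print(network_summary(G))
```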
as table 2 shows, for all three networks the giant component $G_c$ does not coincide with the full network. we investigated this behavior, and found for each integrator a small number of airports with a unitary degree. the graph being directed, this means that those airports behave only as sinks (unitary indegree) or sources (unitary outdegree). having collected enough data, temporally-wise, to cover possible seasonal or infrequent routes, we attributed this fact to our initial set of airports: most likely, some lower-tier airports that appear in the same circular routes as the unitary-degree airports were omitted. a refinement of the initial set of airports is one of the first directions to pursue as part of our future work.

to provide more tangible insights into the role of airports and od airport pair connections, for each integrator we list the top-five airports according to degree, strength, and betweenness centrality. we computed betweenness using two approaches, unweighted betweenness $g$ and weighted betweenness $g_w$. in the unweighted formulation, every edge is treated equally and assigned a unitary cost. for the weighted formulation, we used a heuristic approach to translate each aft into a proper cost. for each integrator, we identified the minimum and maximum aft (resp. $a_m$ and $a_M$) over the whole network, and associated with the two values a maximum and minimum cost (resp. $c_M$ and $c_m$). we wanted costs to decrease for increasing values of aft, to represent the easiness of use of high-capacity routes over low-capacity routes. we used a linear function in the form $c_i = c_M + (a_i - a_m)\frac{c_m - c_M}{a_M - a_m}$ (5), where $c_i$ is the cost of an edge whose aft is $a_i$.

the analysis of table 3 reveals, not surprisingly, that the main hubs of each integrator are generally at the top of the table regardless of the index chosen: for fedex, their global hub memphis international airport (mem); for ups, their worldport worldwide hub louisville international airport (sdf); for dhl, their european hub leipzig/halle airport (lej) and american hub cincinnati/northern kentucky international airport (cvg). for each integrator, other major hubs appear at different positions in the table, depending on the index. as example, charles de gaulle airport (cdg), cologne bonn airport (cgn), and kansai international airport (kix) for fedex; cgn, hong kong international airport (hkg), miami international airport (mia), ontario international airport (ont), and philadelphia international airport (phl) for ups; bahrain international airport (bah) for dhl. intuitively, american integrators fedex and ups pair their main domestic hub (mem and sdf, respectively) with a european hub (cdg and cgn, respectively) to cover their second biggest market. european integrator dhl relies on a similar strategy, with the main hub lej paired with the american hub cvg. note that all european hubs are located inside or in close proximity to the so-called "blue banana" industrialized region, which offers a catchment area rich in industries and retailers. on the other hand, all american hubs belong to the midwest, which offers a strategic geographical position especially for domestic connections.

a tangible difference between integrators fedex and ups and integrator dhl can be appreciated if strength and betweenness are analyzed. for fedex and ups, the gap between their main hub (resp. mem and sdf) and the next airports is quite significant both in terms of strength and betweenness ($g_w$ in particular).
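the aft-to-cost mapping of eq. (5) used for the weighted betweenness $g_w$ can be sketched as follows; the cost bounds $c_m$ and $c_M$ and the toy capacities are illustrative assumptions.

```python
import networkx as nx

def aft_to_cost(aft: float, a_min: float, a_max: float,
                c_min: float = 1.0, c_max: float = 10.0) -> float:
    """linear mapping of eq. (5): aft = a_min -> c_max, aft = a_max -> c_min."""
    if a_max == a_min:
        return c_min
    return c_max + (aft - a_min) * (c_min - c_max) / (a_max - a_min)

G = nx.DiGraph()
G.add_weighted_edges_from([("MEM", "CDG", 900.0), ("CDG", "KIX", 300.0),
                           ("MEM", "KIX", 150.0)], weight="aft")

afts = [data["aft"] for _, _, data in G.edges(data=True)]
a_min, a_max = min(afts), max(afts)
for u, v, data in G.edges(data=True):
    data["cost"] = aft_to_cost(data["aft"], a_min, a_max)

# weighted betweenness g_w uses the derived costs as shortest-path weights
print(nx.betweenness_centrality(G, weight="cost", normalized=True))
```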
the strength of their top airport is more than one order of magnitude greater than the strength of the next airport, while roughly 60% of all shortest paths pass through their main hubs, with the percentage dropping to 40% for the second airport. this behavior is consistent with a network characterized by a single "tophub". on the other hand, dhl seems to be a hybrid version of a huband-spoke network, as already highlighted in (malighetti et al., 2019a; bombelli et al., 2020) , where lej, cvg, and to a slightly lesser extent hkg share the control of the network. both strength and weighted betweenness confirm this hybrid system for the dhl network, where no top airport clearly outperforms the others, resulting in more balanced cargo capacities between geographical regions. fig. 3 provides hints in this sense as well, with very imbalanced flows for fedex and ups (towards/from mem and sdf), and more balanced flows in the dhl network. a similar conclusion can be inferred analyzing table 4 , where the five od airport pairs characterized by the highest aft are listed per integrator. both for fedex and ups, all five connections either start or end in the main hub, while for dhl hubs lej, cvg, and hkg all appear. focusing on fedex and ups, every connection is domestic. as (bowen, 2012) pointed out, despite the integrators' internationalization, the routes with the highest capacity in both networks remain overwhelmingly domestic. this consideration also fosters a more methodologically-oriented question. fedex, ups, and dhl are obviously international in nature, yet characterized by a different market share depending on the region of interest. hence, are we comparing companies serving the same markets, but with a different network structure, or are discrepancies in the network structure caused by the different market shares? to answer this (very relevant) question, we would need to dive more into business models and economic factors that go beyond the scope of this paper. in addition, given the door-todoor nature of integrators, we might need to combine the air transport network with the ground transport one to get a better picture. this is another research direction this paper does not aim to address. we conclude the discussion on hubs and cargo flows with fig. 4 . here, we reported the aft between the first eight airports to appear in the sorted list of busiest od airport pair connections of each integrator. we represented such aft using chord diagrams. outgoing capacities of each airport are plotted radially, and occupy a portion of the circumference that is proportional to their value. every capacity flow between airports is represented by an arc, whose thickness is also proportional to its value. since each airport is mapped with a different color, the color of each arc is the color of the airport characterized by the positive imbalance of the flow. as example, in fig. 4(a) , the arc between mem and lax is purple because the lax-mem route has a higher aft than the mem-lax connection. to corroborate our previous findings, in fig. 4 (a) and fig. 4 (b) the flows are dominated by a single airport, which in both cases accounts for roughly one third of the overall aft exchanged between the airports. in the dhl case ( fig. 4(c) ), there is a stronger balance between airports. before providing some insights into robustness, we analyzed more into details the degree distribution of the three networks using histograms and plotted the resulting cumulative degree distribution of each network in fig. 5 . 
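the cumulative degree distribution plotted in fig. 5(d) can be obtained as in the following sketch: for each value of $k$, the fraction of airports whose total degree is greater than or equal to $k$. the random toy graph stands in for an integrator network.

```python
import networkx as nx
import numpy as np

def cumulative_degree_distribution(G: nx.DiGraph):
    """return degrees k and the fraction of nodes with degree >= k."""
    degrees = np.array([d for _, d in G.degree()])
    ks = np.arange(1, degrees.max() + 1)
    pk = np.array([(degrees >= k).mean() for k in ks])
    return ks, pk

G = nx.gnp_random_graph(50, 0.08, directed=True, seed=1)
ks, pk = cumulative_degree_distribution(G)
for k, p in zip(ks[:5], pk[:5]):
    print(f"P(k >= {k}) = {p:.2f}")
```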
in the histograms (fig. 5(a), (b) , and (c)), we also highlight the iata code of the three airports characterized by the three highest values of degree. all three histograms are strongly rightskewed, with the frequency of nodes with a degree equal to k steeply decreasing as k increases. this behavior is consistent with a hub-andspoke system that all three integrators rely on, even with different nuances. we want to highlight the fact that the three histograms, without additional information, are extremely similar and might lead to the (wrong) conclusion that the networks of fedex, ups, and dhl are extremely similar as well. together with the number and distribution of connections, their strength and the likeliness of a node to appear in shortest paths (betweenness centrality) need to be investigated to fully reveal the features of a network. without these additional indices, as example, the multi-hub nature of the dhl network might have been harder to spot. in fig. 5(d) , the cumulative degree distribution p k ( ) of the fedex, ups, and dhl networks is reported. notwithstanding the aforementioned differences between the three networks, the trend of p k ( ) is very similar, and follows a truncated power law consistently with other air transportation networks (guimera et al., 2005; lordan et al., 2014) that confirms scale free properties. in a scale free network, the cumulative degree distribution follows a (truncated) power law. this means that the number of nodes with an extremely high degree is generally higher than what would be expected from a normal distribution. on the other hand, well-connected (i.e., where k ≃ 〈k〉) nodes are much more common in networks whose degree distribution follows a normal distribution rather than a power law. the presence of nodes with an extremely high degree, i.e., the hubs, is a distinctive feature of air cargo networks. as such, fig. 5 (d) confirms, using a network science perspective, one of the underlying operational rules of integrators. in this section, we took a step further and performed a robustness analysis of the three integrators' networks. in fact, all the analyses of section 4.2 rely on a static network, where all nodes and edges are fixed. here, although the definition of network robustness is not univocal, we adopted the same strategy as in (guimera et al., 2005) and (lordan et al., 2014) , and simulated an ad-hoc disruption by an intruder with knowledge of the characteristics of the network. in particular, the intruder focuses on a specific index, and sequentially attacks (which, from a network perspective, reads as "removes") the airport whose selected index is the maximum, i.e., the airport whose removal should be, in principle, most catastrophic. note that this is a dynamic approach, since the removal of each airport, together with all the incoming and outgoing edges, modifies the topology of the reduced network. referring back to table 3 , an intruder attacking the fedex network and focusing on degree, would choose mem as the first airport to eliminate. given the reduced network that does not contain mem, cdg might not be the new airport with the highest degree because of the modified topology. in this context, we first eliminate a node (i.e., an airport), and as a consequence all the edges entering or exiting such node. in real applications, the two steps might be reversed and still lead to the same outcome. 
as example, during the covid-19 pandemic some airports were entirely closed by governments in order to better control air transport (direct elimination of a node). for other airports, the set of connections might have been dramatically reduced, or even completely eliminated, as an indirect consequence of the closure of the aforementioned airports (indirect elimination of a node). this approach will be shown in more detail in section 5 when analyzing time-dependent network characteristics for the different airlines. to compute the disruption severity after each airport removal, we monitored the normalized size of the giant component, defined as $s(q) = \frac{|G_c(q)|}{|G_c|}$, where $q$ is the ratio between removed airports and initially available airports, $|G_c(q)|$ is the size of the giant component after the removal, and $G_c$ is the initial giant component of the network. consistently with section 4.2, we focused on the following four node-specific indices as removal strategies: degree $k$, strength $s$, betweenness centrality $g$, and weighted betweenness centrality $g_w$. for each of the three networks, we used as starting network the giant component of the associated integrator (which did not coincide with the overall network) in order to have an initial $s(q)$ equal to 1.

in fig. 6 we report results in a vertical manner, i.e., per integrator, while in fig. 7 we report results in a horizontal manner, i.e., per removal strategy. in all figures, we also provide an inset plot that highlights the $s(q)$ curve for the first fifteen airports removed, to better show how the different networks react when the first (most important) airports are attacked. results in fig. 6 display a high level of consistency across integrators. attacks targeting betweenness centrality and, in particular, weighted betweenness centrality disintegrate the network more abruptly than attacks focusing on other indices. this is due to the transshipment nature of airports characterized by a high betweenness centrality: although they might not be connected to many other airports, they play the crucial role of connecting bridges between airport communities. the best example in this sense is ted stevens anchorage international airport (anc), which plays a key connecting role for cargo flows between asia and north america. focusing on fig. 7, the effectiveness of attack strategies based on $g$ and $g_w$ is confirmed. in fact, these are the two strategies that cause $s(q)$ to drop more steeply. on the other hand, $s$ seems to be the least effective index to target, as the less concave shape of all three curves in fig. 7(b) suggests. note that this is also a natural consequence of the indicator we chose to assess the severity of the disruption: we are basing our results on a connectivity measure (i.e., the size of the giant component), rather than an estimate of the overall cargo capacity capabilities of the remaining network. in the latter case, attacks based on degree or strength would be much more effective because the system would be deprived of the main processing centers. analyzing the different integrators, ups has the least robust network for low values of $q$ across all indices. in particular, when $s$ is considered and $q \simeq 0.07$, the size of the giant component for ups is only 30% of the original size, while the value increases to at least 60% for the other two integrators. we believe the reason lies in the fact that, for ups, high-capacity airports are also crucial transshipment nodes, and hence disruptions focusing on strength implicitly disintegrate the network as well.
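the targeted-attack procedure described above can be sketched as follows: at every step, the node with the highest value of the chosen index is removed, the index is recomputed on the reduced network, and the normalized giant-component size $s(q)$ is tracked. the random toy graph, the function name and the choice of the strongly connected component as giant component are assumptions made for illustration.

```python
import networkx as nx

def targeted_attack(G: nx.DiGraph, index: str = "degree", steps: int = 10):
    """simulate an ad-hoc attack and return (q, s(q), removed airport) per step."""
    H = G.copy()
    n0 = H.number_of_nodes()
    gc0 = len(max(nx.strongly_connected_components(H), key=len))
    curve = []
    for step in range(1, steps + 1):
        if index == "degree":
            scores = dict(H.degree())
        elif index == "strength":
            scores = {i: H.in_degree(i, weight="weight") + H.out_degree(i, weight="weight")
                      for i in H.nodes}
        else:  # unweighted betweenness centrality
            scores = nx.betweenness_centrality(H)
        target = max(scores, key=scores.get)
        H.remove_node(target)
        gc = (len(max(nx.strongly_connected_components(H), key=len))
              if H.number_of_nodes() > 0 else 0)
        curve.append((step / n0, gc / gc0, target))
    return curve

G = nx.gnp_random_graph(60, 0.08, directed=True, seed=3)
for u, v in G.edges:
    G[u][v]["weight"] = 1.0  # dummy aft weights
for q, s, removed in targeted_attack(G, index="betweenness", steps=5):
    print(f"removed node {removed}: q={q:.2f}, s(q)={s:.2f}")
```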
on the other hand, at least for small values of q, dhl seems to have most robust network. to summarize the outcome of fig. 6 and fig. 7 , in table 5 the first six removed airports per removal strategy and per integrator are listed, together with the normalized size of the giant component after their removal. we highlighted in bold the airports whose removal produced a percentile reduction of the giant component greater than 10%. only for ups such drops occurred, for cgn (main european hub), anc, and mia (north american hub). both for fedex and dhl, no such cases occurred. analyzing the inset plots, the degradation of s(q) is more regular, which might be a desirable effect against network attacks. dhl is confirmed to be the most robust network, if the size of the giant component is used as measure. in fact, after the removal of the first six airports, dhl constantly ranks best, with the size of the giant component 10% and 30% larger, respectively, than fedex and ups. at the time of writing, the covid-19 pandemic has caused more than 14.3 million confirmed cases, with about 602,800 deaths (johns hopkins coronavirus resource center website, 2020). more than one third of the global population has been, or still is under a partial or total form of lockdown. the ensuing economic crisis is believed to become the most severe crisis in the last decades. among the most affected industries, transportation is one of the businesses that took the hardest blow. while the covid-19 pandemic brought many passenger airlines to the brink of failure, the cargo industry suffered a blow that, although indisputable (accenture website, 2020), is more difficult to quantify. combination airlines lost most of their belly cargo capacity and are experiencing a slow recovery process. full-cargo airlines and integrators should have been affected to a much lesser extent, if not for the unclarity of travel bans. as example, when united states president donald trump announced the travel ban from europe on march 11th, 2020, he initially stated that prohibitions would also affect trade and cargo, only to tweet shortly later that the "restriction stops people not goods" (forbes website, 2020). relying on a dataset that covers both a pre-and a pandemic phase, in this section we shed some light upon the effect of the covid-19 pandemic and the ensuing bans on integrators' capacity. in particular, in section 5.1 we analyzed aft time-series for the three integrators and three other major airlines. we used this analysis to detect temporal variations in air cargo capacity due, most likely, to disruptions caused by covid-19 and as a first assessment of how the pandemic re-shaped cargo flows. then, in section 5.2 we used a complex network theory approach, and computed for the same set of airlines time-varying network characteristics to assess how the connectivity of cargo networks was affected. although the focus of this paper is on integrators, we decided to consider cargo airlines of different kinds for the analyses presented in this section, to offer readers a more comprehensive study. in particular, we selected the three additional cargo operators: we motivate the choice as follows. cargolux constantly ranks in the top-ten of cargo airlines for freight tonne-kilometres, as well as cathay pacific cargo. although cathay pacific, being a passenger airline, can rely on belly space as well, we decided to focus only on its cargo subsidiary. 
on the other hand, klm is a top-european combination airline journal of transport geography 87 (2020) 102815 in terms of cargo throughput and heavily relies on its belly space. hence, for klm we considered both passenger aircraft and full freighters. for full freighters, we considered both its own fleet, and those operated by martinair holland n.v. (mp). it should be noted that, as far as cargo transport is concerned, klm has a partnership with air france (af) and mp, and cargo operations are carried out in synergy as testified by the name of the joint cargo department afklmp, that is a portmanteau of the three acronyms. while we considered the contribution of mp due to its full freighter-oriented nature, we omitted af to have a more unbiased focus on a single combination airline. as such, we will be using the acronym klmp to represent the combined network of klm and mp. for the three additional airlines, we initially computed their network structure as shown in section 4. details regarding the networks structures are given in appendix b. note that, since our initial choice of airports focused on airports deemed relevant from a cargo perspective, the generated klmp network is missing several airports that are only relevant from a passenger perspective (i.e., airports serving vacation-oriented regions, remote islands, etc). given the nature of this work, this shortcoming was considered negligible. to better characterize the time-series, we also selected five dates that we considered relevant. they are in chronological order: 1) december 31st, 2019 -chinese health officials inform the world health organization about a cluster of 41 patients with a mysterious pneumonia. most are connected to huanan seafood wholesale market, 2) january 11th, 2020 -the first death caused by covid-19 is recorded in china, 3) january 31st, 2020 -united states president donald trump bans foreign nationals from entering the united states if they were in china within the prior two weeks, 4) march 11th, 2020 -united states president donald trump bans all travel from 26 european countries, and 5) may 11th, 2020 -several countries (such as spain, iran, and italy) begin to ease their lockdown restrictions. for the three integrator, we focused on cargo capacities along major connections and generated time-series using the aft associated to each observation. in fig. 8, 9 , and 10 aft time-series for fedex, ups, and dhl are respectively reported. all three integrators display a spike between mid and late december that is consistent with the well-known peak season. if we analyze the trend in february, a substantial difference exists between fedex and the other two integrators. in fact, while aft of fedex remained roughly leveled for all od airport pairs, consistent drops occurred for ups and dhl. for ups, the anc-sdf line experienced a 30% decrease in aft. for dhl, the hkg-anc and anc-cvg lines, i.e., the two major legs of the asian export line to their american hub, experienced a strikingly similar 80% decrease. these drops in capacity correspond to the observation that was retrieved on february 14th, 2020, right after the travel ban to the united states for foreign nationals who were in china in the previous two weeks. we believe that the initial uncertainty regarding the ban (e.g., whether a pilot of a full freighter scheduled to fly from china to the united states would be exempted or not), caused the aforementioned drop. 
interestingly, after mid february aft for all od airport pairs, on average, either stabilized around the pre-pandemic level or grew to journal of transport geography 87 (2020) 102815 even surpass the december's peaks. a striking example is the steady growth of the anc-sdf line for ups. we also performed a network-wide analysis of the three integrators. for each of them, we constructed three networks, each network being generated following the same routine describe in section 3, but using only a single observation as input. in particular, the three observations we used were january 27th 2020, february 14th 2020, march 2nd 2020, i.e., the three observations associated with the drop in aft. although we previously claimed that a 14-day time-span might not be sufficient to characterize an integrator network, our focus here was mainly on major connections (e.g., flows from asia to the united states) that are flown regularly. hence, comparing networks built using a 14day time-span was deemed reasonable. for each integrator, we plotted the percentile difference between the aft of an observation and the previous one for all od airport pairs characterized by a cargo flow in both observations. the color of each connection is proportional to the percentile difference, with blue colors identifying a strong increase and red colors a strong decrease in atf, respectively. for od airport pairs served in both directions, for plotting purposes we used the average between the two percentile differences as the value representative of the connection. in each colorbar, we limited the upper bound to an increase of 320% and the lower bound to a decrease of −100% to have a clear transition between shades. in fig. 11 and 12 we report the changes in aft between early february and late january 2020, and between late and early february 2020, respectively, for fedex. the same output is shown for ups in fig. 13 and 14, and dhl in fig. 15 and 16 . tables 6 and 7 report, respectively, the five od airport pairs characterized by the highest decrease and increase in aft between observations; only od airport pairs with a maximum value of 800 t or more (between the two observations) are reported in the tables. consistently with figs. 8, 9, and 10, the comparison between early february and late january highlights a strong decrease in aft from north east asia (nea) airports towards the united states, which seem too abrupt to be only a seasonal effect. the effect seems to be most severe for ups and dhl rather than for fedex. in most of the top entries of table 6 , nea airports appear as the origin or destination of the affected cargo flow. the transshipment role of anc for flows from nea to mainland united states (and vice versa) is also highlighted by its presence in eight out of fifteen od airport pairs. the majority of connections with increased aft is represented by intra-continental and intranational routes, mostly between hubs of the associated integrator. examples in this sense are pvg-kix (fedex) and mia-sdf (ups). moving to the comparison between late and early february, the mia-bog connection appeared for all three integrators as one of the connections with the highest percentile decrease. we believe this decrease to be due to valentine's day flower export from colombia (air cargo news website, 2020) rather than an effect of covid-19. more in table 5 first six removed airports according to degree k, strength s, unweighted betweenness centrality g, and weighted betweenness centrality g w for fedex, ups, and dhl networks. 
general, the situation depicted in table 7 is specular with respect to what is shown in table 6 . aft between nea and united states increased by several orders of magnitude, such as tpe-anc for fedex (+133%), szx-anc for ups (+1000%), and ord-anc for dhl (+978%).the crucial role of anc is highlighted by its inclusion in twelve out of fifteen od airport pairs. temporal differences in aft of od airport pairs relying on anc can also be appreciated comparing we then focused on the three other airlines. we provide results, limited to the time-series format, in fig. 17, fig. 18, and fig. 19 respectively. notwithstanding differences in average aft and number of od airport pairs served, the similarity with the trends noticed for ups and dhl is striking. for cathay pacific cargo, the connection between anc and hkg suffered a decrease of 30% in both directions. for cargolux, the hkg-anc connection suffered an even more substantial drop of 85%, while the connection between novosibirsk tolmachevo airport fig. 9 . time-series depicting aft along major routes for ups. fig. 10 . time-series depicting aft along major routes for dhl. journal of transport geography 87 (2020) 102815 (ovb) and their main hub luxembourg airport (lux) experienced an aft decrease of 50%. interestingly, two new routes appeared from mid february onwards. the first route is lux-ovb, that in the last observations is the one with the highest aft. an explanation for the new prominence of ovb is likely related to cargo carriers preferring resting stops for trans-eurasian route in russia rather than china, to avoid the risk of being stranded for unexpected travel bans (the loadstar website, 2020). the second route connects the second european hub milano malpensa airport (mxp) with their main hub. for klmp, we noticed how aft decrease relatively later, i.e., between late march and early april (see as example the mia-ams and ams-vcp route in fig. 19) . differently from the other airlines, the passenger (and more european) oriented nature of klmp resulted in reduced capacities once european airports closed most of their intra-and inter-european passenger connections. on a similar note, the same capacities regained values comparable (yet still considerably lower) to the pre-pandemic phases once lockdown restrictions were relaxed (line 5 in fig. 19 ). the analyses carried out in section 5.1 highlighted geographicallyspecific variations in network capacity and the resilience of cargo networks to recover from drops in available capacity. on the other hand, information on how the pandemic affected the connectivity of the fig. 11 . aft percentile difference between early february and late january 2020 for fedex. different networks could be guessed, but not explicitly quantified. to this avail, in this section we provide a complex network theory analysis mapping the temporal evolution of connectivity indices for the three integrators and the three other airlines. in particular, in section 4 we generated the cargo networks considering the entire set of thirteen observations to mitigate seasonal effects. following a different approach, in this section we will generate a cargo network for each observation, using the same procedure shown in section 4, in order to highlight unusual seasonal effects. we want to stress the relevance of the word unusual, since we expect to highlight some seasonal effects, such as a spike in available capacity and served connections during the holiday season. 
what we are looking for are unforeseen outliers that are most likely imputable to the pandemic. for the three integrators and the three other airlines, the temporal evolution of five indices is presented: network-wide aft, -number of edges |ℰ|, -size of the giant component g c , -average degree 〈k〉, and characteristic path length 〈l〉. in fig. 20 the network-wide aft for fedex, ups, dhl, cathay pacific cargo, cargolux, and klmp are presented. for this specific plot, we report the time-series of both klm and mp, together with their cumulative time-series representing klmp, to better understand how the two sub-networks performed during the pandemic in terms of available capacity. for the three integrators, an expected spike in available capacity is detected between mid and late december. the spike is particularly pronounced for fedex and ups. on the other hand, the strong decrease in available capacity from/to nea in early february that was shown in section 5.1, seems to have caused a tangible effect at the network-wide level only to dhl and, with a more pronounced note, to cathay pacific cargo and cargolux. for klmp, a considerable drop in available capacity occurs in mid march, where lockdown restrictions in many european countries severely affected airports' operations. note that the drop is caused by the sudden unavailability of passenger aircraft, and hence is caused by the klm network, while operations for the mp network remain roughly constant throughout the time-horizon. journal of transport geography 87 (2020) 102815 given the different order of magnitude in aft between integrators and klmp, an inset plot focused solely on klmp is also provided in fig. 20 . analyzing the inset plot, a reduction greater than 50% in available aft for the klm network is evident. a similar trend can be observed in fig. 21 , where the number of edges |ℰ| is reported. all integrators are characterized by an increase in available connections during the peak season, with little to none decrease effect in february. on the contrary, both cathay pacific cargo and cargolux experienced a 10-15% decrease in available connections in the same time-frame. as it concerns klmp, a 50% decrease was experienced in march, similarly to what shown for aft. this is clearly due to the partial or total closure of airports worldwide for passenger traffic. journal of transport geography 87 (2020) 102815 full network (generated using all thirteen observations) as described in section 4. as such, for all networks the reported values will range between 0% and 100% to provide a result that is easier to compare and interpret. analyzing fig. 22 , it can be appreciated how the three integrators' networks proved to be more robust in tackling the pandemic, with cargo and cargolux experienced a drop in the size of the giant component, with the former being quicker in recovering. the most interesting trend is the one of klmp, with a sudden drop from 90% to 45% due to the unavailability of belly space. an upward trend is also visible due to countries easing their lockdown restrictions and slowly reopening airports. it should also be noted that we did not set a minimum capacity threshold and artificially removed airports whose strength (i.e., aft level) was lower than the threshold. the dramatic reduction in the size of the giant component is solely due to the lack of passenger connections airports were subject to during the lockdown. we conclude our discussion with the analysis of fig. 23 and fig. 
24 , where the average degree 〈k〉 and the characteristic path length 〈l〉 are respectively reported. consistently with the previous plots, integrators are characterized by a higher connectivity during the peak period, while 〈k〉 fluctuates around a constant value otherwise. the average degree is roughly constant for cathay pacific cargo and cargolux as well, apart from a decrease in mid february that might have been caused by the reduction in capacity from/to nea airports. for klmp, a decrease of roughly 50%, consistently with the other indices, was experienced. the decrease in average connections per airport resulted in a spike in 〈l〉 for klmp (fig. 24) : some direct connections were lost and more transshipment stops were needed as a consequence. for the other temporal evolution for the six airlines. carriers, a decrease in 〈l〉 for cargolux is evident, that is consistent with the increase of 〈k〉 due to the adoption of pandemic-induced new routes as shown in section 5.1. for fedex, ups, dhl, and cathay pacific cargo, the effects of the pandemic on 〈l〉 seem negligible. although the network analysis presented in this section focused on the short-term effects of the covid-19 pandemic on cargo networks, some general conclusions can be drawn. given the different trends we highlighted for network indices such as global strength (network-wide aft), number and quality of connections, and connectivity (size of the giant component), we argue that: 1. integrators might be the great winners in this unforeseen set of circumstances (suau-sanchez et al., 2020) . covid-19 is potentially re-orienting some airports towards cargo as part of the growing importance of e-commerce, that during the pandemic saw a surge in usage rate. the forced quarantine led to unprecedented increases in purchases of the following categories: medical (+500%), baby products (+390%), food & beverage (+150%) (big commerce website, 2020), just to cite a few examples. integrators were the only cargo airlines that, apart from capacity fluctuations during the most uncertain period of the pandemic, maintained a capacity level comparable, if not even higher, than the pre-pandemic level. integrators' networks proved to be both robust and resilient. the robustness is confirmed by the fact that network indices such as network-wide aft, number of connections |ℰ|, and size of the giant component g c were marginally affected (section 5.2). their resilience is evident given their capability, thanks to flexible schedules for full freighters, to quickly rebound from momentarily losses in available capacity among major od airport pairs (section 5.1) 2. on the other hand, passenger airlines heavily relying on belly space journal of transport geography 87 (2020) 102815 for cargo services might be the great losers. while we focused on a single combination airline in the paper, and hence our results might not be universally valid, the klmp network proved to be somehow resilient (see recovery trend for all network indices), but not robust. the drop in network-wide aft and g c as a result of lockdown measures was extremely dramatic. notwithstanding the fact that the air cargo business is generally of secondary relevance for combination airlines, they might be reconsidering the recent shift towards belly space utilization (potentially phasing-out full freighter aircraft). the trend, motivated by the extensive passenger network, proved to be a double-edged sword. 
while some airlines, including klm, temporarily reconverted some of their passenger aircraft to "cargo-in-cabin" aircraft (klm website, 2020), this is clearly not a long-term solution. as the passenger air network is much less robust to pandemic-induced disruptions than the cargo counterpart, combination airlines might have second thoughts before phasing-out full freighters (given their crucial role in case of a new pandemic wave, or of other unforeseen disruptions affecting the passenger network), if they plan to remain competitive in the air cargo business. in this paper, we provided a thorough analysis of the network structure of integrators fedex, ups, and dhl, using historical data from public sources and estimated cargo weight capacity between airports to model each network. we considered networks as directed to model the strong flow imbalances and triangular routes that characterize cargo networks. our results show that fedex owns the most developed network in terms of overall capacity, but dhl is more developed in terms of airports and connections. this factor can also be attributed to the different business strategy of dhl, which heavily relies on a set of airlines operating under its livery, differently from fedex and ups. in addition, while fedex and ups, although they do rely on a vast set of secondary hubs, seem to be based on networks designed following the classic "hub-and-spoke" paradigma, the structure of the network of dhl is more hybrid and steers towards a "multi-hub" system. related to the previous point, we analyzed the robustness of the three networks under different node attack strategies, and used the size of the giant component (i.e., the cardinality of the set of nodes of the network that are all connected to each other) as a measure of robustness. we found out that the dhl network is more robust, with no node removal (among the first six removals) that reduces the size of the giant component more than 10% and a resulting final percentile size of the giant component greater than fedex or ups. we also want to highlight that our definition of robustness is based on a specific quality measure that focuses more on connectivity properties rather than the strength, measured as remaining overall capacity, of the network. hence, it cannot be claimed that the dhl network is the most robust from a universal perspective. given that, at the time of writing, the covid-19 pandemic is still affecting supply chains worldwide, we also performed a time-series analysis to assess how capacity along some major od connections was affected, and how network-specific indices changed at different stages of the pandemic. in order to have a more comprehensive perspective, we also included three other relevant airlines: two full-cargo airlines and a major combination airline. we noticed a steep decrease in available capacity for integrators and full-cargo airlines between north east asia and the united states and europe in early february, i.e., right after the united states issued their travel ban from china. in early march the situation reversed and capacities recovered and even surpassed their nominal values especially for integrators. this factor testifies how integrators' networks are resilient and capable of quickly adapting to disruptions. we also proved that integrators' networks are robust, since network-wide indices did not show major changes during the pandemic. the same cannot be argued for the combination airline we considered. 
although signs of a slower-paced resilience are irrefutable, its network is not robust and connectivity properties were severely affected during the pandemic. we used this result to argue that the inclination of some combination airlines towards belly space, rather than full freighters, might be re-evaluated considering the likelihood of a new pandemic wave and the relevance of cargo services for those airlines. although we believe this work to be a solid first step towards a better understanding of (i) the global network structure of integrators, and (ii) the effect of covid-19 on cargo flows, we are also aware that it can be improved and extended in several ways by exploring some additional research directions. as example, provided the availability of a broader dataset in terms of airports (or a way to quickly gather data), the lower-tier airports that were omitted in this work could be included to better model low-capacity routes. another interesting addition is the analysis of the network structure of amazon air. at the time of writing, the fleet of amazon air is still under development, and hence we deemed its inclusion in this work to be premature, but in one year from now time should be ripe. although the business model of amazon air is slightly different than the other integrators (in the sense that, on top of being an integrator, it also directly sells the goods that are being transported and delivered), airport slot capacity and competition issues might influence the network strategy and configuration of the other integrators as well. as example, the main hub of amazon air will be cincinnati/northern kentucky international airport that, coincidentally, is the main american hub for dhl. we do believe the introduction in the cargo game of such a huge player is worth a more extensive analysis. the last research direction is related to the covid-19 pandemic. in this paper, our analysis focused on short-term changes in network capacity and connectivity indices due to the pandemic. similarly to the aforementioned point made for amazon air, it would be interesting to follow the evolution of airline networks over time and assess whether covid-19 caused more permanent changes (e.g., creation of new routes, re-structuring of the connections to/from hubs) in their network. i would like to thank dr. bruno f. santos and prof. lori tavasszy, both from delft university of technology, for our discussions on the cargo business that partially shaped this research. i also would like to thank the three anonymous reviewers for their comments and extremely careful reviews, that greatly improved the quality of this paper. for fedex, ups, and dhl we report in table 8 the list of aircraft we considered, with full name, code, and maximum payload. maximum payload values were obtained either from the integrator's webpage, or from the manufacturer's webpage. since different cargo airlines generally have different unit load device (uld) configurations, maximum payload values for the same aircraft used by different integrators might differ. for cathay pacific cargo, cargolux, and klmp we report the main characteristics of their networks in table 9 , their network visualization in fig. 25 , the five od airport pair connections with the highest aft in table 10 , and a chord diagram depicting major cargo flows between airports in fig. 26 . table 9 cathay pacific cargo, cargolux, and klmp network characteristics. 
air cargo news website scale-free networks: a decade and beyond emergence of scaling in random networks analysis of the air cargo transport network using a complex network theory perspective measuring connectivity in the air freight industry a spatial analysis of fedex and ups: hubs, spokes, and network structure resilience of the internet to random breakdowns recovering from demand disruptions on an air cargo network clustering in complex directed networks a set of measures of centrality based on betweenness the worldwide air transportation network: anomalous centrality, community structure, and cities' global roles integrated air freight cost structure: the case of federal express the hub network design problem with stopovers and feeders: the case of federal express economies of traffic density and scale in the integrated air cargo industry: the cost structures of fedex express and ups airlines robustness of the air transport network connectivity of the european airport network air transport networks of global integrators in the more liberalized asian air cargo industry integrators' air transport networks in europe global air cargo flows estimation based on od trade data moving boxes by air: the economics of international air cargo a translog cost function of the integrated air freight business: the case of fedex and ups clustering in weighted networks a comparative study of airport connectivity in china, europe and us: which network provides the best service to passengers? epidemic spreading on complex networks with community structures exploring complex networks an early assessment of the impact of covid-19 on air transport: just another crisis or the end of aviation as we know it? the loadstar website the motley fool website collective dynamics of "small-world" networks key: cord-343419-vl6gkoin authors: lee, pei-chun; su, hsin-ning title: quantitative mapping of scientific research—the case of electrical conducting polymer nanocomposite date: 2010-07-10 journal: technol forecast soc change doi: 10.1016/j.techfore.2010.06.002 sha: doc_id: 343419 cord_uid: vl6gkoin this study aims to understand knowledge structure both quantitatively and visually by integrating keyword analysis and social network analysis of scientific papers. the methodology proposed in this study is capable of creating a three-dimensional “research focus parallelship network” and a “keyword co-occurrence network”, together with a two-dimensional knowledge map. the network and knowledge map can be depicted differently by choosing different information for the network actor, i.e. country, institute, paper and keyword, to reflect knowledge structures from macro, to meso, to micro-levels. a total of 223 highly cited papers published by 142 institutes and 26 countries are analyzed in this study. china and the us are the two countries located at the core of knowledge structure and china is ranked no. 1. this quantitative exploration provides a way to unveil important or emerging components in scientific development and also to visualize knowledge; thus an objective evaluation of scientific research is possible for quantitative technology management. 1. mapping knowledge structure 1.1. mapping knowledge structure by bibliometric analysis thomas kuhn popularized the terms "paradigm" and "paradigm shift" [1] . 
dosi investigated technology trajectories on the basis of paradigm shifts and found that continuous innovation can be regarded as proceeding within a technology paradigm, while discontinuous innovation might be the initiation of a new paradigm [2] . many researchers proposed and applied these methodologies to various knowledge fields for understanding the paradigm or the dynamic development of selected knowledge fields [3] . the methodology that usually is used for this purpose is bibliometric analysis on the basis of literature publication metadata and information. for example, kostoff has very complete and systematic studies on literature-related analysis and published a series of papers based on combination of text mining and statistics on scientific papers. he also proposes a systematic literature-related discovery method for linking two or more literature concepts that have heretofore not been linked, in order to produce novel, interesting, plausible, and intelligible knowledge [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] . ding et al. mapped information retrieval research by using co-word analysis on papers collected from the science citation index (sci) and social science citation index (ssci) for the period of 1987-1997 [21] . baldwin et al. mapped ethics and dementia research by using keywords [22] . tian et al. used the institute for scientific information (isi) database to measure scientific output of the field of geographic information system (gis) by using keywords [23] . similar approaches have been made to map knowledge evolution in other fields, such as software engineering [24] , chemistry [25] , scientometrics [26] , neural network research [27, 28] , biological safety [29] , optomechatronics [30] , bioeletronics [31] , adverse a "structure" defines what makes up a system. a "structure" is a collection of inter-related components or services [67] . the more concrete way of describing a structure is "network" where the two main components of a network, 1) network actors and 2) network ties, correspond respectively with "components" and "inter-relationship among components" in a structure. therefore applying network theory to understanding knowledge structure should be feasible if network actors and network ties can be welldefined. we previously investigated the knowledge structure of patented technology for the same field of "electrical conducting polymer nanocomposite" by integrating network theory and patent citation information [68] . in this case, patent is the network actor and citation is the network tie. in this study, we apply network theory on scientific papers to draw the knowledge structure of "electrical conducting polymer nanocomposite". to clearly depict the range within the boundary so that an objective definition of a selected scientific area can be widely accepted, a group of core keywords is believed to be essential. those keywords should be filtered out from the literature that represents this field. the second question is "what is the knowledge structure of science"? after a group of core keywords are retrieved from literature publications, any level of research unit, i.e. author, institute, or country, which contains the obtained keywords, can be used as network actors. then the concurrence of keywords is used to establish the relationships among network actors. 
accordingly, the knowledge structure of science can be drawn since the two basic requirements for establishing a network structure, network actors and network ties, have been met. the basic components of a social network can be different forms of social actors, i.e. individuals, institutions, or countries. a social network formed on the basis of social exchange can be used for understanding how resources are exchanged, how social actors are positioned to influence resource exchange, and which resource exchange is important [69] [70] [71] . each resource exchange is a social network or a "tie" maintained by social actors at both ends of the "tie". the strength of a tie is a function of the number of resources exchanged, the type of exchange, the frequency of exchange, or even how close the two connected actors are [72] . the characteristic of a social network is its small-world phenomenon. according to watts' definition-"small" means that almost every element of the network is somehow "close" to almost every other element, even those that are perceived as likely to be far away [73] . the "small-world" phenomenon can be found not only in social network, but also in biological, and technological systems [73] . social network analysis has also been applied to the creation of information networks, but the applications can be justified only if the small-world phenomenon can be discovered in the information network. callon et al. [74] suggested interactive processes, mixing both cognitive and social aspects of knowledge or technology. both actors and interactions can usually be described by texts, and specifically, by words. thus knowledge development can be described through keyword network development. newman reviewed structure and function of complex networks, and argued for both the word matching network possesses and the obvious characteristics of small-words [75] the research of cancho [76] also justified the small-world phenomenon in english word networks. watts and strogatz and watts contributed to expansion of the small-world concept, from conventional neuroscience and bio-information systems to any natural or human system that can be modeled by a network [77, 78] . motter et al. constructed a conceptual network from the entries in a thesaurus dictionary, considering two words connected if they express similar concepts. he argued that language networks exhibit the small-world property as a result of natural optimization, and these findings are important not only for linguistics, but also for cognitive science [58] . based on the hypothesis that human language possesses small-world phenomenon, which has been supported previously [74] [75] [76] 79] , the purpose of this study aims to shed light on the combination of social network analysis and bibliometric analysis on the field of electrical conducting polymer nanocomposite by using different publication information, e.g. keyword, author names, research institute, and country, as actors in network, in order to understand the knowledge structure of the selected field. the network actors and network ties that correspond to publication information and keyword co-occurrence, respectively, can be visualized. thus the dynamic knowledge evolution can be mapped. furthermore, network properties of networks created in this study can be calculated to obtain quantitative analysis of knowledge evolution [80] . 
this research integrates social network analysis and bibliometric keyword analysis to draw a picture for the development of science, which can be called a "science map". hence each country, research institution, or researcher that contributes to the literature can be positioned. processes for this research method are: 1) literature retrieval and filtering; 2) keyword revision and standardization; 3) visualization of three-dimensional network; 4) network properties calculation; and 5) visualization of two-dimensional knowledge map. web of science (sci and ssci) literature database is used for paper retrieval. search strategy is: nano* in topic and composite* in topic and polyme* in topic and conduct* in topic, document type is set as "article" for retrieving journal papers as our research target. a total of 3382 papers are obtained from web of science database. in the 3382 papers, only 2240 papers contain author keywords. the top 10% papers (223 highly cited papers which receive citations ≧ 24) are selected for following quantitative mapping analysis. due to the fact that different words can be used for describing the same concept, it is necessary to standardize words that are used to express the same concept. for example, 1) singular or plural words are standardized to singular form; 2) cnt is standardized to carbon nanotube; and 3) modeling and modeling are standardized to modeling, etc. networking of keywords is based on sufficient relations among keywords. a relation is presented as a "network tie". this study provides two methods of generating network ties. 1) the relation between two different papers occurred because these two papers share at least one keyword. a network generated by this method is defined as rfp network (research focus parallelship network. 2) the relation among plural keywords occurred because these keywords are listed in the same paper. a network generated by this method is defined as kco network (keyword co-occurrence network). detailed explanation for these two methods is as follows: 1) rfp network (research focus parallelship network): the relation between two different papers occurred because these two papers share at least one same keyword. for example, paper is used as a network actor (network node) and any of two actors sharing one same keyword will be linked. this is based on an assumption made in this study that keyword represents the core research of a paper. any two papers sharing the same keyword implies that these two researches are partially overlapping in an area that can be represented by that keyword. the two papers are thus regarded as a pair of parallel papers and the constructed network is defined as an rfp network. however, network node is not necessarily the paper. it can also be other actors which carry knowledge, e.g. paper (first author), research institute, or country. the three types of rfp networks are summarized as: • rfp-country network: research focus parallelship network with country as the network actor • rfp-institute network: research focus parallelship network with research institute as the network actor • rfp-paper network: research focus parallelship network with paper (first author) as the network actor. in this study, rfp-country network, rfp-institute network, rfp-paper network are investigated in order to understand knowledge structure at macro, meso, and micro levels, respectively. 
2) kco network (keyword co-occurrence network): the relations of author keywords are formed because author keywords specified by authors are listed in the same paper. hence, the actual keyword is used as the network actor. author keywords listed in the same papers are linked together because they are all terms that can be used to represent the core concepts of a research paper and strong relations to each other can be expected. • kco network: keyword co-occurrence network. in this study, kco network is investigated in order to understand co-occurrence of keywords in papers at micro level. computer software is used to visualize rfp network and kco network and then network properties are calculated. in social network theory, centrality is used to estimate the influence of actors. centrality as an indicator can be used to understand to what degree an actor is able to obtain or control resources. brass and burkhardt indicated network centrality is one source of influence from the viewpoint of organizational behavior, and a person with higher centrality in an organization is always the one with higher influence [81] . freeman suggested three methods of centrality measurement for a network: 1) degree centrality, 2) betweenness centrality, and 3) closeness centrality [82] . network properties are calculated by the above three methods in order to understand the influencing power of country, research institute, and paper. networks constructed in this research are undirected networks because no in-and-out concepts, e.g. causal relation, position difference, flow, or diffusion, existed behind any linked network actors in this study. network nodes (actor) which directly linked to a specific node are neighborhood of that specific node. the number of neighbors is defined as nodal degree, or degree of connection. granovetter suggested that nodal degree is proportional to probability of obtaining resources [61] . nodal degree represents to what degree a node (actor) participates the network; this is a basic concept for measuring centrality. the concept of betweenness centrality is a measure of how often an actor is located on the shortest path (geodesic) between other actors in the network. those actors located on the shortest path between other actors are playing roles of intermediary that help any two actors without direct contact. actors with higher betweenness centrality are those located at the core of the network. g jik g jik g jk shortest path between actor j and actor k g jik the shortest path between actor j and actor k that contains actor i the closeness centrality of an actor is defined by the inverse of the average length of the shortest paths to/from all the other actors in the network. higher closeness centrality indicates higher influence on other actors. shortest path between actor j and actor i in this study, two-dimensional maps are also obtained by calculating relative positions and density of network actors on the basis of the previously constructed network. these are "two-dimensional knowledge maps" since they directly reflect the fundamental structure of knowledge. the algorithm used in this study is proposed by van eck and waltman's in 2007 [83] . 1) actor position: the positions of network actors in the map are based on visualization of similarities. if there are totally n actors, a two-dimensional map where the actor 1-n are positioned in a way that the distance between any pair of actor i and j reflects their association strengths a ij as accurately as possible, i.e. 
distance between i and j is proportional to a ij , van eck and waltman's algorithm is used to minimize a weighted sum of the squared euclidean distance between all pairs of actors, the objective function to be minimized is given as below: where the vector x i = (x i1 ,x i2 ) denotes the location of actor i in a two-dimensional space and ‖ • ‖ denotes the euclidean norm. 2) actor density: actor density at a specific location in a map is calculated. the actor density is calculated by first placing a kernel function at each actor location and taking a weighted average of the kernel function. the actor density at location x = (x 1 , x 2 ) is given by where k denotes a kernel function and h denotes a smoothing parameter. c ii denotes the number of occurrences of actor i and x = (x 1 ,x 2 ) denotes the location of actor i in the map. the kernel function k is a non-increasing gaussian kernel function given by: table 1 . the chinese academy of science is the institute that publishes the largest number of paper (126 papers), then the indian institute of technology (63 papers), then tsin hua university (43 papers). six out of the top 10 institutes are chinese institutes, indicating significant contributions from china (table 2) . for research areas analysis, most of the papers belong to material science and polymer. table 3 shows top 10 subject areas are mainly chemistry, physics, or material related in both science and engineering. table 4 shows the top 10 journals in terms of number of papers. the journal of applied polymer science has 199 papers and ranked no. 1, then polymer (132 papers), and synthetic metals (103 papers), etc. papers are classified by research institution, and any two research institutions with the same keyword are linked together. a total of 142 network actors and 415 network ties are obtained and shown in fig. 3 . bigger circles which indicate higher centralities can be observed in the central part of fig. 3 . many of these high centrality institutes are chinese, e.g. chinese academy of science, nanjing university, chinese academy of engineering physics, huaqiao university, hong kong polytechnic university, tsing hua university, etc. some are not chinese, e.g. brazil's federal university of parana, singapore's nanyang technology university, etc. brazil's federal university of parana has the highest degree centrality, and chinese academy of science has the highest betweenness centrality and closeness centrality. 3) rfp-paper network: any two actors (first author/paper) with the same keyword are linked together. a total of 223 network actors and 512 network ties are obtained and shown in fig. 4 . the highest centrality paper is authored by zarbin from brazil. a large number of high centrality papers are authored by chinese, e.g. deng, chen, zheng, lau, feng, and yu. deng has three papers and two of them are ranked in the top 10 centralities. the obtained rfp-paper network not only serves as evidence for understanding whether a paper is positioned as a "hub" or "edge", but also provides potential for diverse applications, e.g. identifying who is a potential partner/competitor, and a potential reviewer or expert. each keyword is treated as a network actor; keywords within the same papers are linked together. a total of 482 network actors generating 671 network ties were originally created, but the large number of network actors and ties lead to a highly dense network structure which is hard to interpret visually. 
therefore, we select the most important 77 network actors which degree centralities are equal to or larger than 4 for constructing kco network, a total of 77 network actors and 220 network ties are shown (fig. 5) . actors with higher centrality and thicker ties form the major backbone of this knowledge field. fig. 5 shows the backbone actors are nanocomposite, composite, conducting polymer, carbon nanotube, polyaniline, and polypyrrole. these are together with some other important actors, e.g. multi-wall carbon nanotube, electrical property, caron nanofiber. these 77 keywords can be categorized into structure, property, compound, process, and application, and are directly or indirectly connected to the backbone actors. carbon nanotube has surprisingly high centrality, indicating its critical role in this field. by evaluating connectivities among actors globally, the rationale why two actors are linked maybe obtained; hence it is possible to create scenarios for a particular actor. for example, the analysis of actors connected to carbon nanotube provides some findings: is 1) carbon nanotube is the most important nano-scaled structure in the field, 2) carbon nanotube possibly helps reinforce mechanical property and electrical property, and 3) there is the potential to apply electrical conducting nanocomposite polymer on biosensor or supercapacitor with the aid of carbon nanotube. calculation methods based on three network centralities, i.e. degree centrality, betweenness centrality, closeness centrality, are used to calculate network properties to understand network actors' relative position in a network. for rfp-country network, countries with top ten network properties are listed in table 5 . china has the highest centrality and then usa, brazil, france, germany or korea, etc. the number of papers that each country contributes to this field is different; however, it is easy to anticipate that countries with more papers tend to have more linkages to other countries because of their larger number of papers. more papers mean more opportunities to create linkages to other actors. therefore countries with more papers are anticipated to have higher centrality and are thus being positioned at the core of the network. accordingly, countries with more papers shown in table 1 are mostly consistent with countries with higher centrality calculated in table 5 . however, there are some exceptions. for example, india is ranked number 3 in terms of number of papers, but not in the top 10 centralities ranking. brazil is ranked outside the top 10 of numbers of papers, but ranked top 3 for its high centrality. this might have something to do with the emergence of china and india which seek to improve their global competitiveness by increasing their research quantities or qualities. for rfp-institution network, research institutions with top ten network properties are listed in table 6 . research institutions with the highest centralities are federal university of parana, chinese academy of science, nanjing university, chinese academy of engineering and physics, huaquio university, etc. the three types of top 10 centralities for institutes show chinese institutes dominate this field. only five institutes in table 4 are not chinese, i.e. brazil's federal university of paraná, singapore's nanyang technological university, canada's university laval, us's national institute of aerospace, and korea's seoul national university. brazil's federal university of paraná is surprisingly high and ranked no. 1 or no. 2. 
for rfp-paper network, first authors with top 10 centralities are shown in table 7 and are mostly chinese authors except for 1) zarbin_ajg (brazil's federal university of paraná) who is ranked no. 1 in three types of centralities, and 2) mclachlan_ds (us's national institute of aerospace) who is ranked no. 6 in degree centrality, no. 5 in betweenness centrality and no. 4 in closeness centrality, 3) shi_gx (canada's university laval) who is ranked no.7 in degree centrality and no. 10 in betweenness centrality, and 4) dalmas_f (france's national institute of applied sciences) who is ranked nos. 6 and 8 in closeness centralities. for kco network with keyword as network actor, the keywords with top 20 centralities are listed in table 8 . due to the research target "electrical conducting polymer nanocomposite" set in this study, keywords with higher network centralities are expected to be lexically or conceptually related to this core research area. this is why composite, conducting polymer, conductivity, electrical conductivity, electrical property, nanocomposite, nanostructure, and polymers can be observed in table 8 . however, some keywords which we were not previously aware of, for example, nanotubes, carbon nanotube, single-wall carbon nanotube, multi-wall carbon nanotube, polyaniline, polyethylene oxide, and polypyrrole, provide insights of what relates to our core area, and can be used to characterize the technology trajectory of electrical conducting polymer nanocomposite. in table 8 , the core area related keywords can be categorized into 1) structure: composite, nanocomposite, nanostructure, polymers, nanotubes, carbon nanotube, single-wall carbon nanotube, and multi-wall carbon nanotube, 2) property: conducting polymer, conductivity, electrical conductivity, and electrical property, and 3) compound: polyaniline, polyethylene oxide, and polypyrrole. these three categories reflect important implications that should be strongly considered in understanding the development context of "electrical conducting polymer nanocomposite". polyaniline, polyethylene oxide, and polypyrrole are the table 5 top 10 centralities countries. degree centrality betweenness centrality closeness centrality 1 china china china 2 usa usa usa 3 brazil brazil brazil 4 france france france 5 korea germany germany 6 germany taiwan korea 7 singapore korea taiwan 8 taiwan italy england 9 england england singapore 10 india india italy nanjing_univ nanjing_univ nanjing_univ 4 chinese_acad_engn_&_phys hong_kong_polytech_univ huaqiao_univ 5 huaqiao_univ tsing_hua_univ chinese_acad_engn_&_phys 6 nanyang_technol_univ chinese_acad_engn_&_phys tsing_hua_univ 7 hong_kong_polytech_univ nanyang_technol_univ univ_laval 8 tsing_hua_univ huaqiao_univ seoul_natl_univ 9 univ_laval natl_inst_aerosp natl_inst_aerosp 10 natl_inst_aerosp natl_univ_singapore nanyang_technol_univ three most important compounds. carbon nanotube, particularly single-wall or multi-wall carbon nanotube, are the critical nanostructures involved in the field. the constructed two-dimensional maps (figs. 6-9) provide a quick way for human eyes to perceive knowledge structure of electrical conducting polymer nanocomposite with country, institution, paper (first author), and keyword as actors to facilitate different levels of observations. 1) country as actor: fig. 6 illustrates the country knowledge map where all these countries are uniformly distributed everywhere in this map. 
this indicates a nice international collaboration; based on which, each country finds its particular way to contribute different knowledge to this field. the more uniform distribution of actors in the map implies higher efficiency for the knowledge to be developed. in fig. 6 , the distribution of countries is pretty uniform but still we can find two isolated islands 4) keyword as actor: fig. 9 shows a y-shaped continent and two separated islands denoted as "membrane" located on the left and "electrorheology" located on the middle bottom. the y-shaped big continent refers to the major trend of electrical conducting polymer nanocomposite. but "membrane" island and "electrorheology" island are conventionally associated with traditional polymer. this is why both are small isolated islands separated from the big y-shaped continent featured by modern nano related technology. to understand how keyword components are positioned to construct the big continent, the big continent can be further magnified to understand more detailed components. "electrical conducting polymer nanocomposite" has gradually become an important research field which requires a systematic analysis of its knowledge structure. this study integrates social network analysis and keyword analysis to investigate knowledge structure of "electrical conducting polymer nanocomposite." the purpose is to examine systematically fundamental components underlying this research field investigated differently in different regions of the world. in summary, this study proposes four types of three-dimensional networks based on co-occurrence of keywords for full spectrum analysis on scientific papers, i.e. rfp-country network, rfp-institute network, rfp-paper network and kco network. these reflect knowledge structures on macro, meso, micro, and also micro levels, respectively. a total of 482 keywords contained in 223 highly cited papers (number of received citations ≧24) have been analyzed in this study. three-dimensional networks and two-dimensional maps are quantitatively and visually created to describe the knowledge structure. keywords such as nanocomposite, carbon nanotube, polyaniline, conducting polymer, composite, polypyrrole, multi-wall carbon nanotube, electrical conductivity, etc. are important components of the backbone knowledge structure of electrical conducting polymer nanocomposite. also, china, us, and brazil are countries located at the core of the structure. conventional bibliometric analyses on most research fields for the purpose of performance evaluation usually show that the us is ranked no. 1, followed by either japan or europe which easily rank no. 2 or 3. however, in the field of electrical conducting polymer nanocomposite, the networks and maps constructed in this study indicate that china is ranked no. 1. however, if we compare several publication performance indicators used in table 9 and fig. 10 , the number of highly cited paper for both us and china are both equal to 55. similarly the median citations per paper (41 for the us and 42 for china) indicate that the statistics are not dominated by a few highly cited papers. citation analysis which is usually used as an indicator of paper quality shows that average citations received by highly cited paper are 62.7 (citations/paper) for the us and 47.4 (citations/paper) for china. the us is higher and china is lower than the global average (54 citations/paper), and each us paper receives 15 more citations on average than chinese papers. 
in addition, the most highly cited paper of the us has been cited 319 times, while the most highly cited paper of china has only been cited 98 times. us papers have greater impact on later publications; so there is no doubt that the us has more advanced developments in electrical conducting polymer nanocomposite. all of this suggests that the us should not be second to china. these observations are different from keyword-based network analysis in this paper. the discrepancy between the findings obtained from keyword-based network analysis and the findings from highly cited paper comparison ( fig. 10 and table 9 ) is due to the different mechanism used for evaluating scientific papers. the higher centrality of china than the us obtained in this study has more to do with the keyword linkages involved in the network. china has better accessibility to other countries by way of keyword linkages and is therefore ranked no. 1 in network centralities. also, china has a larger number of papers (896 papers) than the us (787 papers); although its total impact is not comparable to the us. in summary, the lower citation impact but higher keyword linkage phenomenon of china implies that china has more general research interests which significantly overlap those of other countries. us is still dominating advanced technology, but china currently surpasses other countries by the quantity of research. it is expected that the number of chinese papers and the citations received by chinese papers will both increase as rapidly as the pace of china's economic growth. the two-dimensional knowledge maps (figs. 6-9) provide the basis for a quick and careful, though still limited, comparison of competitiveness. fig. 9 provides the basis for understanding which concepts are fundamental building blocks in electrical conducting polymer nanocomposite. by the use of figs. 6-9, researchers can understand how a country, institution, or paper (first author) can be positioned in the knowledge map. the knowledge maps obtained quantitatively allow other potential quantitative applications, e.g. 1) r&d resource allocation, 2) research performance evaluation, 3) understanding of future research opportunity, and 4) potential collaborator or competitor identification. the knowledge maps created in this study provide a quick way for international benchmarking or potential partnership identification. hsin-ning su is an associate researcher of science and technology policy researcher and information center, national applied research laboratories, taiwan. he received ph.d. in material science and engineering from illinois institute of technology and m.s. in chemistry from national taiwan university. his research interests are science and technology policy, innovation system, social network analysis, knowledge evolution, science map, bibliometric and patent analysis, aiming to understand evolutionary mechanism of sci-tech development by interdisciplinary approaches and contribute to national level technology management. 
a co-author of this study is an assistant researcher at the science and technology policy research and information center, national applied research laboratories, taiwan, and is currently visiting spru, university of sussex, uk.
key: cord-282035-jibmg4ch authors: dunbar, r. i. m. title: structure and function in human and primate social networks: implications for diffusion, network stability and health date: 2020-08-26 journal: proc math phys eng sci doi: 10.1098/rspa.2020.0446 sha: doc_id: 282035 cord_uid: jibmg4ch
the human social world is orders of magnitude smaller than our highly urbanized world might lead us to suppose. in addition, human social networks have a very distinct fractal structure similar to that observed in other primates. in part, this reflects a cognitive constraint, and in part a time constraint, on the capacity for interaction. structured networks of this kind have a significant effect on the rates of transmission of both disease and information. because the cognitive mechanism underpinning network structure is based on trust, internal and external threats that undermine trust or constrain interaction inevitably result in the fragmentation and restructuring of networks. in contexts where network sizes are smaller, this is likely to have significant impacts on psychological and physical health risks. the processes whereby contagious diseases or information propagate through communities are directly affected by the way these communities are structured. this has been shown to be the case in primates [1-3] and has been well studied in humans in the form of epidemiological [4] and information diffusion (opinion dynamics or voter) models [5] . the ising phase state model (originally developed to describe the magnetic dipole moments of atomic spin in ferromagnetism) has been the workhorse of most of these models, and of many of the models currently used to calculate the value of the r-number (or reproduction rate) that drives current covid-19 management strategies. most early models were mean field models that assumed panmixia. however, human social networks are highly structured and small world: most people interact with only a very small number of individuals whose identities remain relatively stable over time.
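as a purely illustrative aside (not one of the models cited in this paper), the following python sketch contrasts a simple discrete-time sir epidemic run on a fully mixed random-contact population with the same process run on locally clustered, small-world contact structures in which each individual keeps a small, stable set of ties; all parameter values, the choice of networkx, and the graph constructions are assumptions made only to make the contrast concrete.

```python
# illustrative sketch only (not the models cited in the text): a discrete-time
# sir process run on three contact structures with the same mean degree, to show
# how confining interactions to a small, stable, clustered set of contacts slows
# propagation. all parameter values here are arbitrary assumptions.
import random
import networkx as nx

def sir_cumulative_incidence(g, beta=0.05, gamma=0.1, steps=60, seed=42):
    """fraction of nodes ever infected after a fixed number of time steps."""
    rng = random.Random(seed)
    state = {n: 'S' for n in g}
    state[next(iter(g))] = 'I'                      # seed a single infection
    for _ in range(steps):
        nxt = dict(state)
        for n in g:
            if state[n] == 'I':
                for nbr in g.neighbors(n):          # contact only along network ties
                    if state[nbr] == 'S' and rng.random() < beta:
                        nxt[nbr] = 'I'
                if rng.random() < gamma:
                    nxt[n] = 'R'
        state = nxt
    return sum(s != 'S' for s in state.values()) / g.number_of_nodes()

n, k = 2000, 10
graphs = {
    'clustered ring lattice (local ties only)': nx.watts_strogatz_graph(n, k, 0.0, seed=1),
    'small world (a few long-range shortcuts)': nx.watts_strogatz_graph(n, k, 0.1, seed=1),
    'random mixing (same mean degree)':         nx.gnm_random_graph(n, n * k // 2, seed=1),
}
for label, g in graphs.items():
    print(f'{label}: cumulative incidence = {sir_cumulative_incidence(g):.2f}')
```

run over a fixed horizon, the purely local network typically shows a far smaller cumulative incidence than the randomly mixed one, which is the qualitative point that structured epidemiological models are designed to capture.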
when it became apparent that the structure of networks could dramatically affect the flow of information (or infections) through networks [6, 7] , structure began to be incorporated into epidemiological models [8] [9] [10] [11] [12] . many of the best current models are 'compartmental models' which represent structure by the fact that a community consists of households or other small population units [11, 12] . in effect, these use spatial structure as a proxy for social structure, which has the advantage of ensuring that the models compute easily. in reality, of course, it is people's interactions with each other that give rise to the spatial structure represented by households. while it is true that most (but not all) individuals see and interact with household or family members more often than with anyone else, in fact this dichotomizes what is in reality a continuum of interaction that flows out in ripples from each individual. these ripples create social layers of gradually declining amplitude that spread through the local community well beyond the household. my aim in this paper is to examine the social and psychological processes that underpin natural human sociality in order to better understand how these affect both network structure and the way information or diseases propagate through them. like all monkeys and apes, humans live in stable social groups characterized by small, highly structured networks. individuals do not interact with, let alone meet, everyone else in their social group on a regular basis: a very high proportion of their interactions are confined to a very small subset of individuals. these relationships are sometimes described as having a 'bonded' quality: regular social partners appear to be fixated on each other [13, 14] . the mechanisms that underpin these relationships have important consequences for the dynamics of these networks. i will first briefly review evidence on the size and structure of the human social world. i will then explain how the cognitive and behavioural mechanisms that underpin friendships in all primates give rise to the particular form that human networks have. finally, i explore some of the consequences of this for information and disease propagation in networks, and how networks respond to external threats. humans have lived in settlements only for the past 8000 years or so, with mega-cities and nation states being at all common only within the last few hundred years. prior to that, our entire evolutionary history was dominated by very small-scale societies of the kind still found in contemporary hunter-gatherers. our personal social worlds still reflect that long evolutionary history, even when they are embedded in connurbations numbering tens of millions of people. table 1 summarizes the sizes of egocentric personal social networks estimated in a wide variety of contexts ranging from xmas card distribution lists (identifying all household members) to the number of friends on facebook, with sample sizes varying between 43 and a million individuals. the mean network size varies across the range 78-250, with an overall mean of approximately 154. table 1 also lists a number of studies that have estimated community size in a variety of pre-industrial societies as well as some contemporary contexts where it is possible to define a personalized community within which most people know each other on a personal level. 
these include the size of hunter-gatherer communities, historical european villages from the eleventh to the eighteenth centuries, self-contained historical communes, academic subdisciplines (defined as all those who pay attention to each other's publications) and internet communities. the average community sizes range between 107 and 200, many with very large sample sizes, with an overall mean of approximately 158. (one entry from table 1, for example, is a study of christmas card distribution lists, with a sample of 43 and a mean network size of 153.5 [16] .) the value of approximately 150 as a natural grouping size for humans was, in fact, originally predicted from an equation relating social group size to relative neocortex size in primates before this empirical evidence became available [38] . this prediction had a 95% confidence interval of 100-250, very close to the observed variance in the data. in primates as a whole (but not in other mammals or birds), social group size is a function of neocortex volume, and especially of the more frontal neocortex regions (the social brain hypothesis [39] ). in the last decade, neuroimaging studies of both humans [40-50] and monkeys [51, 52] indicate that the relationship between personal social networks (indexed in many different ways) and brain size also applies within species at the level of the individual, as well as between species. the social brain relationship arises because primates are unusual in that they live in relatively large, stable, bonded social groups [53] . in contrast with the more casual groups (i.e. herds, flocks) of most mammals and birds, the problem with bonded groups is that they are, in effect, a version of the coupled oscillator problem: if animals' foraging and resting schedules get out of synchrony, some individuals will drift away when others go to rest, resulting in the fragmentation of the group [54, 55] . individuals have to be willing to accept short-term costs (mainly in relation to the scheduling of foraging) in order to gain greater long-term benefits (protection from predators by staying together). maintaining spatial coherence through time is cognitively difficult. it depends on two key psychological competences that appear to be unique to the anthropoid primates: the ability to inhibit prepotent actions (a prepotent response is the tendency to take a small immediate reward in preference to waiting for a larger future reward) and the capacity to mentalize. inhibition depends on the volume of the brain's frontal pole [56] , while mentalizing depends on a dedicated neural circuit known as the theory of mind network (also known as the default mode neural network) that integrates processing units in the brain's prefrontal, parietal and temporal lobes [57] , supplemented by connections with the limbic system [41, 42] . the frontal pole is unique to the anthropoid primates [56] ; the default mode network that underpins mentalizing is also common to both humans and monkeys [58] . maintaining group cohesion is not simply a memory problem (though it is commonly misunderstood as such). rather, it is one of predicting others' future behaviour under different conditions (e.g. knowing how others will respond to one's own actions) and being able to gauge the future reliability (trustworthiness) of other individuals [38, 59] . this is much more costly in terms of both neural activity and neural recruitment than simple factual recall [60] .
in humans, the number of friends is directly correlated with mentalizing skills [44, 61] , and mentalizing skills are, in turn, correlated with the volumetric size of the brain's default mode neural network [62, 63] . the latter relationship has recently been shown to extend to primates as a whole [64] . social networks have generally been viewed from two different perspectives. network analysts with a statistical physics background have tended to view them top-down as macroscopic phenomena (i.e. from above, looking down on the spatial distribution of a population of nodes), whereas sociologists have tended to view them from below, as egocentric networks (the individual's experience of that population). on the whole, the first group have tended to focus on large-scale patterns in very large networks, often with an emphasis on triadic closure (heider's structural balance theory [65] ) as the glue that gives structure to a network; the second have focused on the micro-structure of individuals' personal social networks, often focusing on the inner core of intimate friendships immediately beyond the simple triad. an important finding from the second approach has been that networks actually consist of a series of layers that correspond to relationships of different quality [16, 59, 66] . seen from an egocentric point of view, the frequency with which an individual contacts the members of their network does not follow a conventional power law distribution but, on closer inspection, contains a series of plateaux. cluster analyses of very large datasets invariably reveal that these personal networks contain four layers within them (figure 1; panel (a) is reproduced from [19] , and panel (b) shows the optimal number of clusters identified by a k-means clustering algorithm in three online datasets, reproduced from [68] ). this gives the network a layered structure, where individual alters in a given layer are contacted with more or less similar frequency and there is a sharp drop-off in contact frequencies to the next layer. it turns out that, while there is some individual variation, these layers have quite characteristic sizes. moreover, when counted cumulatively, they have a very distinct scaling ratio: each layer is approximately three times the size of the layer immediately inside it (figure 2). this layered structure in figure 2 (referred to as a dunbar graph [67] ) has been identified, with virtually the same numerical layer values, in surveys of network size, the calling patterns in national cellphone databases, science co-author networks and the frequencies of reciprocated postings in both facebook and twitter (table 2). each layer seems to correspond to a very specific frequency of interaction (figure 3), and these frequencies are remarkably consistent across media [66] , suggesting both that they are hardwired and that communication media are substitutable. one way this structure might arise would be if the basal layer of five people represented, for example, a family or household, such that the next layer of 15 consists of three families with an especially close relationship, and the 50-layer beyond that consisted of three of these trios. this pattern, however, is likely to reflect small-scale traditional communities; in more mobile, post-industrial societies, the structure in figure 2 can arise simply as a consequence of the patterns of interaction between individuals and need have no family-based (or spatial) underpinning to it at all.
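a minimal sketch of the kind of layer-detection exercise described above: synthetic contact frequencies for a 150-alter ego network (the band sizes 5, 10, 35 and 100 are taken from the cumulative layers of 5, 15, 50 and 150 mentioned in the text, while the contact rates themselves are invented for illustration) are clustered with k-means, and the cumulative layer sizes and their scaling ratios are reported. scikit-learn's k-means is used simply as a convenient stand-in for the clustering algorithms used in the cited analyses; on data of this shape the recovered cumulative sizes should sit close to 5, 15, 50 and 150, with consecutive ratios of roughly 3.

```python
# minimal sketch, not the analyses cited in the text: a k-means clustering of an
# ego's contact frequencies to recover the cumulative layer sizes (~5, 15, 50, 150)
# and the ~3 scaling ratio described above. the synthetic contact rates below are
# assumptions made for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# synthetic ego network: 150 alters in four frequency bands (5, 10, 35, 100 alters),
# i.e. cumulative layers of 5, 15, 50 and 150
band_sizes = [5, 10, 35, 100]
band_rates = [200.0, 60.0, 12.0, 2.0]        # assumed mean contacts per year per band
contacts = np.concatenate([
    rng.gamma(shape=5.0, scale=rate / 5.0, size=n)
    for n, rate in zip(band_sizes, band_rates)
])

# cluster the log-frequencies into four layers, as in the cited cluster analyses
X = np.log(contacts).reshape(-1, 1)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# order clusters from most to least frequently contacted and report the layer structure
order = np.argsort([-X[labels == k].mean() for k in range(4)])
sizes = [int((labels == k).sum()) for k in order]
cumulative = np.cumsum(sizes)
print('layer sizes      :', sizes)
print('cumulative sizes :', list(cumulative))
print('scaling ratios   :', [round(cumulative[i + 1] / cumulative[i], 2) for i in range(3)])
```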
it is notable, nonetheless, that the same patterns emerge in either case, suggesting that there is an underlying universal constraint on how networks are constructed. this same pattern, with the same layer sizes, has also been identified in a number of topdown analyses of the social (or spatial) organization of human populations (table 2), including hunter-gatherers, the size distribution of irish bronze age stone circles (as estimates of local population distribution), the sizes of residential trailer parks, the structure of modern armies, the size of communities of practice in the business world and even the patterns of alliance formation in massive online multiplayer games (moms). this pattern, with the same scaling ratio, has also been noted in the political organization of a large sample of historical city states [69] . in fact, this layered structure with a consistent scaling ratio was first noted in an analysis of huntergatherer societies in the early 1980s by johnson [70] , who suggested that it was a 'span on control' management solution in response to internal stresses created as grouping size increases. more surprisingly, perhaps, these same numbers reappear in both the distribution of primate social group sizes [71] and in the layered structure of groups for those mammals that live in multilevel social systems (mainly baboons, chimpanzees, elephants and dolphins) [72, 73] (table 2) . animal societies with stable groups do not extend beyond the 50-layer, but all the internal layers are present. the fact that these numbers are so consistent across so wide a range of species perhaps suggests that they may be the outcome of natural scaling effects due to the structural stresses that social groups incur as they grow in size in response to ecological demands. as a result, social evolution in primates [74] occurs as a result of a stepwise increase in group size [71] achieved by bolting together several basal subgroups to create successive layers rather than through a continuous adjustment of group sizes as in most birds and mammals. primate species achieve larger groups by delaying group fission that would normally act as a nonlinear oscillator to keep group size within a defined range around the local mean [81] [82] [83] . the process thus seems to behave more like a series of phase transitions triggered by a natural fractionation process. although, in humans, there is remarkably little variation in both overall network size and layer sizes across samples, irrespective of sample size, sample origin and cluster detection algorithm, nonetheless within populations, there is considerable variation between individuals (figure 1a). some of this variation is due to sex (women tend to have larger inner layers than men [ but smaller outer layers [84] ), some due to age (network and layer sizes are an inverted-jshaped function of age, peaking in the 20s-30s [16, 24, 86] ) and personality (extroverts have larger networks at all layer levels than introverts [85] [86] [87] [88] ). in addition, all human social networks are divided into two components, family and friends. although in small-scale societies, virtually everyone in your network is a member of your extended family, in the post-industrial world with our lower birth rates, the typical network is split roughly 50 : 50 between family and friends [59, 66, 89] . however, it still seems that preference is given to family: those who come from large extended families have fewer friends [17, 66] . 
in effect, family and friends form two largely separate subnetworks that are interleaved through the layers of the network, with a roughly even split in the two innermost 5-and 15-layers, friends predominating in the middle 50-layer and family predominating in the outer 150-layer [59] . the latter seems to reflect the fact that friends are more costly to maintain in terms of time investment than family members [17] , and hence survive less well in the outermost layer (see below). conventional top-down networks tend to focus on the degree of individual ego's, usually with some kind of cut-off to define how many primary contacts an individual has. irrespective of where the cut-off is taken to be, these relationships tend to be viewed as one-of-a-kind. dunbar graphs, by contrast, recognize that individuals have relationships of different quality with many individuals, which might be viewed as primary, secondary, tertiary, etc. relationships. the first will usually be a subset of the second. in this section, i provide a brief explanation of how the primate bonding process works. the main purpose is to stress that it is both complex and time-consuming. this will help explain some of the patterns we will meet in the following two sections where i discuss network dynamics and their consequences. primate social groups are implicit social contracts designed to ensure protection from predators and, secondarily, rival conspecifics through group augmentation selection effects [90] . group-living is often mistaken for a cooperation problem, but it is in fact a coordination problem. cooperation problems invariably involve a public goods dilemma (cooperators pay an upfront cost), whereas a coordination problem does not (you are either in the group or not, and everyone pays the same simultaneous cost) [91] . the problem animals have to solve is how to maintain group stability (i.e. coordination) in the face of the stresses that derive from living in close proximity [81, 83, 92] which would otherwise cause the members to disperse (as happens in herd-and flock-forming species [54, 55] . the primate solution to this problem is bonded relationships, since this ensures that individuals will maintain behavioural synchrony and stay together as a group. in primates, relationships are built up by social grooming [93] . grooming is an exceptionally time-costly activity, with some species devoting as much as one-fifth of their entire day to this seemingly trivial activity [93] . grooming creates a sense of reciprocity, obligation and trust (operationally, a form of bayesian belief in the honesty and reliability of an alter's future behaviour [94] ). the layered structure of human social networks is a consequence of how we choose to distribute our social time around the members (figures 2 and 3). in both monkeys [92] and humans [59, 94] , the quality of a relationship (indexed by its likelihood of producing prosocial support when it is needed) depends directly on the time invested in it. however, the time available for social interaction is severely limited [17, 85, 95] , and this forces individuals to make trade-offs between the benefits offered by relationships of different strength and the costs of maintaining those relationships. the process involved is a dual process mechanism that involves two separate mechanisms acting in parallel. 
one is a psychopharmacological mechanism acting at the emotional (raw feels) level that is mediated by the way social grooming triggers the brain's endorphin system (part of the pain management system); the other is the cognitive element that forms the core of the social brain itself [96] . social grooming is often assumed to have a purely hygienic function. while it certainly has this effect, in primates it has been coopted to form a far more important function in social bonding. the action of leafing through the fur during grooming triggers the endorphin system in the brain [97, 98] via a specialized neural system, the afferent c-tactile (or ct) fibres [99] . these are highly specialized nerves that respond only to light, slow stroking at approximately 2.5 cm s â��1 (the speed of hand movements in grooming), and nothing else [100] . endorphin activation and uptake in the brain creates an opioid-like sense of relaxation, contentment and warmth [101, 102] that seems to provide a psychopharmacological platform for bondedness [101, [103] [104] [105] off which the second process, a cognitive sense of trust and obligation, is built. endorphins have a relatively short half-life (around 2.5 h), and so the system needs constant activation via grooming to maintain the requisite bonding levels, thereby making the system very time-costly. physical touch in the form of stroking and casual touch continues to form an important part of human social interaction and yields exactly the same endorphin effect [98] as it does in primate grooming. however, physical contact has an intimacy that limits it mainly to the inner layers of our networks [106, 107] . moreover, it is physically impossible to groom with more than one individual at a time with the same focused attention that seems to be important in its execution. this, combined with the constraints on time and the minimum time required to maintain a working relationship, ultimately places a limit on the number of relationships an animal can bond with using only this mechanism. in primates, this limits group size to about 50 individuals (the upper limit on average species group size in primates [68] ). groups larger than this are prone to fragmentation and ultimately to fission [108] . in order to be able to increase group size, humans discovered how to exploit other behaviours that also trigger the endorphin system in a way that is, in effect, a form of grooming-at-adistance. these include laughter [109] , singing [110] , dancing [111] , emotional storytelling [112] , and communal eating [113, 114] and drinking (of alcohol) [115] , all of which trigger the endorphin system and do so incrementally when done in synchrony [116, 117] . because they do not involve direct physical contact, more individuals can be 'groomed' simultaneously, thereby allowing a form of time-sharing that makes it possible to reach more individuals and so increase group size. the second component of this system is a cognitive mechanism. it centres around the knowledge of other individuals that can be built up by being in close contact. evolutionary studies of cooperation tend to view relationships as a form of debt-logging. such knowledge would not necessarily require frequent interaction since that can be done by third-party observation. 
rather, interacting closely with others allows an individual to get to know them well enough to predict their behaviour-to know that they really will come to your aid when you really need them, not because they owe you a favour but because they have a sense of obligation and commitment to you. in effect, it creates a sense of trust that acts as a rapid, intuitive (albeit imperfect) cue of reliability. in humans, this has been augmented by a capacity to build a more detailed 'picture' of another individual through conversation in a way that short circuits the need to invest excessive amounts of time in getting to know them. in other words, we can form a near-instantaneous impression of a stranger and use that as the basis for decisions about whether or not to engage with them. we do this by evaluating an individual's status on a set of largely exogenous cultural dimensions known as the seven pillars of friendship [96] . the seven pillars are: language (or, better still, dialect), place of origin, educational trajectory, hobbies and interests, worldview (religious/moral/political views), musical tastes and sense of humour. the more of these we share in common with someone, the stronger the relationship between us will be and the more altruistic we will be to each other [118] . this 'birds of a feather flock together' phenomenon is termed homophily [119] . the seven pillars are cues of membership of the same community. in small-scale societies, they would identify an extended kinship group, where kin selection (the 'kinship premium' [120] ) provides an additional guarantee of trustworthiness [94] . in effect, they function as a cultural totem pole at the centre of the metaphorical village green on which the community can hang its hats-an emotional consensus of who we are as a community and how we came to be that way, a way of building up mutual trust. in the contemporary context, it still identifies your (now reduced) kin group, but it also seems to identify the small community where you spent your formative years-the period when you acquired your sense of who you are and what community you belong to. this is the community whose mores and behaviour you understand in a very intuitive way, and it is this understanding that determines your sense of how well you can trust its members. homophily in friendships has also been documented in respect of a number of endogenous traits, including gender [17, 24, 66, 121] , ethnicity [122] and personality [85, 123] . gender has a particularly strong effect: approximately 70% of men's social networks consist of men, and approximately 70% of women's networks consist of women, with most of the opposite sex alters in both cases being extended family members. one reason for this is that the two sexes' style of social interaction is very different. women's relationships are more dyadic, serviced mainly by conversation and involve significantly more physical touch, whereas men's relationships are more group-based, typically involve some form of activity (sports, hobbies, social drinking) rather than conversation, and make much less use of haptic contact [106, 124, 125] . men's friendships are also typically less intense, more casual and more substitutable than women's: women will often try to keep core friendships going long after one of them has moved away (e.g. through e-mail or facebook), whereas men simply tend to find someone else to replace the absent individual(s) in a rather out-of-sight-out-of-mind fashion. 
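returning to the seven pillars listed above, here is a toy sketch of how they might be turned into a simple homophily index: each pillar is treated as a single categorical trait and the score is just the number of matches out of seven. this coding, and the example people, are assumptions made purely for illustration; the text only claims that the more pillars two people share, the stronger and more altruistic the tie tends to be.

```python
# toy homophily score based on the seven pillars of friendship described above.
# treating each pillar as one categorical value and counting matches is an
# illustrative simplification, not a published scoring scheme.
from dataclasses import dataclass

PILLARS = ('dialect', 'place_of_origin', 'education', 'hobbies',
           'worldview', 'musical_taste', 'sense_of_humour')

@dataclass
class Person:
    name: str
    traits: dict  # pillar -> value

def pillar_overlap(a: Person, b: Person) -> int:
    """number of the seven pillars on which two people match (0-7)."""
    return sum(a.traits.get(p) is not None and a.traits.get(p) == b.traits.get(p)
               for p in PILLARS)

alice = Person('alice', {'dialect': 'scouse', 'place_of_origin': 'liverpool',
                         'education': 'arts', 'hobbies': 'football',
                         'worldview': 'secular', 'musical_taste': 'indie',
                         'sense_of_humour': 'dry'})
bob = Person('bob', {'dialect': 'scouse', 'place_of_origin': 'liverpool',
                     'education': 'sciences', 'hobbies': 'football',
                     'worldview': 'secular', 'musical_taste': 'classical',
                     'sense_of_humour': 'dry'})

shared = pillar_overlap(alice, bob)
print(f'{shared}/7 pillars shared')   # more shared pillars -> stronger, more altruistic tie (per the text)
```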
homophily enables interactions (and hence relationships) to 'flow', both because those involved 'understand' each other better and because they share strategic interests. between them, these emotional and cognitive components create the intensity of bonds that act as the glue to bind individuals together. human, and primate, friendships are often described in terms of two core dimensions: being close (desire for spatial proximity) and feeling close (emotional proximity) [126] . between them, these ensure that bonded individuals stay together so that they are on hand when support is needed, thereby ensuring group cohesion through a drag effect on other dyads created by the ties in the network. since bonding, and the time investment this is based on, is in principle a continuum, this will not, of itself, give rise to the layered structure of a dunbar graph. to produce this, two more key components are needed: a constraint on time and the fact that friendships provide different kinds of benefits. the next section explains how this comes about. two approaches have been used to understand why social networks might have the layered structures they do. one has been to use agent-based models to reverse engineer the conditions under which the intersection between the time costs of relationships and the benefits that these provide give rise to structured networks. the other has been to solve the problem analytically as an optimal investment decision problem. in many ways, these represent top-down and bottom-up approaches, with the first focusing on the macro-level exogenous conditions that produce layered networks, and the second focusing on the micro-decisions that individuals have to make within these environments when deciding whom to interact with. building on the time budget models of dunbar et al. [95] , sutcliffe et al. [127] conceived the problem as the outcome of trying to satisfy two competing goals (foraging and socializing, where socializing provides a key non-foraging ecological benefit such as protection against internal or external threats) in a time-constrained environment. the aim was to identify which combination of strategies and cost/benefit regimes reproduced the exact layer structure of human social networks (for these purposes, taken as the 5, 15 and 150 layers) when agents try to maximize fitness (with fitness being the additive sum of five proximate components that reward social strategies differentially). in the model, 300 agents armed with different strategic preferences for investing in close versus distant ties interacted with each other to form alliances that provided access to different resources. in each run, the population initially consisted of equal numbers of agents pursuing each of three investment preference strategies. following the conventional design for models of this kind, at the end of each round (generation) the 20% of agents with the lowest fitness were removed (died) and were replaced by offspring produced by the top 20% (duplicating the parent's strategy), thereby allowing the size of the population to remain constant but its composition to evolve. the model was allowed to run until the distribution of strategy types reached an equilibrium (typically 2000 cycles). the outcomes from greater than 3000 runs (in which weightings on the fitness functions were systematically varied) were sorted into clusters using k-mean cluster analysis with quantitative fit to the dunbar numbers as the criterion. 
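for concreteness, here is a skeleton of the selection-replacement loop just described (300 agents, three social-investment strategies seeded in equal numbers, the bottom 20% by fitness removed each cycle and replaced by offspring copying a top-20% parent, run towards equilibrium over on the order of 2000 cycles). the fitness function below is a crude placeholder, not the five-component fitness of the cited model, so the outcome of this sketch carries no weight; it only illustrates the mechanics.

```python
# skeleton sketch of the selection-replacement loop described above (300 agents,
# three strategies, bottom 20% by fitness replaced each cycle by offspring of the
# top 20%). the fitness function is a placeholder assumption, NOT the cited model's
# five-component fitness.
import random

STRATEGIES = ('few_strong_ties', 'no_preference', 'multilevel')
rng = random.Random(1)

def placeholder_fitness(strategy: str) -> float:
    # assumed payoffs plus noise, purely to make the loop runnable
    base = {'few_strong_ties': 1.0, 'no_preference': 0.9, 'multilevel': 0.95}[strategy]
    return base + rng.gauss(0, 0.2)

def run(generations: int = 2000, n_agents: int = 300, cull: float = 0.2):
    # start with equal numbers of each strategy, as in the model described above
    population = [STRATEGIES[i % 3] for i in range(n_agents)]
    for _ in range(generations):
        scored = sorted(population, key=placeholder_fitness)   # ascending fitness
        k = int(cull * n_agents)
        survivors = scored[k:]                                 # drop the least fit 20%
        parents = scored[-k:]                                  # the top 20% reproduce
        offspring = [rng.choice(parents) for _ in range(k)]    # offspring copy a parent's strategy
        population = survivors + offspring
    return {s: population.count(s) for s in STRATEGIES}

print(run(generations=200))   # shortened run for illustration
```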
numerically, strategies that favour having just a few strong ties dominate the fitness landscape, yielding the kinds of small monogamous or harem-based societies that are widely common in mammals. the second most common strategy was the one where agents have no preferences for forming close ties of any kind and instead form large, anonymous herds with no internal structure, similar to those characteristic of herd-forming mammals. by contrast, multilevel social structures of the kind found in many primates, and especially those with the specific layer sizes of human social networks, were extremely rare (accounting for less than 1% of runs). they occurred only under a very limited set of circumstances, namely when a capacity for high investment in social time (i.e. sufficient spare time over that needed for foraging), preferential social interaction strategies, high mortality risk and steep reproductive differentials between individuals coincided. these are all traits characteristic of primates. the alternative approach considers the investment decisions themselves. tamarit et al. [128] developed a one-parameter bayesian statistical urn model in which individuals choose to invest their limited social capital in relationships that provide different benefits (identified here as network layers). the model seeks to optimize the distribution of effort across relationship types, where s_k is the cost of a relationship in layer k ∈ 1, 2, . . ., r, l_k is the number of alters assigned to layer k (with l = Σ_k l_k), s is the total amount of resource (time) available, n is the total population size and µ is the lagrange multiplier associated with the constraint imposed by total resources (i.e. Σ_k s_k l_k = s). when the cost (i.e. the time investment that has to be made in a relationship) is a monotonic negative function of layer (as in figure 3), this yields layered structures of exactly the kind found in human social networks: few close ties and many weak ones (the lower curve in figure 4). however, it turns out that the structure of the network inverts (more close ties and few distant ties) when the lagrange multiplier falls below µ = 0 (the upper curve in figure 4). one context in which this might happen is when the available community of alters to choose from is small, so that there is spare resource capacity per available alter. a comparison of migrant versus host communities in spain revealed that there is indeed a phase transition from conventionally structured networks (layers with a concave structure, i.e. a few close friends, more intermediate friends and many casual friends) to inverted networks (convex structures, in which ego has more close friends, some intermediate friends and only a few casual friends) at µ ≈ 0 [128] . this had not been noted previously because, in the normal population, most individuals fall in the µ > 0 region; the very small number falling in the µ < 0 region had simply been viewed as statistical error. however, most migrants, who typically have fewer social opportunities, fall in the µ < 0 region and so have inverted (convex) networks as a result, with only a small number having standard (i.e. concave) networks. this seems to suggest that, when surplus time is available, people prefer to invest more heavily in their closest ties rather than distributing their effort more evenly, because the emotional and practical support offered by these inner-layer ties is more valuable to them.
this may explain the structure of primate groups where the inner core of strong ties typically represents a larger proportion of group size in species that live in smaller groups. the fact that the scaling ratio is consistently approximately 3 in human (and primate) social networks raises the possibility that heider's structural balance triads might explain the layered structuring. this possibility was considered by klimek et al. [129] who studied an ising-type coevolutionary voter model in a context where there are several social functions (say, friendship and business alliances) such that there are separate linked networks where most individuals occupy positions in each network. they showed that when the networks (i.e. functions) vary in rewiring probability (slow versus fast turnover in relationships), the single large network will eventually undergo a phase transition known as shattered fragmentation in which the community fractures into a large number of small subnetworks (or cliques). this happens only when one of the rewiring frequencies reaches a critically high level. when klimek et al. examined data from the pardus online mom game world, they found that the slow rewiring network (friendship) produced a weakly multimodal right-skewed, fattailed distribution with modal group sizes at 1-2 and approximately 50 players with a few very large super-communities centred around 160 or 1200 members. by contrast, the two fast rewiring networks (in this context, trading and communication functions) both underwent fragmentation into a large number of smaller subnetworks, with a single peak in group size at approximately 50 in both cases, just as the model predicts. when the three networks were projected onto a single multidimensional mapping, very distinct peaks emerged in the distribution of group sizes in both model and data at approximately 40, 150 and 1200, much as we find in table 2. this seems to suggest that when triadic closure is a criterion for relationship stability and there is more than one criterion by which individuals can form ties, layering emerges naturally in networks through self-organizing fragmentation effects. so far, we have considered hierarchically inclusive layer structures. in these, the whole population is contained in the lowest layer, and the higher layers are created by successively bolting together, layer by layer, the groupings that make up each lower layer rather in the way military units are structured [130] . most social networks seem to work like this. however, layers can also arise when some individuals are allocated positions of status, so that the members of the community are distributed across different layers with most individuals in the base layer and a few individuals in one or more higher layers. networks of this kind are characteristic of management structures and the kinds of social hierarchies found in feudal societies. layered structures of this kind seem to emerge rather easily when individuals differ in their willingness to compromise. dã¡vid-barrett & dunbar [131] used an agent-based model to investigate the processes of group coordination when a community has to converge on an agreed compass direction (a proxy for any communal action or opinion that has the advantage of allowing up to 360 different values to be held rather than just two as in more conventional ising models), but one group member is so convinced they have the right answer that they refuse to compromise. 
if agents can assign weightings to each other on the basis of some preference criterion, however arbitrary, a layered structure emerges with an 'elite' subgroup that acts, in effect, as a management clique. multilevel structures of this kind have the advantage that they increase the speed with which decisions are adopted. multilayer networks are optimal when the costs associated with maintaining relationships, combined with the costs of information flow, are high. in such cases, a social hierarchy can be adaptive: when the hierarchy is steep, information needs to traverse fewer relationships (shorter path lengths), either because the elite effectively act as bridges between lower level groups (distributed management) or because the elite imposes its decisions on the individuals in the lower strata (dictatorial management). falling communication costs lead to a less steep hierarchy when socially useful information is evenly distributed, but to a steeper hierarchy when socially useful information is unevenly distributed. in human social networks, the layers have very characteristic interaction frequencies with ego ( figure 3) . approximately 40% of all social effort (whether indexed as the frequency or duration of interaction) is directed to the five individuals in the closest layer, with another 20% devoted to the remaining 10 members of the second layer. thus, 60% of social time is devoted to just 15 people. comparable results have been reported from large-scale surveys in the uk [8] and in china [10] . this will inevitably affect the rate with which information, innovations or disease propagate through a network. however, network structure can speed up or slow down the rate of propagation, depending on the precise nature of the social processes involved. in a very early (1995) analysis of this, we used boyd & richerson's [5] mean field ising model of cultural transmission to study what happens when these kinds of models are applied to structured networks [130] . in the model, individuals acquire their information from n cultural parents, each of whom can differ in their cultural state. the model was run with a population of 10 000 agents mapped as nodes on a 100 * 100 lattice wrapped on a torus so as to prevent edge effects. structure was imposed by allowing nodes to interact only with their eight closest neighbours on the lattice. on a regular lattice, these consist of two distinct sets: direct contacts (the four adjacent nodes on the main diagonals) and indirect contacts (the four corner nodes that can only be reached indirectly through the four adjacent nodes). in effect, these are equivalent to friends and friends-of-friends. at each generation, a node can change its cultural variant either by mutation or by imitation from one of its neighbouring nodes, with transition probabilities determined by a three-element vector specifying node-specific values of the boyd-richerson cultural inheritance bias functions (one reflecting the self-mutation rate, the other two the transmission, or copying, rates from the four 'friends' and the four 'friends-of-friends', with the proviso that all three sum to 1). when the spatial constraint is dropped and everyone is allowed to interact with everyone else, the model replicates exactly the findings of the boyd-richerson [5] cultural inheritance model. the population evolves to full penetrance by a mutant cultural variant initially seeded at just one node (i.e. with a probability of occurrence of just 0.0001) in 75-150 generations. 
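a compact sketch of the lattice set-up just described: 10 000 nodes on a 100 × 100 torus, each interacting only with its eight nearest neighbours, with the mutant variant seeded at a single node and each node updating by mutation or by imitating a neighbour. the transition probabilities used here are simple placeholders rather than the boyd-richerson bias functions, so this reproduces the mechanics of the model, not its published results.

```python
# compact sketch of the toroidal-lattice transmission model described above:
# 10 000 nodes on a 100 x 100 torus, each interacting only with its eight nearest
# neighbours, with a mutant variant seeded at a single node. the transition
# probabilities below are placeholders, not the boyd-richerson bias vector.
import random

SIZE = 100
rng = random.Random(0)

def neighbours(x, y):
    """the eight surrounding cells, wrapped on a torus to avoid edge effects."""
    return [((x + dx) % SIZE, (y + dy) % SIZE)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]

def run(generations=300, p_copy=0.3, p_mutate=0.0001):
    grid = {(x, y): 0 for x in range(SIZE) for y in range(SIZE)}
    grid[(SIZE // 2, SIZE // 2)] = 1                 # seed the mutant variant at one node
    for _ in range(generations):
        new = dict(grid)
        for (x, y), variant in grid.items():
            if rng.random() < p_mutate:
                new[(x, y)] = 1 - variant            # spontaneous mutation
            elif rng.random() < p_copy:
                nx_, ny_ = rng.choice(neighbours(x, y))
                new[(x, y)] = grid[(nx_, ny_)]       # imitate a randomly chosen neighbour
        grid = new
    return sum(grid.values()) / len(grid)            # fraction of nodes carrying the mutant

print('mutant frequency after run:', run())
```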
with spatial (or social) constraints in place, however, two important effects emerge. first, depending on the steepness of the inheritance bias functions, 44-60% of mutant seedings went extinct before achieving full penetrance, apparently because they often became trapped in eddies at particular locations and could not break out before going extinct. in those runs where the mutant achieved full penetrance (i.e. all nodes became mutants), the time to penetrance was 150-300 generations for the same set of transmission biases. in other words, the mutant trait took far longer to spread through the population. once again, the time taken to break out of local eddies was the main reason for the much slower penetrance. the difference between these runs and those where the mutant went extinct depended on the balance between the stochastic rates at which new 'infected' clusters were created and died out. if a local extinction occurred early in the system's evolution, when few mutant clusters had become established, global extinction was more likely (the classic small-population extinction risk phenomenon in conservation biology [132] ). changing population size or the number of cultural 'parents' had a quantitative effect, but did not change the basic pattern: penetrance was slower if there were fewer cultural 'parents', for example, or if the population size was larger. to explore the rate at which information flows through a community, dunbar [130] modelled the likelihood that ego (at the centre of the network) would hear about a novel innovation via direct face-to-face contact for communities of different size (5, 15, 50, 150, 500, 1500 and 5000 individuals), given the layer-specific rates of contact shown in figure 3 (extrapolated out to the 5000-layer). the probability, p_i, of hearing about an innovation seeded somewhere in a network with i layers is the conjoint probability of encountering any given individual in a network layer of size n_i and the likelihood that any one of these individuals would have the trait in question (i.e. be infected), summed across all layers: p_i = Σ n_i r_k c_i [f2f], where n_i is the number of individuals in the ith annulus (network layer), r_k is the likelihood that any one of them will have the trait (here taken to be constant at r_k = 0.01) and c_i [f2f] is the likelihood of contacting any given individual in that annulus face-to-face. figure 5a plots the results. (figure 5 caption: (a) the cumulative probability asymptotes at a network size of approximately 150, but the optimum number of alters is 50, identified by the point at which the curve begins to decelerate, defined as 1/e of the way down from the asymptote, since beyond this the benefits of increasing network size diminish exponentially; reproduced from [130] . (b) the effect on reachability of removing nodes in different layers of egocentric twitter graphs: the larger the effect, the greater the disruption to information flow; the horizontal dotted lines demarcate 1/e from the asymptote, and the vertical dotted line marks the optimal group size for information diffusion; data from [133] .) the probability of acquiring information reaches an asymptotic value at a community size of approximately 1500, with no further gain in the likelihood of hearing about an innovation as community size increases beyond this. the optimal community size for information transmission can be identified by the inflection point (the point at which the marginal gain begins to diminish).
with a graph of this form, this occurs at the value on the x-axis at which the asymptotic value on the y-axis is reduced by 1/e. this is at a community size of exactly 50. the gains to be had by increasing community size beyond approximately 50 diminish exponentially and become trivial beyond a community size of approximately 150 individuals. this was later confirmed by arnaboldi et al. [133] , who modelled information diffusion in actual twitter networks (figure 5b). importantly, and contrary to granovetter's [134] well-known claim, it seems that it is the inner layers (stronger ties) that may have most impact on the likelihood of acquiring information by diffusion, not the outermost layer (weak ties). the outermost 150-layer (which is disproportionately populated by distant kin [59, 135] ) presumably serves some other function, such as providing a mutual support network [133] . this finding appears to conflict with earlier analyses [6, 137] that have emphasized the importance of weak links (long-range connections) in the rate at which infections or information are propagated in networks. the issue, however, hinges on which layers are counted as strong (short-range) and which as weak (long-range). previous analyses, including granovetter's own, have tended to be unspecific on this point. if what he had in mind as weak ties was the 50-layer, then his claim holds; if he was thinking of the 150- or even the 5000-layer, then it seems he was wrong. even so, it seems that the information value of 50-layer ties is considerably less than that of alters in the 5- and 15-layers, who are also likely to be considered more trustworthy sources of information. nonetheless, granovetter might still be right if either of two conditions holds. one is that the analyses considered only ego acquiring information by direct personal contact; the models did not consider the impact of upward information flow through the network from the source of the information towards ego. the other is that granovetter might have been right for the wrong reason: the function of networks is not information flow (or acquisition) but the provision of direct functional support, such as protection against external threats or sources of economic support (a view which would accord better with the view of primate social systems elaborated in §4). in other words, as is the case in primate social systems, information flow is a consequence of network structure, not its driving force in terms of evolutionary selection [39] . it may, nonetheless, be that the 150-layer provides the principal access channels to the global network beyond the individual's primary personal social sphere. this is suggested by an analysis that used m-polynomials derived from chemical graph theory to integrate dunbar graphs into the milgram small-world 'six degrees of separation' phenomenon. the capacity to reach an unknown remote individual in 4-6 links is only possible if, at each step in the chain, the message-holder can access a large number of alters in their personal network [67] . however, this analysis only considered 15 versus 150 network contacts, and 150 significantly over-engineers the solution. further work is needed to explore the optimal network size and structure for transmission in more detail. a moot point, of course, is whether the capacity to send letters to a remote stranger is ever of any real functional value to us, or simply an amusing but unimportant by-product of the way personal networks interface with each other in global networks.
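the calculation behind the 1/e criterion can be sketched directly from the layer-summed probability given above. the layer-specific contact probabilities used below are assumed values standing in for figure 3 (which is not reproduced in the text), so the exact location of the optimum depends on them; with plausibly steep declines in contact rate by layer it falls at around 50, as the text describes.

```python
# sketch of the layer-summed diffusion probability described above,
# p = sum_i n_i * r_k * c_i, for communities of 5-5000 individuals. the per-layer
# face-to-face contact probabilities c_i are assumed values standing in for
# figure 3, so the exact optimum depends on them.
import math

cumulative_layers = [5, 15, 50, 150, 500, 1500, 5000]
r_k = 0.01                                   # probability an alter carries the trait (as in the text)
# assumed per-alter contact probabilities, declining steeply with layer
c_i = [0.5, 0.2, 0.05, 0.01, 0.002, 0.0004, 0.0001]

# annulus sizes n_i (members of each layer not already in the layer inside it)
annuli = [cumulative_layers[0]] + [b - a for a, b in zip(cumulative_layers, cumulative_layers[1:])]

p = []
running = 0.0
for n_i, c in zip(annuli, c_i):
    running += n_i * r_k * c                 # contribution of this layer
    p.append(running)

asymptote = p[-1]
threshold = asymptote * (1 - 1 / math.e)     # '1/e of the way down from the asymptote'
optimum = next(size for size, prob in zip(cumulative_layers, p) if prob >= threshold)

for size, prob in zip(cumulative_layers, p):
    print(f'community size {size:>5}: p = {prob:.4f}')
print('optimal community size by the 1/e criterion:', optimum)
```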
one complicating aspect of real social networks not usually considered in these models is the fact that social subnetworks are characterized by a high level of homophily, especially in the inner layers. in other words, people's friends tend to resemble them on an array of social and cultural dimensions [118, 119, 123] . analysing a large (20 million users) cellphone dataset, miritello et al. [85] differentiated, on the basis of calling patterns, two distinct phenotypes: 'social keepers' who focused their calls on a small, stable cohort of contacts (introverts?) and 'social explorers' who had a larger number of contacts with high churn (social butterflies, or extraverts?). each tended to exhibit a strong preference for its own phenotype. assuming that phone contact rates mirror face-to-face contact rates (as, in fact, seems to be the case [68, 133, 138] ), explorers were more likely to contact an infected individual because they were more wide-ranging in their social contacts. keepers remained buffered by their small social radius for longer. this reinforces the suggestion made earlier that innovations frequently go extinct in structured networks because they get trapped in eddies created by network structuring and risk going extinct before they can escape into the wider world. the role of extraverts in facilitating information flow was also noted by lu et al. [139, 140] in a study of networks parametrized by personality-specific contact rates from the community studied by [125] . they found that information flow was more efficient if the network consisted of two distinct phenotypes (in this case, actual introvert and extravert personality types) than if all community members were of the same phenotype. in large part, this was because extraverts (those who opted to prioritize quantity over quality of relationships) acted as bridges between subnetworks, thereby allowing information to flow more easily through the network as a whole. much of the focus in network dynamics has been on disease propagation. in most models, networks are assumed to remain essentially static in structure over time. this may not always be the case, since network structure may itself respond to both internal threats (stress or deception) and external threats (such as disease or exploitation or attack by outsiders). this is because threats such as disease or exploitation cause a breakdown in trust and trust is, as we saw, central to the structure of social networks. other factors that might cause networks to restructure themselves include a reduction in the time available for social interaction, access to a sufficiently large population to allow choice [128] or a change in the proportion of phenotypes (sex, personality or family size) when these behave differently. methods for studying networks that change dynamically through time have been developed [142] , although in practice these typically reflect past change rather than how networks are likely to respond to future challenges. here, my concern is with how networks might change as a consequence of the internal and external forces acting on them. because of the way relationships are serviced in social networks ( â§4), a reduction in time devoted to a tie results in an inexorable decline in the emotional strength of a tie, at least for friendships ( figure 6) . note, however, that family ties appear to be quite robust in the face of lack of opportunity to interact. figure 6 suggests that this effect happens within a matter of a few months (see also [143, 144] ). 
saramäki et al. [145] reported a turnover of approximately 40% in the network membership of young adults over an 18-month period after they had moved away from their home town, most of which occurred in the first nine months. a similar effect will occur when there is a terminal breakdown in a relationship. these seem to occur with a frequency of about 1% of relationships per year, though it is clear that some people are more prone to relationship breakdown than others [146]. most such breakdowns occur because of a breakdown in trust [59, 146]. although almost never considered in models of network dynamics, the division between family and friends can have significant consequences for the dynamics of networks, especially when comparing natural fertility regimes (no contraception) with demographic-transition fertility regimes (those that actively practise contraception). friendships require significantly more time investment than family relationships to maintain at a constant emotional intensity, especially so in the outer layers [147, 148], and because of this are more likely to fade (and to do so rather quickly) if contact is reduced [147] (figure 6). family relationships, on the other hand, are more forgiving of breaches of trust and underinvestment. in addition, when family relationships break down, they are apt to fracture catastrophically and irreparably [146], creating structural holes. by contrast, most friendships die quietly as a result of reduced contact, in many cases because they are simply replaced by alternatives. this probably means that, when a network is under threat, friendship ties are more likely to be lost than family ties. this would seem to be borne out by casual observation of the response to the covid-19 lockdown: virtual zoom-based family gatherings seem to be much more common than friendship-based meetings. under normal circumstances, the gaps left by the loss of a tie following a relationship breakdown are filled by moving someone up from a lower layer or by adding an entirely new person into the network from outside. saramäki et al. [145] noted that, when this happens, the new occupant of the vacated slot is contacted with exactly the same frequency as the previous occupant, irrespective of who they are. it seems that individuals have a distinctive social fingerprint, and this fingerprint is very stable across time [145]. however, if the opportunity for social interaction is restricted, or there is widespread breakdown in the level of trust (as when many people cease to adhere to social rules, or a culture of deception or antisocial behaviour evolves), then the inevitable response is for networks to fragment as individuals increasingly withdraw from casual contacts and focus their attention on those whom they trust most (normally the alters in the innermost layers of their network). iñiguez et al. [149] and barrio et al. [150] modelled the effect of two kinds of deception (selfish lies versus white lies) on network structure. selfish lies are those that benefit the liar, while white lies are those that benefit either the addressee or the relationship between the liar and the addressee (e.g. 'likes' on digital media). these two phenotypes differ radically in the effect they have on the relationship between the individuals concerned: the first will cause a reduction in the frequency of contact, resulting in a fragmentation of the network, whereas the second often reinforces network cohesion. if networks shrink sufficiently under stress, they may invert (figure 4).
there is indirect evidence for this in the effect that parasite load has on the size of communities in tribal societies: these decline in size the closer they are to the equator (the tropics being the main hotspot for the evolution of new diseases), and this correlates in turn with a corresponding increase in the number of languages and religions, both of which restrict community size [151, 152]. at high latitudes, where parasite loads tend to be low and less stable climates make long-range cooperation an advantage, community sizes are large, people speak fewer languages and religions tend to have wider geographical coverage, which, between them, will result in more extensive global networks [151, 152]. similar effects have been noted in financial networks, where network structure between institutions that trade with each other also depends on trust. there has been considerable interest in how network structure might influence the consequences of contagion effects when financial institutions collapse. network structure can affect how shocks spread through networks of banks, giving rise to default cascades in ways not dissimilar to the way diseases propagate through human social networks. although well-connected banks may be buffered against shocks because of the way the effects are diluted [153-155], much as well-connected individuals may be buffered against social stresses, a loss of trust between institutions invariably results in the contraction of networks, associated with more conservative trading decisions and a greater reluctance to lend [156], in ways reminiscent of social networks fragmenting in the face of a loss of trust. if effective network size (i.e. the number of ties an individual has) is reduced as a result of such effects, more serious consequences may follow at the level of the individual for health, wellbeing and even longevity. smaller social networks are correlated with increasing social isolation and loneliness, and loneliness in turn has a dramatic effect on morbidity and mortality rates. there is now considerable evidence that the number and quality of close friendships that an individual has directly affects their health, wellbeing, happiness, capacity to recover from surgery and major illness, and even their longevity (reviewed in [96, 157]), as well as their engagement with, and trust in, the wider community within which they live [114, 115]. indeed, the effects of social interaction can even outweigh more conventional medical concerns (obesity, diet, exercise, medication, alcohol consumption, local air quality, etc.) as a predictor of mortality [158]. most epidemiological studies have focused on close friends, but there is evidence that the size of the extended family can have an important beneficial effect, especially on children's morbidity and mortality risks [159]. these findings are mirrored by evidence from primates: the size of an individual's intimate social circle has a direct impact on its fertility, survival, how quickly it recovers from injury, and ultimately its biological fitness [160-166]. it is worth noting that dunbar graphs, with their basis in trust, have been used to develop online 'secret handshake' security algorithms for use in pervasive technology (e.g. safebook [167, 168]). pervasive technology aims to replace cellphone masts by using the phones themselves as waystations for transmitting a call between sender and addressee.
the principal problem this gives rise to is trust: a phone needs to be able to trust that the incoming phone is not intent on accessing its information for malicious purposes. safebook stores the phone owner's seven pillars as a vector which can then be compared with the equivalent vector from the incoming phone. a criterion can be set as to how many 'pillars' must match for another phone to be considered trustworthy. the dunbar graph has also been used to develop a bot-detection algorithm by comparing a node's network size and shape with that of a real human (i.e. a dunbar graph): this algorithm out-performs all other currently available bot-detection algorithms [169] . we might ask what effects we might expect in the light of this from the lockdowns imposed by most countries in 2020 in response to covid-19. i anticipate four likely effects. one is that if lockdown continues for more than about three months, we may expect to see a weakening of existing friendships, especially in groups like the elderly whose network sizes are already in age-dependent decline. since older people find it more difficult to make new friends, an increased level of social isolation and loneliness is likely to result, with consequent increases in the diseases of old age, including dementia and alzheimer. second, we may expect to see an increased effort to recontact old friends, in particular, immediately after lockdown is lifted. we already see evidence for this in telephone call patterns: if there is a longer than normal gap before an alter is called again, the next call is significantly longer than average as though attempting to repair the damage to relationship quality [170] . third, the weakening of friendship quality can be expected (i) to make subsequent meetings a little awkward because both parties will be a little unsure of how they now stand when meeting up again and (ii) to result in some churn in networks where new friendships developed through street-based covid community groups are seen as more valuable (and more accessible) than some previous lower rank friendships. finally, we may expect the fear of coronavirus contagion (an external threat) to result in a reduction in the frequency with which some individuals (notably introverts and the psychologically more cautious) visit locations where they would come into casual contact with people they do not know. they may also reduce frequencies of contact with low-rank friends, and perhaps even distant family members, whose behaviour (and hence infection risk) they cannot monitor easily. this is likely to result in more inverted networks, as well as networks focused mainly on people who are more accessible. although this effect will weaken gradually with time, and network patterns are likely to return to pre-covid patterns within 6-12 months, some friendship ties may have been sufficiently weakened to slip over the threshold into the 500 (acquaintances) layer. my aim in this paper has been to introduce a rather different perspective on the social structure of communities than that normally considered in disease propagation models, and to explain the forces that underpin real-world social networks. i have presented evidence to suggest that human social networks are very much smaller and more highly structured than we usually assume. in addition, network size and composition can vary considerably across individuals as a function of sex, age, personality, local reproductive patterns and the size of the accessible population. 
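as a rough illustration of the pillar-matching idea described above, the following python sketch compares two profile vectors entry by entry and applies a match threshold; the pillar labels, the example profiles, and the threshold value are placeholder assumptions rather than the actual safebook design.

```python
# Minimal sketch of the "pillar matching" idea described above: each device
# stores a small vector of social/cultural dimensions and trusts an incoming
# device only if enough entries match. Pillar names, example profiles, and the
# threshold are placeholders, not the actual Safebook implementation.
PILLARS = ["language", "birthplace", "education", "hobbies",
           "politics", "religion", "music"]          # hypothetical labels

def pillar_vector(profile):
    """Represent a profile as a tuple ordered by PILLARS."""
    return tuple(profile.get(p) for p in PILLARS)

def is_trusted(own, incoming, threshold=4):
    """Trust the incoming device if at least `threshold` pillars match."""
    matches = sum(1 for a, b in zip(own, incoming) if a is not None and a == b)
    return matches >= threshold

me = pillar_vector({"language": "en", "birthplace": "leeds", "education": "phd",
                    "hobbies": "climbing", "politics": "c", "religion": "none",
                    "music": "folk"})
peer = pillar_vector({"language": "en", "birthplace": "leeds", "education": "msc",
                      "hobbies": "climbing", "politics": "c", "religion": "none",
                      "music": "jazz"})
print(is_trusted(me, peer))   # True: five of the seven pillars match
```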
while casual contacts might be important for the spread of highly infectious diseases, social time is actually devoted to no more than 15 individuals, and this will slow down rates of transmission if physical contact or repeated exposure to the same individual is required for successful infection. both external and internal threats can destabilize network ties by affecting the level of trust, causing networks to contract and fragment. if networks fragment under these threats, there will be knock-on consequences in terms of health, wellbeing and longevity.
data accessibility. this article has no additional data.
competing interests. i declare i have no competing interests.
funding. most of the research reported here was funded by the uk epsrc/esrc tess research grant, the british academy centenary research project (lucy to language), the erc relnet advanced research fellowship, the eu fp7 socialnets grant and the eu horizon 2020 ibsen research grant.
references. the gut microbiome of nonhuman primates: lessons in ecology and evolution reciprocal interactions between gut microbiota and host social behavior in press. the role of the microbiome in the biology of social behaviour infectious diseases of humans: dynamics and control culture and the evolutionary process collective dynamics of 'small-world' networks the effects of local spatial structure on epidemiological invasions dynamic social networks and the implications for the spread of infectious disease social encounter networks: characterizing great britain social mixing patterns in rural and urban areas of southern china household structure and infectious disease transmission threshold parameters for a model of epidemic spread among households and workplaces bondedness and sociality close social associations in animals and humans: functions and mechanisms of friendship measuring patterns of acquaintanceship social network size in humans exploring variations in active network size: constraints and ego characteristics activity in social media and intimacy in social relationships calling dunbar's numbers comparative analysis of layered structures in empirical investor networks and cellphone communication networks from 150 to 3: dunbar's numbers communication dynamics in finite capacity social networks do online social media cut through the constraints that limit the size of offline social networks? analysis of co-authorship ego networks an atlas of anglo-saxon england the world we have lost group size in social-ecological systems coevolution of neocortex size, group size and language in humans the complex structure of hunter-gatherer social networks we're all kin: a cultural study of a mountain neighborhood genetics: human aspects sex differences in mate choice among the 'nebraska' amish of central pennsylvania let my people grow (work paper 1) company academic tribes and territories modeling users' activity on twitter networks: validation of dunbar's number the social brain hypothesis why are there so many explanations for primate brain evolution?
phil the brain structural disposition to social interaction amygdala volume and social network size in humans intrinsic amygdala-cortical functional connectivity predicts social network size in humans ventromedial prefrontal volume predicts understanding of others and social network size 2012 orbital prefrontal cortex volume predicts social network size: an imaging study of individual differences in humans online social network size is reflected in human brain structure neural connections foster social connections: a diffusion-weighted imaging study of social networks social brain volume is associated with in-degree social network size among older adults the structural and functional brain networks that support human social networks gray matter volume of the anterior insular cortex and social networking 2020 10,000 social brains: sex differentiation in human brain anatomy social network size affects neural circuits in macaques in press. baboons (papio anubis) living in larger social groups have bigger brains the evolution of the social brain: anthropoid primates contrast with other vertebrates sexual segregation among feral goats: testing between alternative hypotheses sex differences in feeding activity results in sexual segregation of feral goats the neurobiology of the prefrontal cortex: anatomy, evolution and the origin of insight social cognition and the brain: a meta-analysis. hum. brain mapp the extreme capsule fiber complex in humans and macaque monkeys: a comparative diffusion mri tractography study relationships and the social brain: integrating psychological and evolutionary perspectives higher order intentionality tasks are cognitively more demanding perspective-taking and memory capacity predict social network size orbital prefrontal cortex volume correlates with social cognitive competence different association between intentionality competence and prefrontal volume in left-and righthanders reading wild minds: a computational assay of theory of mind sophistication across seven primate species the psychology of interkpersonal relations social networks, support cliques and kinship 2020 on m-polynomials of dunbar graphs in social networks the structure of online social networks mirrors those in the offline world organizational complexity and demographic scale in primary states organizational structure and scalar stress primate social group sizes exhibit a regular scaling pattern with natural attractors network scaling reveals consistent fractal pattern in hierarchical mammalian societies neocortex size and social network size in primates stepwise evolution of stable sociality in primates discrete hierarchical organization of social group sizes 2020 the fractal structure of communities of practice: implications for business organization sizes of permanent campsites reflect constraints on natural human communities fractal multi-level organisation of human groups in a virtual world stone circles and the structure of bronze age society how many faces do people know? 
group size as a trade-off between fertility and predation risk: implications for social evolution tradeoff between fertility and predation risk drives a geometric sequence in the pattern of group sizes in baboons fertility as a constraint on group size in african great apes sex differences in social focus across the life cycle in humans limited communication capacity unveils strategies for human interaction individual differences and personal social network size and structure extraverts have larger social network layers but do not feel emotionally closer to individuals at any layer personality traits and ego-network dynamics 2017 fertility, kinship and the evolution of mass ideologies group augmentation and the evolution of cooperation 2012 cooperation, behavioural synchrony and status in social networks social structure as a strategy to mitigate the costs of groupliving: a comparison of gelada and guereza monkeys group size, grooming and social cohesion in primates modelling the role of trust in social relationships time as an ecological constraint the anatomy of friendship beta-endorphin concentrations in cerebrospinal fluid of monkeys are influenced by grooming relationships reinforcing social bonds by touching modulates endogenous âµ-opioid system activity in humans the neurophysiology of unmyelinated tactile afferents stroking modulates noxious-evoked brain activity in human infants the brain opioid theory of social attachment: a review of the evidence shared neural mechanisms underlying social warmth and physical warmth a neurobehavioral model of affiliative bonding: implications for conceptualising a human trait of affiliation the social role of touch in humans and primates: behavioural function and neurobiological mechanisms state-dependent âµ-opioid modulation of social motivation-a model topography of social touching depends on emotional bonds between humans cross-cultural similarity in relationship-specific social touching time: a hidden constraint on the behavioural ecology of baboons social laughter triggers endogenous opioid release in humans the ice-breaker effect: singing mediates fast social bonding naltrexone blocks endorphins released when dancing in synchrony emotional arousal when watching drama increases pain threshold and social bonding 2018 î¼-opioid receptor system mediates reward processing in humans breaking bread: the functions of social eating functional benefits of (modest) alcohol consumption 2010 rowers' high: behavioural synchrony is correlated with elevated pain thresholds silent disco: dancing in synchrony leads to elevated pain thresholds and social closeness do birds of a feather flock together? the relationship between similarity and altruism in social networks birds of a feather: homophily in social networks altruism in social networks: evidence for a 'kinship premium' gender homophily in online dyadic and triadic relationships playing with strangers: which shared traits attract us most to new people? 
plos one 10, e0129688 2020 homophily in personality enhances group success among real-life friends women favour dyadic relationships, but men prefer clubs communication in social networks: effects of kinship, network size and emotional closeness inclusion of other in the self scale and the structure of interpersonal closeness modelling the evolution of social structure cognitive resource allocation determines the organisation of personal networks triadic closure dynamics drives scaling laws in social multiplex networks constraints on the evolution of social institutions and their implications for information flow social elites emerge naturally in an agentbased framework when interaction patterns are constrained online social networks and information diffusion: the role of ego networks the strength of weak ties on the evolution of language and kinship unravelling the evolutionary function of communities navigation in a small world social interactions across media: interpersonal communication on the internet, telephone and face-to-face size matters: variation in personal network size, personality and effect on information transmission on optimising personal network and managing information flow managing relationship decay: network, gender, and contextual effects detecting sequences of system states in temporal networks communication technology and friendship during the transition from high school to college the persistence of social signatures in human communication sex differences in relationship conflict and reconciliation hamilton's rule predicts anticipated social support in humans the costs of family and friends: an 18-month longitudinal study of relationship maintenance and decay effects of deception in social networks dynamics of deceptive interactions in social networks pathogen prevalence predicts human cross-cultural variability in individualism/collectivism assortative sociality, limited dispersal, infectious disease and the genesis of the global pattern of religion diversity contagion in financial networks contagion in financial networks where the risks lie: a survey on systemic risk fire sales in a model of complexity 2020. the neurobiology of social distance social relationships and mortality risk: a metaanalytic review one thousand families in strong and consistent social bonds enhance the longevity of female baboons social bonds of female baboons enhance infant survival the benefits of social capital: close social bonds among female baboons enhance offspring survival responses to social and environmental stress are attenuated by strong male bonds in wild macaques network connections, dyadic bonds and fitness in wild female baboons social support reduces stress hormone levels in wild chimpanzees across stressful events and everyday affiliations family network size and survival across the lifespan of female macaques safebook: a privacy-preserving online social network leveraging on real-life trust a provably secure secret handshake with dynamic controlled matching making social networks more human: a topological approach absence makes the heart grow fonder: social compensation when failure to interact risks weakening a relationship
acknowledgements. i thank the reviewers for drawing my attention to some useful additional material.
key: cord-168862-3tj63eve authors: porter, mason a.
title: nonlinearity + networks: a 2020 vision date: 2019-11-09 journal: nan doi: nan sha: doc_id: 168862 cord_uid: 3tj63eve i briefly survey several fascinating topics in networks and nonlinearity. i highlight a few methods and ideas, including several of personal interest, that i anticipate to be especially important during the next several years. these topics include temporal networks (in which the entities and/or their interactions change in time), stochastic and deterministic dynamical processes on networks, adaptive networks (in which a dynamical process on a network is coupled to dynamics of network structure), and network structure and dynamics that include "higher-order" interactions (which involve three or more entities in a network). i draw examples from a variety of scenarios, including contagion dynamics, opinion models, waves, and coupled oscillators. in its broadest form, a network consists of the connectivity patterns and connection strengths in a complex system of interacting entities [121]. the most traditional type of network is a graph g = (v, e) (see fig. 1a), where v is a set of "nodes" (i.e., "vertices") that encode entities and e ⊆ v × v is a set of "edges" (i.e., "links" or "ties") that encode the interactions between those entities. however, recent uses of the term "network" have focused increasingly on connectivity patterns that are more general than graphs [98]: a network's nodes and/or edges (or their associated weights) can change in time [70, 72] (see section 3), nodes and edges can include annotations [26], a network can include multiple types of edges and/or multiple types of nodes [90, 140], it can have associated dynamical processes [142] (see sections 3, 4, and 5), it can include memory [152], connections can occur between an arbitrary number of entities [127, 131] (see section 6), and so on. associated with a graph is an adjacency matrix a with entries a_{ij}. in the simplest scenario, edges either exist or they don't. if edges have directions, a_{ij} = 1 when there is an edge from entity j to entity i and a_{ij} = 0 when there is no such edge. when a_{ij} = 1, node i is "adjacent" to node j (because we can reach i directly from j), and the associated edge is "incident" from node j and to node i. the edge from j to i is an "out-edge" of j and an "in-edge" of i. the number of out-edges of a node is its "out-degree", and the number of in-edges of a node is its "in-degree". for an undirected network, a_{ij} = a_{ji}, and the number of edges that are attached to a node is the node's "degree". one can assign weights to edges to represent connections with different strengths (e.g., stronger friendships or larger transportation capacity) by defining a function w : e → r. in many applications, the weights are nonnegative, although several applications [180] (such as in international relations) incorporate positive, negative, and zero weights. in some applications, nodes can also have self-edges and multi-edges. the spectral properties of adjacency (and other) matrices give important information about their associated graphs [121, 187]. for undirected networks, it is common to exploit the beneficent property that all eigenvalues of symmetric matrices are real. traditional studies of networks consider time-independent structures, but most networks evolve in time.
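before turning to networks that change in time, here is a minimal python sketch of the adjacency-matrix bookkeeping described above, for an invented four-node directed network: it uses the convention a_{ij} = 1 for an edge from node j to node i, reads off out- and in-degrees, and computes the (real) spectrum of the symmetrized matrix.

```python
# Minimal sketch of the bookkeeping described above, for a small invented
# directed network: build the adjacency matrix, read off in- and out-degrees,
# and inspect the spectrum of the symmetrized (undirected) matrix.
import numpy as np

n = 4
A = np.zeros((n, n))
edges = [(0, 1), (1, 2), (2, 0), (3, 0)]   # (j, i): an edge from node j to node i
for j, i in edges:
    A[i, j] = 1                            # a_ij = 1 for an edge from j to i

out_degree = A.sum(axis=0)                 # column sums give out-degrees
in_degree = A.sum(axis=1)                  # row sums give in-degrees
print("out-degrees:", out_degree, "in-degrees:", in_degree)

A_und = np.maximum(A, A.T)                 # undirected version of the network
eigvals = np.linalg.eigvalsh(A_und)        # real spectrum of a symmetric matrix
print("eigenvalues:", np.round(eigvals, 3))
```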
for example, social networks of people and animals change based on their interactions, roads are occasionally closed for repairs and new roads are built, and airline routes change with the seasons and over the years. to study such time-dependent structures, one can analyze "temporal networks". see [70, 72] for reviews and [73, 74] for edited collections. the key idea of a temporal network is that networks change in time, but there are many ways to model such changes, and the time scales of interactions and other changes play a crucial role in the modeling process. there are also other important modeling considerations. [figure 1 caption: panel (c) shows an example of a multilayer network with three layers; each state node (i.e., node-layer tuple) has a corresponding physical node and layer, so the tuple (a, 3) denotes physical node a on layer 3, the tuple (d, 1) denotes physical node d on layer 1, and so on. intralayer edges are drawn as solid arcs and interlayer edges as broken arcs; an interlayer edge is dashed if it connects corresponding entities and dotted if it connects distinct ones, and arrowheads represent unidirectional edges. the networks were drawn with tikz-network (jürgen hackl, https://github.com/hackl/tikz-network), which allows one to draw networks, including multilayer networks, directly in a latex file; panel (b) is inspired by fig. 1 of [72]; panel (d), which is in the public domain, was drawn by wikipedia user cflm001 and is available at https://en.wikipedia.org/wiki/simplicial_complex.] to illustrate potential complications, suppose that an edge in a temporal network represents close physical proximity between two people in a short time window (e.g., with a duration of two minutes). it is relevant to consider whether there is an underlying social network (e.g., the friendship network of mathematics ph.d. students at ucla) or if the people in the network do not in general have any other relationships with each other (e.g., two people who happen to be visiting a particular museum on the same day). in both scenarios, edges that represent close physical proximity still appear and disappear over time, but indirect connections (i.e., between people who are on the same connected component, but without an edge between them) in a time window may play different roles in the spread of information. moreover, network structure itself is often influenced by a spreading process or other dynamics, as perhaps one arranges a meeting to discuss a topic (e.g., to give me comments on a draft of this chapter). see my discussion of adaptive networks in section 5. for convenience, most work on temporal networks employs discrete time (see fig. 1(b)). discrete time can arise from the natural discreteness of a setting, discretization of continuous activity over different time windows, data measurement that occurs at discrete times, and so on. one way to represent a discrete-time (or discretized-time) temporal network is to use the formalism of "multilayer networks" [90, 140].
one can also use multilayer networks to study networks with multiple types of relations, networks with multiple subsystems, and other complicated networked structures. a multilayer network m (see fig. 1(c)) has a set v of nodes (these are sometimes called "physical nodes", and each of them corresponds to an entity, such as a person) that have instantiations as "state nodes" (i.e., node-layer tuples, which are elements of the set v_m) on layers in l. one layer in the set l is a combination, through the cartesian product l_1 × · · · × l_d, of elementary layers. the number d indicates the number of types of layering; these are called "aspects". a temporal network with one type of relationship has one type of layering, a time-independent network with multiple types of social relationships also has one type of layering, a multirelational network that changes in time has two types of layering, and so on. the set of state nodes in m is v_m ⊆ v × l_1 × · · · × l_d, and the set of edges is e_m ⊆ v_m × v_m; an edge ((i, α), (j, β)) ∈ e_m indicates that there is an edge from node j on layer β to node i on layer α (and vice versa, if m is undirected). for example, in fig. 1(c), there is a directed intralayer edge from (a, 1) to (b, 1) and an undirected interlayer edge between (a, 1) and (a, 2). the multilayer network in fig. 1(c) has three layers, |v| = 5 physical nodes, d = 1 aspect, |v_m| = 13 state nodes, and |e_m| = 20 edges. to consider weighted edges, one proceeds as in ordinary graphs by defining a function w : e_m → r. as in ordinary graphs, one can also incorporate self-edges and multi-edges. multilayer networks can include both intralayer edges (which have the same meaning as in graphs) and interlayer edges. the multilayer network in fig. 1(c) has 4 directed intralayer edges, 10 undirected intralayer edges, and 6 undirected interlayer edges. in most studies thus far of multilayer representations of temporal networks, researchers have included interlayer edges only between state nodes in consecutive layers and only between state nodes that are associated with the same entity (see fig. 1(c)). however, this restriction is not always desirable (see [184] for an example), and one can envision interlayer couplings that incorporate ideas like time horizons and interlayer edge weights that decay over time. for convenience, many researchers have used undirected interlayer edges in multilayer analyses of temporal networks, but it is often desirable for such edges to be directed to reflect the arrow of time [176]. the sequence of network layers, which constitute time layers, can represent a discrete-time temporal network at different time instances or a continuous-time network in which one bins (i.e., aggregates) the network's edges to form a sequence of time windows with interactions in each window. each d-aspect multilayer network with the same number of nodes in each layer has an associated adjacency tensor a of order 2(d + 1). for unweighted multilayer networks, each edge in e_m is associated with a 1 entry of a, and the other entries (the "missing" edges) are 0. if a multilayer network does not have the same number of nodes in each layer, one can add empty nodes so that it does, but the edges that are attached to such nodes are "forbidden". there has been some research on tensorial properties of a [35] (and it is worthwhile to undertake further studies of them), but the most common approach for computations is to flatten a into a "supra-adjacency matrix" a_m [90, 140], which is the adjacency matrix of the graph g_m that is associated with m.
the entries of diagonal blocks of a_m correspond to intralayer edges, and the entries of off-diagonal blocks correspond to interlayer edges. following a long line of research in sociology [37], two important ingredients in the study of networks are examining (1) the importances ("centralities") of nodes, edges, and other small network structures and the relationship of measures of importance to dynamical processes on networks and (2) the large-scale organization of networks [121, 193]. studying central nodes in networks is useful for numerous applications, such as ranking web pages, football teams, or physicists [56]. it can also help reveal the roles of nodes in networks, such as those that experience high traffic or help bridge different parts of a network [121, 193]. mesoscale features can impact network function and dynamics in important ways. small subgraphs called "motifs" may appear frequently in some networks [111], perhaps indicating fundamental structures such as feedback loops and other building blocks of global behavior [59]. various types of larger-scale network structures, such as dense "communities" of nodes [47, 145] and core-periphery structures [33, 150], are also sometimes related to dynamical modules (e.g., a set of synchronized neurons) or functional modules (e.g., a set of proteins that are important for a certain regulatory process) [164]. a common way to study large-scale structures is inference using statistical models of random networks, such as through stochastic block models (sbms) [134]. much recent research has generalized the study of large-scale network structure to temporal and multilayer networks [3, 74, 90]. various types of centrality, including betweenness centrality [88, 173], bonacich and katz centrality [65, 102], communicability [64], pagerank [151, 191], and eigenvector centrality [46, 146], have been generalized to temporal networks using a variety of approaches. such generalizations make it possible to examine how node importances change over time as network structure evolves. in recent work, my collaborators and i used multilayer representations of temporal networks to generalize eigenvector-based centralities to temporal networks [175, 176]. one computes the eigenvector-based centralities of nodes for a time-independent network as the entries of the "dominant" eigenvector, which is associated with the largest positive eigenvalue (by the perron-frobenius theorem, the eigenvalue with the largest magnitude is guaranteed to be positive in these situations) of a centrality matrix c(a). examples include eigenvector centrality (by using c(a) = a) [17], hub and authority scores (by using c(a) = a a^t for hubs and c(a) = a^t a for authorities) [91], and pagerank [56]. given a discrete-time temporal network in the form of a sequence of adjacency matrices a^(t), where a^(t)_{ij} denotes a directed edge from entity i to entity j in time layer t, we construct a "supracentrality matrix" c(ω), which couples centrality matrices c(a^(t)) of the individual time layers. we then compute the dominant eigenvector of c(ω), where ω is an interlayer coupling strength. in [175, 176], a key example was the ranking of doctoral programs in the mathematical sciences (using data from the mathematics genealogy project [147]), where an edge from one institution to another arises when someone with a ph.d. from the first institution supervises a ph.d. student at the second institution.
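the following minimal python sketch illustrates one simple way to realize this coupling: it places per-layer centrality matrices c(a^(t)) = a^(t) on the diagonal blocks, couples state nodes of the same entity in adjacent time layers with strength ω, and reads node-by-layer centralities off the dominant eigenvector. the three-node, three-layer data, the nearest-neighbour-in-time identity coupling, and the choice c(a) = a are illustrative assumptions rather than the exact construction of [175, 176].

```python
import numpy as np

# Toy sketch: couple per-layer centrality matrices C(A^(t)) into a block
# matrix with interlayer coupling omega, then take the dominant eigenvector.
# Here C(A) = A (eigenvector centrality); the random 3-node, 3-layer data and
# the nearest-neighbour-in-time identity coupling are illustrative choices.
rng = np.random.default_rng(0)
n, T = 3, 3
layers = [rng.integers(0, 2, size=(n, n)) for _ in range(T)]   # A^(t)

def supracentrality(layers, omega):
    n, T = layers[0].shape[0], len(layers)
    C = np.zeros((n * T, n * T))
    for t, A in enumerate(layers):                 # diagonal blocks: C(A^(t))
        C[t*n:(t+1)*n, t*n:(t+1)*n] = A
    I = np.eye(n)
    for t in range(T - 1):                         # couple adjacent time layers
        C[t*n:(t+1)*n, (t+1)*n:(t+2)*n] = omega * I
        C[(t+1)*n:(t+2)*n, t*n:(t+1)*n] = omega * I
    vals, vecs = np.linalg.eig(C)
    v = np.abs(vecs[:, np.argmax(vals.real)])      # dominant eigenvector
    return v.reshape(T, n)                         # centrality of node i at time t

for omega in (0.1, 10.0):
    print("omega =", omega)
    print(np.round(supracentrality(layers, omega), 3))
```

in this toy example, larger ω should flatten each node's centrality trajectory across the three layers, in line with the ranking-consistency effect noted below.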
by calculating time-dependent centralities, we can study how the rankings of mathematical-sciences doctoral programs change over time and the dependence of such rankings on the value of ω. larger values of ω impose more ranking consistency across time, so centrality trajectories are less volatile for larger ω [175, 176]. multilayer representations of temporal networks have been very insightful in the detection of communities and how they split, merge, and otherwise evolve over time. numerous methods for community detection, including inference via sbms [135], maximization of objective functions (especially "modularity") [117], and methods based on random walks and bottlenecks to their traversal of a network [38, 80], have been generalized from graphs to multilayer networks. they have yielded insights in a diverse variety of applications, including brain networks [183], granular materials [129], political voting networks [113, 117], disease spreading [158], and ecology and animal behavior [45, 139]. to assist with such applications, there are efforts to develop and analyze multilayer random-network models that incorporate rich and flexible structures [11], such as diverse types of interlayer correlations. activity-driven (ad) models of temporal networks [136] are a popular family of generative models that encode instantaneous time-dependent descriptions of network dynamics through a function called an "activity potential", which encodes the mechanism to generate connections and characterizes the interactions between entities in a network. an activity potential encapsulates all of the information about the temporal network dynamics of an ad model, making it tractable to study dynamical processes (such as ones from section 4) on networks that are generated by such a model. it is also common to compare the properties of networks that are generated by ad models to those of empirical temporal networks [74]. in the original ad model of perra et al. [136], one considers a network with n entities, which we encode by the nodes. we suppose that node i has an activity rate a_i = ηx_i, which gives the probability per unit time to create new interactions with other nodes. the scaling factor η ensures that the mean number of active nodes per unit time is η⟨x⟩n. we define the activity rates such that x_i ∈ [ε, 1], where ε > 0, and we assign each x_i from a probability distribution f(x) that can either take a desired functional form or be constructed from empirical data. the model uses the following generative process (see the sketch below):
• at each discrete time step (of length ∆t), start with a network g_t that consists of n isolated nodes.
• with a probability a_i ∆t that is independent of other nodes, node i is active and generates m edges, each of which attaches to other nodes uniformly (i.e., with the same probability for each node) and independently at random (without replacement). nodes that are not active can still receive edges from active nodes.
• at the next time step t + ∆t, we delete all edges from g_t, so all interactions have a constant duration of ∆t. we then generate new interactions from scratch.
this is convenient, as it allows one to apply techniques from markov chains. because entities in time step t do not have any memory of previous time steps, f(x) encodes the network structure and dynamics. the ad model of perra et al. [136] is overly simplistic, but it is amenable to analysis and has provided a foundation for many more general ad models, including ones that incorporate memory [200].
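the following is a minimal python sketch of this generative process under stated assumptions: the activities x_i are drawn from a truncated power law f(x) ∝ x^(-γ) on [ε, 1] (a common choice in the ad literature, but an assumption here), and the values of n, m, η, ε, ∆t, and γ are illustrative.

```python
import random

# Sketch of the activity-driven generative steps described above. The activity
# distribution F(x) (a truncated power law) and all parameter values are
# illustrative choices, not those of any particular empirical study.
random.seed(0)
N, m, eta, eps, dt = 500, 2, 10.0, 1e-3, 0.01
gamma = 2.1                                   # exponent of the power-law F(x)

# Draw activities x_i in [eps, 1] from F(x) ~ x^(-gamma) via inverse transform.
x = [(eps**(1 - gamma) + random.random() *
      (1.0 - eps**(1 - gamma)))**(1 / (1 - gamma)) for _ in range(N)]
a = [eta * xi for xi in x]                    # activity rates a_i = eta * x_i

def one_step():
    """Generate the edge list of one time step; edges are discarded afterwards."""
    edges = []
    for i in range(N):
        if random.random() < a[i] * dt:       # node i activates
            targets = random.sample([j for j in range(N) if j != i], m)
            edges.extend((i, j) for j in targets)
    return edges

snapshot = one_step()
print("edges generated in this time step:", len(snapshot))
```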
in section 6.4, i discuss a generalization of ad models to simplicial complexes [137] that allows one to study instantaneous interactions that involve three or more entities in a network. many networked systems evolve continuously in time, but most investigations of time-dependent networks rely on discrete or discretized time. it is important to undertake more analysis of continuous-time temporal networks. researchers have examined continuous-time networks in a variety of scenarios. examples include a compartmental model of biological contagions [185], a generalization of katz centrality to continuous time [65], generalizations of ad models (see section 3.1.3) to continuous time [198, 199], and rankings in competitive sports [115]. in a recent paper [2], my collaborators and i formulated a notion of "tie-decay networks" for studying networks that evolve in continuous time. we distinguished between interactions, which we modeled as discrete contacts, and ties, which encode relationships and their strength as a function of time. for example, perhaps the strength of a tie decays exponentially after the most recent interaction. more realistically, perhaps the decay rate depends on the weight of a tie, with strong ties decaying more slowly than weak ones. one can also use point-process models like hawkes processes [99] to examine similar ideas using a node-centric perspective. suppose that there are n interacting entities, and let b(t) be the n × n time-dependent, real, non-negative matrix whose entries b_{ij}(t) encode the tie strength between agents i and j at time t. in [2], we made the following simplifying assumptions:
1. as in [81], ties decay exponentially when there are no interactions: db_{ij}(t)/dt = -α b_{ij}(t), where α ≥ 0 is the decay rate.
2. if two entities interact at time t = τ, the strength of the tie between them grows instantaneously by 1.
see [201] for a comparison of various choices, including those in [2] and [81], for tie evolution over time. in practice (e.g., in data-driven applications), one obtains b(t) by discretizing time, so let's suppose that there is at most one interaction during each time step of length ∆t. this occurs, for example, in a poisson process. such time discretization is common in the simulation of stochastic dynamical systems, such as in gillespie algorithms [41, 142, 189]. consider an n × n matrix a(t) in which a_{ij}(t) = 1 if node i interacts with node j at time t and a_{ij}(t) = 0 otherwise. for a directed network, a(t) has exactly one nonzero entry during each time step when there is an interaction and no nonzero entries when there isn't one. for an undirected network, because of the symmetric nature of interactions, there are exactly two nonzero entries in time steps that include an interaction. we write b(t + ∆t) = e^{-α ∆t} b(t) + a(t + ∆t). equivalently, if interactions between entities occur at times τ^{(ℓ)} such that 0 ≤ τ^{(0)} < τ^{(1)} < · · · < τ^{(T)}, then at time t ≥ τ^{(T)}, we have b(t) = \sum_{ℓ=0}^{T} e^{-α(t - τ^{(ℓ)})} a(τ^{(ℓ)}). in [2], my coauthors and i generalized pagerank [20, 56] to tie-decay networks. one nice feature of this tie-decay pagerank is that it is applicable not just to data sets, but also to data streams, as one updates the pagerank values as new data arrives. by contrast, one problematic feature of many methods that rely on multilayer representations of temporal networks is that one needs to recompute everything for an entire data set upon acquiring new data, rather than updating prior results in a computationally efficient way.
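a minimal python sketch of this tie-decay bookkeeping, assuming b(0) = 0, an invented stream of undirected interactions, and illustrative values of α and ∆t:

```python
import numpy as np

# Sketch of the tie-decay bookkeeping described above: between interactions,
# tie strengths decay exponentially at rate alpha; an interaction adds 1 to the
# corresponding entry. Discretizing time with step dt gives
# B(t + dt) = exp(-alpha * dt) * B(t) + A(t + dt). The interaction stream
# below is invented for illustration.
n, alpha, dt = 4, 0.1, 1.0
B = np.zeros((n, n))

interactions = {2: (0, 1), 3: (1, 2), 10: (0, 1)}   # time step -> (i, j) contact

for step in range(1, 16):
    A = np.zeros((n, n))
    if step in interactions:
        i, j = interactions[step]
        A[i, j] = A[j, i] = 1                        # undirected interaction
    B = np.exp(-alpha * dt) * B + A                  # decay, then add new contacts
    if step in (3, 10, 15):
        print(f"t={step}: B[0,1]={B[0,1]:.3f}, B[1,2]={B[1,2]:.3f}")
```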
a dynamical process can be discrete, continuous, or some mixture of the two; it can also be either deterministic or stochastic. it can take the form of one or several coupled ordinary differential equations (odes), partial differential equations (pdes), maps, stochastic differential equations, and so on. a dynamical process requires a rule for updating the states of its dependent variables with respect one or more independent variables (e.g., time), and one also has (one or a variety of) initial conditions and/or boundary conditions. to formalize a dynamical process on a network, one needs a rule for how to update the states of the nodes and/or edges. the nodes (of one or more types) of a network are connected to each other in nontrivial ways by one or more types of edges. this leads to a natural question: how does nontrivial connectivity between nodes affect dynamical processes on a network [142] ? when studying a dynamical process on a network, the network structure encodes which entities (i.e., nodes) of a system interact with each other and which do not. if desired, one can ignore the network structure entirely and just write out a dynamical system. however, keeping track of network structure is often a very useful and insightful form of bookkeeping, which one can exploit to systematically explore how particular structures affect the dynamics of particular dynamical processes. prominent examples of dynamical processes on networks include coupled oscillators [6, 149] , games [78] , and the spread of diseases [89, 130] and opinions [23, 100] . there is also a large body of research on the control of dynamical processes on networks [103, 116] . most studies of dynamics on networks have focused on extending familiar models -such as compartmental models of biological contagions [89] or kuramoto phase oscillators [149] -by coupling entities using various types of network structures, but it is also important to formulate new dynamical processes from scratch, rather than only studying more complicated generalizations of our favorite models. when trying to illuminate the effects of network structure on a dynamical process, it is often insightful to provide a baseline comparison by examining the process on a convenient ensemble of random networks [142] . a simple, but illustrative, dynamical process on a network is the watts threshold model (wtm) of a social contagion [100, 142] . it provides a framework for illustrating how network structure can affect state changes, such as the adoption of a product or a behavior, and for exploring which scenarios lead to "virality" (in the form of state changes of a large number of nodes in a network). the original wtm [194] , a binary-state threshold model that resembles bootstrap percolation [24] , has a deterministic update rule, so stochasticity can come only from other sources (see section 4.2). in a binary state model, each node is in one of two states; see [55] for a tabulation of well-known binary-state dynamics on networks. the wtm is a modification of mark granovetter's threshold model for social influence in a fully-mixed population [62] . see [86, 186] for early work on threshold models on networks that developed independently from investigations of the wtm. threshold contagion models have been developed for many scenarios, including contagions with multiple stages [109] , models with adoption latency [124] , models with synergistic interactions [83] , and situations with hipsters (who may prefer to adopt a minority state) [84] . 
in a binary-state threshold model such as the wtm, each node i has a threshold r_i that one draws from some distribution. suppose that r_i is constant in time, although one can generalize it to be time-dependent. at any time, each node can be in one of two states: 0 (which represents being inactive, not adopted, not infected, and so on) or 1 (active, adopted, infected, and so on). a binary-state model is a drastic oversimplification of reality, but the wtm is able to capture two crucial features of social systems [125]: interdependence (an entity's behavior depends on the behavior of other entities) and heterogeneity (as nodes with different threshold values behave differently). one can assign a seed number or seed fraction of nodes to the active state, and one can choose the initially active nodes either deterministically or randomly. the states of the nodes change in time according to an update rule, which can either be synchronous (such that it is a map) or asynchronous (e.g., as a discretization of continuous time) [142]. in the wtm, the update rule is deterministic, so this choice affects only how long it takes to reach a steady state; it does not affect the steady state itself. with a stochastic update rule, the synchronous and asynchronous versions of ostensibly the "same" model can behave in drastically different ways [43]. in the wtm on an undirected network, to update the state of a node, one compares its fraction s_i/k_i of active neighbors (where s_i is the number of active neighbors and k_i is the degree of node i) to the node's threshold r_i. an inactive node i becomes active (i.e., it switches from state 0 to state 1) if s_i/k_i ≥ r_i; otherwise, it stays inactive. the states of nodes in the wtm are monotonic, in the sense that a node that becomes active remains active forever. this feature is convenient for deriving accurate approximations for the global behavior of the wtm using branching-process approximations [55, 142] or when analyzing the behavior of the wtm using tools such as persistent homology [174]. a dynamical process on a network can take the form of a stochastic process [121, 142]. there are several possible sources of stochasticity: (1) choice of initial condition, (2) choice of which nodes or edges to update (when considering asynchronous updating), (3) the rule for updating nodes or edges, (4) the values of parameters in an update rule, and (5) selection of particular networks from a random-graph ensemble (i.e., a probability distribution on graphs). some or all of these sources of randomness can be present when studying dynamical processes on networks. it is desirable to compare the sample mean of a stochastic process on a network to an ensemble average (i.e., to an expectation over a suitable probability distribution). prominent examples of stochastic processes on networks include percolation [153], random walks [107], compartment models of biological contagions [89, 130], bounded-confidence models with continuous-valued opinions [110], and other opinion and voter models [23, 100, 142, 148]. compartmental models of biological contagions are a topic of intense interest in network science [89, 121, 130, 142]. a compartment represents a possible state of a node; examples include susceptible, infected, zombified, vaccinated, and recovered. an update rule determines how a node changes its state from one compartment to another.
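before continuing with compartmental models, here is a minimal synchronous-update python sketch of the wtm rule described above (node i activates when s_i/k_i ≥ r_i); the small circulant network, the threshold distribution, and the seed set are invented for illustration.

```python
import random

# Minimal synchronous-update sketch of the threshold rule described above
# (node i activates when s_i / k_i >= r_i). The circulant network, the
# threshold distribution, and the seed set are invented for illustration.
random.seed(3)
n = 20
neighbors = {i: {(i - 2) % n, (i - 1) % n, (i + 1) % n, (i + 2) % n}
             for i in range(n)}                       # each node has degree 4
r = {i: random.uniform(0.2, 0.4) for i in range(n)}   # thresholds r_i
active = {0, 1}                                       # seed nodes

while True:
    newly_active = {i for i in range(n) if i not in active and
                    len(neighbors[i] & active) / len(neighbors[i]) >= r[i]}
    if not newly_active:                              # steady state reached
        break
    active |= newly_active                            # monotonic: stay active

print("final number of active nodes:", len(active))
```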
one can formulate models with as many compartments as desired [18], but investigations of how network structure affects dynamics typically have employed examples with only two or three compartments [89, 130]. researchers have studied various extensions of compartmental models, contagions on multilayer and temporal networks [4, 34, 90], metapopulation models on networks [30] for simultaneously studying network connectivity and subpopulations with different characteristics, non-markovian contagions on networks for exploring memory effects [188], and explicit incorporation of individuals with essential societal roles (e.g., health-care workers) [161]. as i discuss in section 4.4, one can also examine coupling between biological contagions and the spread of information (e.g., "awareness") [50, 192]. one can also use compartmental models to study phenomena, such as dissemination of ideas on social media [58] and forecasting of political elections [190], that are much different from the spread of diseases. one of the most prominent examples of a compartmental model is a susceptible-infected-recovered (sir) model, which has three compartments. susceptible nodes are healthy and can become infected, and infected nodes can eventually recover. the steady state of the basic sir model on a network is related to a type of bond percolation [63, 68, 87, 181]. there are many variants of sir models and other compartmental models on networks [89]. see [114] for an illustration using susceptible-infected-susceptible (sis) models. suppose that an infection is transmitted from an infected node to a susceptible neighbor at a rate of λ. the probability of a transmission event on one edge between an infected node and a susceptible node in an infinitesimal time interval dt is λ dt. assuming that all infection events are independent, the probability that a susceptible node with s infected neighbors becomes infected (i.e., for a node to transition from the s compartment to the i compartment, which represents both being infected and being infective) during dt is 1 - (1 - λ dt)^s. if an infected node recovers at a constant rate of µ, the probability that it switches from state i to state r in an infinitesimal time interval dt is µ dt. when there is no source of stochasticity, a dynamical process on a network is "deterministic". a deterministic dynamical system can take the form of a system of coupled maps, odes, pdes, or something else. as with stochastic systems, the network structure encodes which entities of a system interact with each other and which do not. there are numerous interesting deterministic dynamical systems on networks (just incorporate nontrivial connectivity between entities into your favorite deterministic model), although it is worth noting that some stochastic features (e.g., choosing parameter values from a probability distribution or sampling choices of initial conditions) can arise in these models. for concreteness, let's consider the popular setting of coupled oscillators. each node in a network is associated with an oscillator, and we want to examine how network structure affects the collective behavior of the coupled oscillators. it is common to investigate various forms of synchronization (a type of coherent behavior), such that the rhythms of the oscillators adjust to match each other (or to match a subset of the oscillators) because of their interactions [138].
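before turning to coupled oscillators, here is a discrete-time python sketch of the sir transition probabilities described above; the erdős-rényi-style contact network, the parameter values, and the time step are illustrative assumptions.

```python
import random

# Discrete-time sketch of the SIR probabilities described above: a susceptible
# node with s infected neighbours becomes infected with probability
# 1 - (1 - lam*dt)**s, and an infected node recovers with probability mu*dt.
# The random contact network and the parameter values are illustrative.
random.seed(7)
n, p, lam, mu, dt = 200, 0.03, 0.8, 0.2, 0.1
adj = {i: set() for i in range(n)}
for i in range(n):
    for j in range(i + 1, n):
        if random.random() < p:
            adj[i].add(j); adj[j].add(i)

state = {i: "S" for i in range(n)}
for i in random.sample(range(n), 5):          # initial infected seeds
    state[i] = "I"

for t in range(400):
    new_state = dict(state)
    for i in range(n):
        if state[i] == "S":
            s = sum(1 for j in adj[i] if state[j] == "I")
            if s and random.random() < 1 - (1 - lam * dt) ** s:
                new_state[i] = "I"
        elif state[i] == "I" and random.random() < mu * dt:
            new_state[i] = "R"
    state = new_state
    if not any(v == "I" for v in state.values()):
        break

print("final recovered fraction:", sum(v == "R" for v in state.values()) / n)
```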
a variety of methods, such as "master stability functions" [132], have been developed to study the local stability of synchronized states and their generalizations [6, 142], such as cluster synchrony [133]. cluster synchrony, which is related to work on "coupled-cell networks" [59], uses ideas from computational group theory to find synchronized sets of oscillators that are not synchronized with other sets of synchronized oscillators. many studies have also examined other types of states, such as "chimera states" [128], in which some oscillators behave coherently but others behave incoherently. (analogous phenomena sometimes occur in mathematics departments.) a ubiquitous example is coupled kuramoto oscillators on a network [6, 39, 149], which is perhaps the most common setting for exploring and developing new methods for studying coupled oscillators. (in principle, one can then build on these insights in studies of other oscillatory systems, such as in applications in neuroscience [7].) coupled kuramoto oscillators have been used for modeling numerous phenomena, including jetlag [104] and singing in frogs [126]. indeed, a "snowbird" (siam) conference on applied dynamical systems would not be complete without at least several dozen talks on the kuramoto model. in the kuramoto model, each node i has an associated phase θ_i(t) ∈ [0, 2π). in the case of "diffusive" coupling between the nodes, the dynamics of the ith node is governed by the equation

dθ_i/dt = ω_i + \sum_{j=1}^{n} b_{ij} a_{ij} f_{ij}(θ_j - θ_i) ,   (4)

where one typically draws the natural frequency ω_i of node i from some distribution g(ω), the scalar a_{ij} is an adjacency-matrix entry of an unweighted network, b_{ij} is the coupling strength on oscillator i from oscillator j (so b_{ij} a_{ij} is an element of an adjacency matrix w of a weighted network), and f_{ij}(y) = sin(y) is the coupling function, which depends only on the phase difference between oscillators i and j because of the diffusive nature of the coupling. once one knows the natural frequencies ω_i, the model (4) is a deterministic dynamical system, although there have been studies of coupled kuramoto oscillators with additional stochastic terms [60]. traditional studies of (4) and its generalizations draw the natural frequencies from some distribution (e.g., a gaussian or a compactly supported distribution), but some studies of so-called "explosive synchronization" (in which there is an abrupt phase transition from incoherent oscillators to synchronized oscillators) have employed deterministic natural frequencies [16, 39]. the properties of the frequency distribution g(ω) have a significant effect on the dynamics of (4). important features of g(ω) include whether it has compact support or not, whether it is symmetric or asymmetric, and whether it is unimodal or not [149, 170]. the model (4) has been generalized in numerous ways. for example, researchers have considered a large variety of coupling functions f_{ij} (including ones that are not diffusive) and have incorporated an inertia term d²θ_i/dt² to yield a second-order kuramoto oscillator at each node [149]. the latter generalization is important for studies of coupled oscillators and synchronized dynamics in electric power grids [196]. another noteworthy direction is the analysis of the kuramoto model on "graphons" (see, e.g., [108]), an important type of structure that arises in a suitable limit of large networks.
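an euler-integration python sketch of dynamics of the form (4) with f_{ij}(y) = sin(y) on an invented all-to-all network; the gaussian frequency distribution, the uniform coupling strength k/n, and the time step are illustrative assumptions. the order parameter r = |⟨e^{iθ_j}⟩| quantifies the degree of synchrony.

```python
import numpy as np

# Euler-integration sketch of Kuramoto dynamics with diffusive coupling
# f_ij(y) = sin(y) on an invented all-to-all network. Natural frequencies,
# coupling strength, and time step are illustrative choices.
rng = np.random.default_rng(2)
N, K, dt, steps = 50, 2.0, 0.01, 5000
omega = rng.normal(0.0, 1.0, N)              # natural frequencies drawn from g(omega)
theta = rng.uniform(0, 2 * np.pi, N)
A = np.ones((N, N)) - np.eye(N)              # unweighted all-to-all a_ij
B = (K / N) * np.ones((N, N))                # uniform coupling strengths b_ij

for _ in range(steps):
    phase_diff = theta[None, :] - theta[:, None]       # entry (i, j) = theta_j - theta_i
    theta = theta + dt * (omega + (B * A * np.sin(phase_diff)).sum(axis=1))

r = np.abs(np.exp(1j * theta).mean())        # order parameter: 1 = full synchrony
print("order parameter r =", round(float(r), 3))
```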
an increasingly prominent topic in network analysis is the examination of how multilayer network structures -multiple system components, multiple types of edges, co-occurrence and coupling of multiple dynamical processes, and so onaffect qualitative and quantitative dynamics [3, 34, 90] . for example, perhaps certain types of multilayer structures can induce unexpected instabilities or phase transitions in certain types of dynamical processes? there are two categories of dynamical processes on multilayer networks: (1) a single process can occur on a multilayer network; or (2) processes on different layers of a multilayer network can interact with each other [34] . an important example of the first category is a random walk, where the relative speeds and probabilities of steps within layers versus steps between layers affect the qualitative nature of the dynamics. this, in turn, affects methods (such as community detection [38, 80] ) that are based on random walks, as well as anything else in which the diffusion is relevant [22, 36] . two other examples of the first category are the spread of information on social media (for which there are multiple communication channels, such as facebook and twitter) and multimodal transportation systems [51] . for instance, a multilayer network structure can induce congestion even when a system without coupling between layers is decongested in each layer independently [1] . examples of the second category of dynamical process are interactions between multiple strains of a disease and interactions between the spread of disease and the spread of information [49, 50, 192] . many other examples have been studied [3] , including coupling between oscillator dynamics on one layer and a biased random walk on another layer (as a model for neuronal oscillations coupled to blood flow) [122] . numerous interesting phenomena can occur when dynamical systems, such as spreading processes, are coupled to each other [192] . for example, the spreading of one disease can facilitate infection by another [157] , and the spread of awareness about a disease can inhibit spread of the disease itself (e.g., if people stay home when they are sick) [61] . interacting spreading processes can also exhibit other fascinating dynamics, such as oscillations that are induced by multilayer network structures in a biological contagion with multiple modes of transmission [79] and novel types of phase transitions [34] . a major simplification in most work thus far on dynamical processes on multilayer networks is a tendency to focus on toy models. for example, a typical study of coupled spreading processes may consider a standard (e.g., sir) model on each layer, and it may draw the connectivity pattern of each layer from the same standard random-graph model (e.g., an erdős-rényi model or a configuration model). however, when studying dynamics on multilayer networks, it is particular important in future work to incorporate heterogeneity in network structure and/or dynamical processes. for instance, diseases spread offline but information spreads both offline and online, so investigations of coupled information and disease spread ought to consider fundamentally different types of network structures for the two processes. network structures also affect the dynamics of pdes on networks [8, 31, 57, 77, 112] . 
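before moving on to pdes, the following sketch makes the first category of multilayer dynamics above concrete: it encodes a two-layer multiplex network as a supra-adjacency matrix and computes the stationary distribution of an unbiased random walk on it. the layer topologies and the interlayer coupling weight omega are illustrative assumptions; the relative size of omega controls how probable interlayer steps are compared with intralayer ones, which is the kind of relative rate that can change the qualitative nature of such dynamics.

```python
import numpy as np
import networkx as nx

# two layers on the same node set, coupled by interlayer edges of weight omega.
n, omega = 50, 1.0
layer1 = nx.to_numpy_array(nx.erdos_renyi_graph(n, 0.10, seed=1))
layer2 = nx.to_numpy_array(nx.erdos_renyi_graph(n, 0.05, seed=2))

# supra-adjacency matrix: diagonal blocks are the layers, off-diagonal blocks
# couple each node to its replica in the other layer.
supra = np.block([[layer1, omega * np.eye(n)],
                  [omega * np.eye(n), layer2]])

# transition matrix of an unbiased random walk on the multilayer network.
strengths = supra.sum(axis=1)
P = supra / strengths[:, None]

# stationary distribution via power iteration on the left eigenvector of P.
pi = np.full(2 * n, 1.0 / (2 * n))
for _ in range(2000):
    pi = pi @ P
    pi /= pi.sum()

# aggregate the stationary probability of each physical node over its replicas.
node_occupation = pi[:n] + pi[n:]
print("most visited physical node:", int(np.argmax(node_occupation)))
```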
interesting examples include a study of a burgers equation on graphs to investigate how network structure affects the propagation of shocks [112] and investigations of reaction-diffusion equations and turing patterns on networks [8, 94] . the latter studies exploit the rich theory of laplacian dynamics on graphs (and concomitant ideas from spectral graph theory) [107, 187] and examine the addition of nonlinear terms to laplacians on various types of networks (including multilayer ones). a mathematically oriented thread of research on pdes on networks has built on ideas from so-called "quantum graphs" [57, 96] to study wave propagation on networks through the analysis of "metric graphs". metric graphs differ from the usual "combinatorial graphs", which in other contexts are usually called simply "graphs". 6 in metric graphs, in addition to nodes and edges, each edge e has a positive length l e ∈ (0, ∞]. for many experimentally relevant scenarios (e.g., in models of circuits of quantum wires [195] ), there is a natural embedding into space, but metric graphs that are not embedded in space are also appropriate for some applications. as the nomenclature suggests, one can equip a metric graph with a natural metric. if a sequence {e j } m j=1 of edges forms a path, the length of the path is j l j . the distance ρ(v 1 , v 2 ) between two nodes, v 1 and v 2 , is the minimum path length between them. we place coordinates along each edge, so we can compute a distance between points x 1 and x 2 on a metric graph even when those points are not located at nodes. traditionally, one assumes that the infinite ends (which one can construe as "leads" at infinity, as in scattering theory) of infinite edges have degree 1. it is also traditional to assume that there is always a positive distance between distinct nodes and that there are no finite-length paths with infinitely many edges. see [96] for further discussion. to study waves on metric graphs, one needs to define operators, such as the negative second derivative or more general schrödinger operators. this exploits the fact that there are coordinates for all points on the edges -not only at the nodes themselves, as in combinatorial graphs. when studying waves on metric graphs, it is also necessary to impose boundary conditions at the nodes [96] . many studies of wave propagation on metric graphs have considered generalizations of nonlinear wave equations, such as the cubic nonlinear schrödinger (nls) equation [123] and a nonlinear dirac equation [154] . the overwhelming majority of studies in metric graphs (with both linear and nonlinear waves) have focused on networks with a very small number of nodes, as even small networks yield very interesting dynamics. for example, marzuola and pelinovsky [106] analyzed symmetry-breaking and symmetry-preserving bifurcations of standing waves of the cubic nls on a dumbbell graph (with two rings attached to a central line segment and kirchhoff boundary conditions at the nodes). kairzhan et al. [85] studied the spectral stability of half-soliton standing waves of the cubic nls equation on balanced star graphs. sobirov et al. [168] studied scattering and transmission at nodes of sine-gordon solitons on networks (e.g., on a star graph and a small tree). a particularly interesting direction for future work is to study wave dynamics on large metric graphs. this will help extend investigations, as in odes and maps, of how network structures affect dynamics on networks to the realm of linear and nonlinear waves. 
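the following small sketch illustrates the metric-graph notions just defined: it equips the edges of a star graph with positive lengths and computes distances between points that need not sit at nodes. the graph, the edge lengths, and the helper-function names are invented for illustration, and no differential operators or boundary conditions are involved here.

```python
import networkx as nx

# a small metric graph: a star with a central node "c" and three leaves,
# where each edge carries a positive length l_e.
G = nx.Graph()
G.add_edge("c", "a", length=1.0)
G.add_edge("c", "b", length=2.5)
G.add_edge("c", "d", length=0.7)

def node_distance(graph, u, v):
    """distance between two nodes: the minimum total edge length over paths."""
    return nx.shortest_path_length(graph, u, v, weight="length")

def point_distance(graph, point1, point2):
    """distance between two points on a metric graph that need not be nodes.

    each point is given as ((u, v), s): it sits at arc length s from node u
    along the edge (u, v).
    """
    (u1, v1), s1 = point1
    (u2, v2), s2 = point2
    l1 = graph[u1][v1]["length"]
    l2 = graph[u2][v2]["length"]
    best = float("inf")
    if {u1, v1} == {u2, v2}:
        # both points lie on the same edge: they can be joined along it.
        best = abs(s1 - s2) if (u1, v1) == (u2, v2) else abs(s1 - (l2 - s2))
    # otherwise (and possibly also then), the shortest route passes through
    # an endpoint of each edge.
    for end1, off1 in ((u1, s1), (v1, l1 - s1)):
        for end2, off2 in ((u2, s2), (v2, l2 - s2)):
            best = min(best, off1 + node_distance(graph, end1, end2) + off2)
    return best

# distance between the midpoint of edge (c, a) and a point on edge (c, b).
print(point_distance(G, (("c", "a"), 0.5), (("c", "b"), 2.0)))
```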
one can readily formulate wave equations on large metric graphs by specifying relevant boundary conditions and rules at each junction. for example, joly et al. [82] recently examined wave propagation of the standard linear wave equation on fractal trees. because many natural real-life settings are spatially embedded (e.g., wave propagation in granular materials [101, 129] and traffic-flow patterns in cities), it will be particularly valuable to examine wave dynamics on (both synthetic and empirical) spatially-embedded networks [9] . therefore, i anticipate that it will be very insightful to undertake studies of wave dynamics on networks such as random geometric graphs, random neighborhood graphs, and other spatial structures. a key question in network analysis is how different types of network structure affect different types of dynamical processes [142] , and the ability to take a limit as model synthetic networks become infinitely large (i.e., a thermodynamic limit) is crucial for obtaining many key theoretical insights. dynamics of networks and dynamics on networks do not occur in isolation; instead, they are coupled to each other. researchers have studied the coevolution of network structure and the states of nodes and/or edges in the context of "adaptive networks" (which are also known as "coevolving networks") [66, 159] . whether it is sensible to study a dynamical process on a time-independent network, a temporal network with frozen (or no) node or edge states, or an adaptive network depends on the relative time scales of the dynamics of network structure and the states of nodes and/or edges of a network. see [142] for a brief discussion. models in the form of adaptive networks provide a promising mechanistic approach to simultaneously explain both structural features (e.g., degree distributions and temporal features (e.g., burstiness) of empirical data [5] . incorporating adaptation into conventional models can produce extremely interesting and rich dynamics, such as the spontaneous development of extreme states in opinion models [160] . most studies of adaptive networks that include some analysis (i.e., that go beyond numerical computations) have employed rather artificial adaption rules for adding, removing, and rewiring edges. this is relevant for mathematical tractability, but it is important to go beyond these limitations by considering more realistic types of adaptation and coupling between network structure (including multilayer structures, as in [12] ) and the states of nodes and edges. when people are sick, they stay home from work or school. people also form and remove social connections (both online and offline) based on observed opinions and behaviors. to study these ideas using adaptive networks, researchers have coupled models of biological and social contagions with time-dependent networks [100, 142] . an early example of an adaptive network of disease spreading is the susceptibleinfected (si) model in gross et al. [67] . in this model, susceptible nodes sometimes rewire their incident edges to "protect themselves". suppose that we have an n-node network with a constant number of undirected edges. each node is either susceptible (i.e., of type s) or infected (i.e., of type i). at each time step, and for each edge -so-called "discordant edges" -between nodes of different types, the susceptible node becomes infected with probability λ. 
for each discordant edge, with some probability κ, the incident susceptible node breaks the edge and rewires to some other susceptible node. this is a "rewire-to-same" mechanism, to use the language from some adaptive opinion models [40, 97] . (in this model, multi-edges and selfedges are not allowed.) during each time step, infected nodes can also recover to become susceptible again. gross et al. [67] studied how the rewiring probability affects the "basic reproductive number", which measures how many secondary infections on average occur for each primary infection [18, 89, 130] . this scalar quantity determines the size of a critical infection probability λ * to maintain a stable epidemic (as determined traditionally using linear stability analysis of an endemic state). a high rewiring rate can significantly increase λ * and thereby significantly reduce the prevalence of a contagion. although results like these are perhaps intuitively clear, other studies of contagions on adaptive networks have yielded potentially actionable (and arguably nonintuitive) insights. for example, scarpino et al. [161] demonstrated using an adaptive compartmental model (along with some empirical evidence) that the spread of a disease can accelerate when individuals with essential societal roles (e.g., health-care workers) become ill and are replaced with healthy individuals. another type of model with many interesting adaptive variants are opinion models [23, 142] , especially in the form of generalizations of classical voter models [148] . voter dynamics were first considered in the 1970s by clifford and sudbury [29] as a model for species competition, and the dynamical process that they introduced was dubbed "the voter model"7 by holley and liggett shortly thereafter [69] . voter dynamics are fun and are popular to study [148] , although it is questionable whether it is ever possible to genuinely construe voter models as models of voters [44] . holme and newman [71] undertook an early study of a rewire-to-same adaptive voter model. inspired by their research, durrett et al. [40] compared the dynamics from two different types of rewiring in an adaptive voter model. in each variant of their model, one considers an n-node network and supposes that each node is in one of two states. the network structure and the node states coevolve. pick an edge uniformly at random. if this edge is discordant, then with probability 1 − κ, one of its incident nodes adopts the opinion state of the other. otherwise, with complementary probability κ, a rewiring action occurs: one removes the discordant edge, and one of the associated nodes attaches to a new node either through a rewire-to-same mechanism (choosing uniformly at random among the nodes with the same opinion state) or through a "rewire-to-random" mechanism (choosing uniformly at random among all nodes). as with the adaptive si model in [67] , self-edges and multi-edges are not allowed. the models in [40] evolve until there are no discordant edges. there are several key questions. does the system reach a consensus (in which all nodes are in the same state)? if so, how long does it take to converge to consensus? if not, how many opinion clusters (each of which is a connected component, perhaps interpretable as an "echo chamber", of the final network) are there at steady state? how long does it take to reach this state? the answers and analysis are subtle; they depend on the initial network topology, the initial conditions, and the specific choice of rewiring rule. 
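the following sketch simulates an adaptive voter model in the spirit of the rewire-to-same and rewire-to-random variants described above; it runs until no discordant edges remain (or a step cap is reached) and reports the number of connected components (candidate "opinion clusters") and the final fraction of nodes holding opinion 1. the initial network, the parameter values, the tie-breaking rules, and the step cap are assumptions of this illustration rather than the exact protocol of [40].

```python
import random
import networkx as nx

def adaptive_voter(n=100, p=0.04, kappa=0.3, rewire="random", seed=0,
                   max_steps=20000):
    """adaptive voter model with rewire-to-same or rewire-to-random moves.

    at each step, pick a uniformly random edge; if it is discordant, then with
    probability 1 - kappa one endpoint copies the other's opinion, and
    otherwise the edge is removed and one endpoint is rewired (no self- or
    multi-edges; if no valid target exists, the step is skipped).
    """
    rng = random.Random(seed)
    G = nx.erdos_renyi_graph(n, p, seed=seed)
    opinion = {v: rng.choice((0, 1)) for v in G}
    for _ in range(max_steps):
        discordant = [(u, v) for u, v in G.edges if opinion[u] != opinion[v]]
        if not discordant:
            break
        u, v = rng.choice(discordant)
        if rng.random() < 1.0 - kappa:                 # imitation step
            keeper, follower = (u, v) if rng.random() < 0.5 else (v, u)
            opinion[follower] = opinion[keeper]
        else:                                          # rewiring step
            rewirer = u if rng.random() < 0.5 else v
            if rewire == "same":
                pool = [w for w in G if opinion[w] == opinion[rewirer]]
            else:
                pool = list(G)
            candidates = [w for w in pool
                          if w != rewirer and not G.has_edge(rewirer, w)]
            if candidates:
                G.remove_edge(u, v)
                G.add_edge(rewirer, rng.choice(candidates))
    clusters = nx.number_connected_components(G)
    return clusters, sum(opinion.values()) / n

print("rewire-to-same:", adaptive_voter(rewire="same"))
print("rewire-to-random:", adaptive_voter(rewire="random"))
```

comparing the two rewiring rules over many random seeds gives a rough numerical feel for how the final cluster structure depends on the rewiring mechanism and on κ.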
as with other adaptive network models, researchers have developed some nonrigorous theory (e.g., using mean-field approximations and their generalizations) on adaptive voter models with simplistic rewiring schemes, but they have struggled to extend these ideas to models with more realistic rewiring schemes. there are very few mathematically rigorous results on adaptive voter models, although there do exist some, under various assumptions on initial network structure and edge density [10]. researchers have generalized adaptive voter models to consider more than two opinion states [163] and more general types of rewiring schemes [105]. as with other adaptive networks, analyzing adaptive opinion models with increasingly diverse types of rewiring schemes (ideally with a move towards increasing realism) is particularly important. in [97], yacoub kureh and i studied a variant of a voter model with nonlinear rewiring (where the probability that a node rewires or adopts is a function of how well it "fits in" within its neighborhood), including a "rewire-to-none" scheme to model unfriending and unfollowing in online social networks. it is also important to study adaptive opinion models with more realistic types of opinion dynamics. a promising example is adaptive generalizations of bounded-confidence models (see the introduction of [110] for a brief review of bounded-confidence models), which have continuous opinion states, with nodes interacting either with nodes or with other entities (such as media [21]) whose opinion is sufficiently close to theirs. a recent numerical study examined an adaptive bounded-confidence model [19]; this is an important direction for future investigations. it is also interesting to examine how the adaptation of oscillators (including their intrinsic frequencies and/or the network structure that couples them to each other) affects the collective behavior (e.g., synchronization) of a network of oscillators [149]. such ideas are useful for exploring mechanistic models of learning in the brain (e.g., through adaptation of coupling between oscillators to produce a desired limit cycle [171]). one nice example is by skardal et al. [167], who examined an adaptive model of coupled kuramoto oscillators as a toy model of learning. first, we write the kuramoto system as
dθ_i/dt = ω_i + ∑_j b_ij a_ij f_ij(θ_j − θ_i) ,   (5)
where f_ij is a 2π-periodic function of the phase difference between oscillators i and j. one way to incorporate adaptation is to define an "order parameter" r_i (which, in its traditional form, quantifies the amount of coherence of the coupled kuramoto oscillators [149]) for the ith oscillator from the phases of its network neighbors and to consider an adaptive dynamical system, the model (6), in which the coupling strengths b_ij evolve in response to this local coherence; in this notation, re(ζ) denotes the real part of a quantity ζ and im(ζ) denotes its imaginary part. in the model (6), λ_d denotes the largest positive eigenvalue of the adjacency matrix a, the variable z_i(t) is a time-delayed version of r_i with time parameter τ (with τ → 0 implying that z_i → r_i), and z_i* denotes the complex conjugate of z_i. one draws the frequencies ω_i from some distribution (e.g., a lorentz distribution, as in [167]), and we recall that b_ij is the coupling strength on oscillator i from oscillator j. the parameter t gives an adaptation time scale, and α ∈ ℝ and β ∈ ℝ are parameters (which one can adjust to study bifurcations). skardal et al.
[167] interpreted scenarios with β > 0 as "hebbian" adaptation (see [27] ) and scenarios with β < 0 as anti-hebbian adaptation, as they observed that oscillator synchrony is promoted when β > 0 and inhibited when β < 0. most studies of networks have focused on networks with pairwise connections, in which each edge (unless it is a self-edge, which connects a node to itself) connects exactly two nodes to each other. however, many interactions -such as playing games, coauthoring papers and other forms of collaboration, and horse racesoften occur between three or more entities of a network. to examine such situations, researchers have increasingly studied "higher-order" structures in networks, as they can exert a major influence on dynamical processes. perhaps the simplest way to account for higher-order structures in networks is to generalize from graphs to "hypergraphs" [121] . hypergraphs possess "hyperedges" that encode a connection between on arbitrary number of nodes, such as between all coauthors of a paper. this allows one to make important distinctions, such as between a k-clique (in which there are pairwise connections between each pair of nodes in a set of k nodes) and a hyperedge that connects all k of those nodes to each other, without the need for any pairwise connections. one way to study a hypergraph is as a "bipartite network", in which nodes of a given type can be adjacent only to nodes of another type. for example, a scientist can be adjacent to a paper that they have written [119] , and a legislator can be adjacent to a committee on which they sit [144] . it is important to generalize ideas from graph theory to hypergraphs, such as by developing models of random hypergraphs [25, 26, 52 ]. another way to study higher-order structures in networks is to use "simplicial complexes" [53, 54, 127] . a simplicial complex is a space that is built from a union of points, edges, triangles, tetrahedra, and higher-dimensional polytopes (see fig. 1d ). simplicial complexes approximate topological spaces and thereby capture some of their properties. a p-dimensional simplex (i.e., a p-simplex) is a p-dimensional polytope that is the convex hull of its p + 1 vertices (i.e., nodes). a simplicial complex k is a set of simplices such that (1) every face of a simplex from s is also in s and (2) the intersection of any two simplices σ 1 , σ 2 ∈ s is a face of both σ 1 and σ 2 . an increasing sequence k 1 ⊂ k 2 ⊂ · · · ⊂ k l of simplicial complexes forms a filtered simplicial complex; each k i is a subcomplex. as discussed in [127] and references therein, one can examine the homology of each subcomplex. in studying the homology of a topological space, one computes topological invariants that quantify features of different dimensions [53] . one studies "persistent homology" (ph) of a filtered simplicial complex to quantify the topological structure of a data set (e.g., a point cloud) across multiple scales of such data. the goal of such "topological data analysis" (tda) is to measure the "shape" of data in the form of connected components, "holes" of various dimensionality, and so on [127] . from the perspective of network analysis, this yields insight into types of large-scale structure that complement traditional ones (such as community structure). see [178] for a friendly, nontechnical introduction to tda. a natural goal is to generalize ideas from network analysis to simplicial complexes. 
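as a concrete handle on these definitions, the following sketch builds the 2-skeleton of the clique complex of a small graph (every (k+1)-clique becomes a k-simplex, so the face-inclusion condition holds automatically) and computes its euler characteristic. the example graph is an arbitrary choice, and this is simple counting rather than a computation of persistent homology.

```python
import networkx as nx

def clique_complex(graph, max_dim=2):
    """simplices (up to dimension max_dim) of the clique complex of a graph:
    every (k+1)-clique becomes a k-simplex, so every face of an included
    simplex is automatically included as well."""
    all_cliques = list(nx.enumerate_all_cliques(graph))  # cliques of all sizes
    return {dim: [frozenset(c) for c in all_cliques if len(c) == dim + 1]
            for dim in range(max_dim + 1)}

G = nx.karate_club_graph()
S = clique_complex(G, max_dim=2)
n_nodes, n_edges, n_triangles = len(S[0]), len(S[1]), len(S[2])

# euler characteristic of the 2-skeleton:
# chi = (#0-simplices) - (#1-simplices) + (#2-simplices).
chi = n_nodes - n_edges + n_triangles
print(n_nodes, n_edges, n_triangles, "euler characteristic:", chi)
```

dedicated libraries (rather than this hand-rolled counting) are the practical route to persistent homology on filtered complexes built from data.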
important efforts include generalizing configuration models of random graphs [48] to random simplicial complexes [15, 32]; generalizing well-known network growth mechanisms, such as preferential attachment [13]; and developing geometric notions, like curvature, for networks [156]. an important modeling issue when studying higher-order network data is the question of when it is more appropriate (or convenient) to use the formalisms of hypergraphs or simplicial complexes. the computation of ph has yielded insights on a diverse set of models and applications in network science and complex systems. examples include granular materials [95, 129], functional brain networks [54, 165], quantification of "political islands" in voting data [42], percolation theory [169], contagion dynamics [174], swarming and collective behavior [179], chaotic flows in odes and pdes [197], diurnal cycles in tropical cyclones [182], and mathematics education [28]. see the introduction to [127] for pointers to numerous other applications. most uses of simplicial complexes in network science and complex systems have focused on tda (especially the computation of ph) and its applications [127, 131, 155]. in this chapter, however, i focus instead on a somewhat different (and increasingly popular) topic: the generalization of dynamical processes on and of networks to simplicial complexes to study the effects of higher-order interactions on network dynamics. simplicial structures influence the collective behavior of the dynamics of coupled entities on networks (e.g., they can lead to novel bifurcations and phase transitions), and they provide a natural approach to analyze p-entity interaction terms, including for p ≥ 3, in dynamical systems. existing work includes research on linear diffusion dynamics (in the form of hodge laplacians, such as in [162]) and generalizations of a variety of other popular types of dynamical processes on networks. given the ubiquitous study of coupled kuramoto oscillators [149], a sensible starting point for exploring the impact of simultaneous coupling of three or more oscillators on a system's qualitative dynamics is to study a generalized kuramoto model. for example, to include both two-entity ("two-body") and three-entity interactions in a model of coupled oscillators on networks, we write [172]
dx_i/dt = f_i(x_i) + ∑_{j,k} w_ijk(x_i, x_j, x_k) ,
where f_i describes the dynamics of oscillator i and the three-oscillator interaction term w_ijk includes two-oscillator interaction terms w_ij(x_i, x_j) as a special case. an example of n coupled kuramoto oscillators with three-term interactions is given in [172]; its interaction terms involve coefficients a_ij, b_ij, c_ijk and phase lags α_1ij, α_2ij, α_3ijk, α_4ijk, which one draws from various probability distributions. including three-body interactions leads to a large variety of intricate dynamics, and i anticipate that incorporating the formalism of simplicial complexes will be very helpful for categorizing the possible dynamics. in the last few years, several other researchers have also studied kuramoto models with three-body interactions [92, 93, 166]. a recent study [166], for example, discovered a continuum of abrupt desynchronization transitions with no counterpart in abrupt synchronization transitions. there have been mathematical studies of coupled oscillators with interactions of three or more entities using methods such as normal-form theory [14] and coupled-cell networks [59].
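the following sketch integrates kuramoto oscillators with an added three-body coupling term. the specific triplet interaction sin(θ_j + θ_k − 2θ_i) is one common choice in the literature and is not necessarily the exact form used in [172]; the network, the random triplets, and the coupling strengths are likewise assumptions of this illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(1)
n = 40
omega = rng.normal(0.0, 0.5, size=n)

# pairwise couplings a_ij and a set of random triplets with a common strength.
A = (rng.random((n, n)) < 0.15).astype(float)
A = np.triu(A, 1)
A = A + A.T
triplets = [tuple(rng.choice(n, size=3, replace=False)) for _ in range(200)]
k1, k2 = 1.0, 2.0                # pairwise and triplet coupling strengths

def rhs(t, theta):
    diffs = np.subtract.outer(theta, theta)      # theta_i - theta_j
    dtheta = omega + (k1 / n) * np.sum(A * np.sin(-diffs), axis=1)
    for i, j, k in triplets:
        # one common triplet interaction term: sin(theta_j + theta_k - 2*theta_i)
        dtheta[i] += (k2 / n) * np.sin(theta[j] + theta[k] - 2.0 * theta[i])
    return dtheta

theta0 = rng.uniform(0.0, 2.0 * np.pi, size=n)
sol = solve_ivp(rhs, (0.0, 30.0), theta0, max_step=0.05)
r = np.abs(np.mean(np.exp(1j * sol.y[:, -1])))
print(f"coherence with three-body terms: {r:.3f}")
```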
an important point, as one can see in the above discussion (which does not employ the mathematical formalism of simplicial complexes), is that one does not necessarily need to explicitly use the language of simplicial complexes to study interactions between three or more entities in dynamical systems. nevertheless, i anticipate that explicitly incorporating the formalism of simplicial complexes will be useful both for studying coupled oscillators on networks and for other dynamical systems. in upcoming studies, it will be important to determine when this formalism helps illuminate the dynamics of multi-entity interactions in dynamical systems and when simpler approaches suffice. several recent papers have generalized models of social dynamics by incorporating higher-order interactions [75, 76, 118, 137]. for example, perhaps somebody's opinion is influenced by a group discussion of three or more people, so it is relevant to consider opinion updates that are based on higher-order interactions. some of these papers use some of the terminology of simplicial complexes, but it is mostly unclear (except perhaps for [75]) how the models in them take advantage of the associated mathematical formalism, so arguably it often may be unnecessary to use such language. nevertheless, these models are very interesting and provide promising avenues for further research. petri and barrat [137] generalized activity-driven models to simplicial complexes. such a simplicial activity-driven (sad) model generates time-dependent simplicial complexes, on which it is desirable to study dynamical processes (see section 4), such as opinion dynamics, social contagions, and biological contagions. the simplest version of the sad model is defined as follows.
• each node i has an activity rate a_i that we draw independently from a distribution f(x).
• at each discrete time step (of length ∆t), we start with n isolated nodes. each node i is active with a probability of a_i ∆t, independently of all other nodes. if it is active, it creates a (p − 1)-simplex (forming, in network terms, a clique of p nodes) with p − 1 other nodes that we choose uniformly and independently at random (without replacement). one can either use a fixed value of p or draw p from some probability distribution.
• at the next time step, we delete all edges, so all interactions have a constant duration. we then generate new interactions from scratch.
this version of the sad model is markovian, and it is desirable to generalize it in various ways (e.g., by incorporating memory or community structure). iacopini et al. [76] recently developed a simplicial contagion model that generalizes an sis process on graphs. consider a simplicial complex k with n nodes, and associate each node i with a state x_i(t) ∈ {0, 1} at time t. if x_i(t) = 0, node i is part of the susceptible class s; if x_i(t) = 1, it is part of the infected class i. the density of infected nodes at time t is ρ(t) = (1/n) ∑_{i=1}^{n} x_i(t). suppose that there are d parameters β_1, . . . , β_d (with d ∈ {1, . . . , n − 1}), where β_d represents the probability per unit time that a susceptible node i that participates in a d-dimensional simplex σ is infected from each of the faces of σ, under the condition that all of the other nodes of the face are infected. that is, β_1 is the probability per unit time that node i is infected by an adjacent node j via the edge (i, j).
similarly, 2 is the probability per unit time that node i is infected via the 2-simplex (i, j, k) in which both j and k are infected, and so on. the recovery dynamics, in which an infected node i becomes susceptible again, proceeds as in the sir model that i discussed in section 4.2. one can envision numerous interesting generalizations of this model (e.g., ones that are inspired by ideas that have been investigated in contagion models on graphs). the study of networks is one of the most exciting and rapidly expanding areas of mathematics, and it touches on myriad other disciplines in both its methodology and its applications. network analysis is increasingly prominent in numerous fields of scholarship (both theoretical and applied), it interacts very closely with data science, and it is important for a wealth of applications. my focus in this chapter has been a forward-looking presentation of ideas in network analysis. my choices of which ideas to discuss reflect their connections to dynamics and nonlinearity, although i have also mentioned a few other burgeoning areas of network analysis in passing. through its exciting combination of graph theory, dynamical systems, statistical mechanics, probability, linear algebra, scientific computation, data analysis, and many other subjects -and through a comparable diversity of applications across the sciences, engineering, and the humanities -the mathematics and science of networks has plenty to offer researchers for many years. congestion induced by the structure of multiplex networks tie-decay temporal networks in continuous time and eigenvector-based centralities multilayer networks in a nutshell multilayer networks in a nutshell temporal and structural heterogeneities emerging in adaptive temporal networks synchronization in complex networks mathematical frameworks for oscillatory network dynamics in neuroscience turing patterns in multiplex networks morphogenesis of spatial networks evolving voter model on dense random graphs generative benchmark models for mesoscale structure in multilayer networks birth and stabilization of phase clusters by multiplexing of adaptive networks network geometry with flavor: from complexity to quantum geometry chaos in generically coupled phase oscillator networks with nonpairwise interactions topology of random geometric complexes: a survey explosive transitions in complex networksõ structure and dynamics: percolation and synchronization factoring and weighting approaches to clique identification mathematical models in population biology and epidemiology how does active participation effect consensus: adaptive network model of opinion dynamics and influence maximizing rewiring anatomy of a large-scale hypertextual web search engine a model for the influence of media on the ideology of content in online social networks frequency-based brain networks: from a multiplex network to a full multilayer description statistical physics of social dynamics bootstrap percolation on a bethe lattice configuration models of random hypergraphs annotated hypergraphs: models and applications hebbian learning architecture and evolution of semantic networks in mathematics texts a model for spatial conflict reaction-diffusion processes and metapopulation models in heterogeneous networks multiple-scale theory of topology-driven patterns on directed networks generalized network structures: the configuration model and the canonical ensemble of simplicial complexes structure and dynamics of core/periphery networks the physics of 
spreading processes in multilayer networks mathematical formulation of multilayer networks navigability of interconnected networks under random failures identifying modular flows on multilayer networks reveals highly overlapping organization in interconnected systems explosive phenomena in complex networks graph fission in an evolving voter model a practical guide to stochastic simulations of reaction-diffusion processes persistent homology of geospatial data: a case study with voting limitations of discrete-time approaches to continuous-time contagion dynamics is the voter model a model for voters? the use of multilayer network analysis in animal behaviour on eigenvector-like centralities for temporal networks: discrete vs. continuous time scales community detection in networks: a user guide configuring random graph models with fixed degree sequences nine challenges in incorporating the dynamics of behaviour in infectious diseases models modelling the influence of human behaviour on the spread of infectious diseases: a review anatomy and efficiency of urban multimodal mobility random hypergraphs and their applications elementary applied topology two's company, three (or more) is a simplex binary-state dynamics on complex networks: pair approximation and beyond quantum graphs: applications to quantum chaos and universal spectral statistics the structural virality of online diffusion patterns of synchrony in coupled cell networks with multiple arrows finite-size effects in a stochastic kuramoto model dynamical interplay between awareness and epidemic spreading in multiplex networks threshold models of collective behavior on the critical behavior of the general epidemic process and dynamical percolation a matrix iteration for dynamic network summaries a dynamical systems view of network centrality adaptive coevolutionary networks: a review epidemic dynamics on an adaptive network pathogen mutation modeled by competition between site and bond percolation ergodic theorems for weakly interacting infinite systems and the voter model modern temporal network theory: a colloquium nonequilibrium phase transition in the coevolution of networks and opinions temporal networks temporal networks temporal network theory an adaptive voter model on simplicial complexes simplical models of social contagion turing instability in reaction-diffusion models on complex networks games on networks the large graph limit of a stochastic epidemic model on a dynamic multilayer network a local perspective on community structure in multilayer networks structure of growing social networks wave propagation in fractal trees synergistic effects in threshold models on networks hipsters on networks: how a minority group of individuals can lead to an antiestablishment majority drift of spectrally stable shifted states on star graphs maximizing the spread of influence through a social network second look at the spread of epidemics on networks centrality prediction in dynamic human contact networks mathematics of epidemics on networks multilayer networks authoritative sources in a hyperlinked environment dynamics of multifrequency oscillator communities finite-size-induced transitions to synchrony in oscillator ensembles with nonlinear global coupling pattern formation in multiplex networks quantifying force networks in particulate systems quantum graphs: i. 
some basic structures fitting in and breaking up: a nonlinear version of coevolving voter models from networks to optimal higher-order models of complex systems hawkes processes complex spreading phenomena in social systems: influence and contagion in real-world social networks wave mitigation in ordered networks of granular chains centrality metric for dynamic networks control principles of complex networks resynchronization of circadian oscillators and the east-west asymmetry of jet-lag transitivity reinforcement in the coevolving voter model ground state on the dumbbell graph random walks and diffusion on networks the nonlinear heat equation on dense graphs and graph limits multi-stage complex contagions opinion formation and distribution in a bounded-confidence model on various networks network motifs: simple building blocks of complex networks portrait of political polarization six susceptible-infected-susceptible models on scale-free networks a network-based dynamical ranking system for competitive sports community structure in time-dependent, multiscale, and multiplex networks multi-body interactions and non-linear consensus dynamics on networked systems scientific collaboration networks. i. network construction and fundamental results network structure from rich but noisy data collective phenomena emerging from the interactions between dynamical processes in multiplex networks nonlinear schrödinger equation on graphs: recent results and open problems complex contagions with timers a theory of the critical mass. i. interdependence, group heterogeneity, and the production of collective action interaction mechanisms quantified from dynamical features of frog choruses a roadmap for the computation of persistent homology chimera states: coexistence of coherence and incoherence in networks of coupled oscillators network analysis of particles and grains epidemic processes in complex networks topological analysis of data master stability functions for synchronized coupled systems cluster synchronization and isolated desynchronization in complex networks with symmetries bayesian stochastic blockmodeling modelling sequences and temporal networks with dynamic community structures activity driven modeling of time varying networks simplicial activity driven model the multilayer nature of ecological networks network analysis and modelling: special issue of dynamical systems on networks: a tutorial the role of network analysis in industrial and applied mathematics a network analysis of committees in the u.s. 
house of representatives communities in networks spectral centrality measures in temporal networks reality inspired voter models: a mini-review the kuramoto model in complex networks core-periphery structure in networks (revisited) dynamic pagerank using evolving teleportation memory in network flows and its effects on spreading dynamics and community detection recent advances in percolation theory and its applications dynamics of dirac solitons in networks simplicial complexes and complex systems comparative analysis of two discretizations of ricci curvature for complex networks dynamics of interacting diseases null models for community detection in spatially embedded, temporal networks modeling complex systems with adaptive networks social diffusion and global drift on networks the effect of a prudent adaptive behaviour on disease transmission random walks on simplicial complexes and the normalized hodge 1-laplacian multiopinion coevolving voter model with infinitely many phase transitions the architecture of complexity the importance of the whole: topological data analysis for the network neuroscientist abrupt desynchronization and extensive multistability in globally coupled oscillator simplexes complex macroscopic behavior in systems of phase oscillators with adaptive coupling sine-gordon solitons in networks: scattering and transmission at vertices topological data analysis of continuum percolation with disks from kuramoto to crawford: exploring the onset of synchronization in populations of coupled oscillators motor primitives in space and time via targeted gain modulation in recurrent cortical networks multistable attractors in a network of phase oscillators with threebody interactions analysing information flows and key mediators through temporal centrality metrics topological data analysis of contagion maps for examining spreading processes on networks eigenvector-based centrality measures for temporal networks supracentrality analysis of temporal networks with directed interlayer coupling tunable eigenvector-based centralities for multiplex and temporal networks topological data analysis: one applied mathematicianõs heartwarming story of struggle, triumph, and ultimately, more struggle topological data analysis of biological aggregation models partitioning signed networks on analytical approaches to epidemics on networks using persistent homology to quantify a diurnal cycle in hurricane felix resolution limits for detecting community changes in multilayer networks analytical computation of the epidemic threshold on temporal networks epidemic threshold in continuoustime evolving networks network models of the diffusion of innovations graph spectra for complex networks non-markovian infection spread dramatically alters the susceptible-infected-susceptible epidemic threshold in networks temporal gillespie algorithm: fast simulation of contagion processes on time-varying networks forecasting elections using compartmental models of infection ranking scientific publications using a model of network traffic coupled disease-behavior dynamics on complex networks: a review social network analysis: methods and applications a simple model of global cascades on random networks braess's paradox in oscillator networks, desynchronization and power outage inferring symbolic dynamics of chaotic flows from persistence continuous-time discrete-distribution theory for activitydriven networks an analytical framework for the study of epidemic models on activity driven networks modeling memory effects 
in activity-driven networks models of continuous-time networks with tie decay, diffusion, and convection key: cord-303197-hpbh4o77 authors: humboldt-dachroeden, sarah; rubin, olivier; frid-nielsen, snorre sylvester title: the state of one health research across disciplines and sectors – a bibliometric analysis date: 2020-06-06 journal: one health doi: 10.1016/j.onehlt.2020.100146 sha: doc_id: 303197 cord_uid: hpbh4o77 there is a growing interest in one health, reflected by the rising number of publications relating to one health literature, but also through zoonotic disease outbreaks becoming more frequent, such as ebola, zika virus and covid-19. this paper uses bibliometric analysis to explore the state of one health in academic literature, to visualise the characteristics and trends within the field through a network analysis of citation patterns and bibliographic links. the analysis focuses on publication trends, co-citation network of scientific journals, co-citation network of authors, and co-occurrence of keywords. the bibliometric analysis showed an increasing interest for one health in academic research. however, it revealed some thematic and disciplinary shortcomings, in particular with respect to the inclusion of environmental themes and social science insights pertaining to the implementation of one health policies. the analysis indicated that there is a need for more applicable approaches to strengthen intersectoral collaboration and knowledge sharing. silos between the disciplines of human medicine, veterinary medicine and environment still persist. engaging researchers with different expertise and disciplinary backgrounds will facilitate a more comprehensive perspective where the human-animal-environment interface is not researched as separate entities but as a coherent whole. further, journals dedicated to one health or interdisciplinary research provide scholars the possibility to publish multifaceted research. these journals are uniquely positioned to bridge between fields, strengthen interdisciplinary research and create room for social science approaches alongside of medical and natural sciences. one health joins the three interdependent sectors -animal health, human health, and ecosystems -with the goal to holistically address health issues such as zoonotic diseases or antimicrobial resistance (1) . in 2010, the food and agriculture organization (fao), the world organisation for animal health (oie) and the world health organization (who) engaged in a tripartite collaboration to ensure a multisectoral perspective to effectively manage and coordinate a one health approach. one health is defined as "an approach to address a health threat at the human-animal-environment interface based on collaboration, communication, and coordination across all relevant sectors and disciplines, with the ultimate goal of achieving optimal health outcomes for both people and animals; a one health approach is applicable at the subnational, national, regional, and global level" (2). this paper uses bibliometric analysis to explore the state of one health in academic literature, to visualise between the disciplines of human medicine, veterinary medicine and environment still persist -even in the face of the one health approach. the data for the bibliometric analysis is drawn from the web of science (wos). 
the wos is arguably one of the largest academic multidisciplinary databases, and it contains more than 66.9 million contributions from the natural sciences (science citation index expanded), social sciences (social sciences citation index) and humanities (arts & humanities citation index) (7). the broad scope of the database aligns well with the one health concept's cross-disciplinary approach. the analytical period is demarcated by the first one health publication included in the wos in 1998, and it ends in december 2019. the search term "one health" was applied to compile the first crude sample of articles that mention the concept of one health in their title, keywords or abstract. the basic assumption is that articles conducting one health research (whether conceptually, methodologically and/or empirically) would as a minimum have mentioned "one health" in the abstract, title or keywords. the literature search resulted in 2,004 english articles; see the flow chart in figure 1. however, this sample also included a sizable group of articles that just made use of "one health" in a sentence such as "one health district" or "one health professional". to restrict the sample to contributions only pertaining to the concept of one health, two subsequent screening measures were taken. first, 587 contributions which used one health as a keyword were automatically included in the sample. the bibliometric analysis was conducted with the bibliometrix package for the r programming language. the analysis focuses on: 1) publication trends, 2) co-citation network of scientific journals, 3) co-citation network of authors, and 4) co-occurrence of keywords. the publication trend is outlined using both absolute and relative numbers of one health publications. the co-citation networks of scientific journals provide information on the disciplinary structure of the field of one health, while the co-citation network of authors disaggregates further to the citation patterns of individual authors. the co-citation network of journals shows the relation between the publications within the outlets. for example, when a publication within journal a cites publications within journals b and c, it indicates that journals b and c share similar characteristics. the more journals citing both b and c, the stronger their similarity. to minimise popularity bias among frequently cited journals, co-citation patterns are normalised through the jaccard index. the jaccard index measures the similarity between journals b and c as the intersection of journals citing both b and c, divided by the total number of journals that cited b and c individually (8, 9). like the co-citation network of journals, the co-citation network for authors measures the similarity of authors in terms of how often they are cited by other authors, also normalised through the jaccard index. when author a cites both authors b and c, it signifies that b and c share similar characteristics. the study also investigates the co-occurrence of keywords to identify the content of one health publications. here, co-occurrence measures the similarity of keywords based on the number of times they occur together in different articles. it provides information on the main topical keywords linked to one health and can thus be used to gauge the knowledge structure of the field. here, the articles' keywords plus are the unit of analysis. wos automatically generates keywords plus based on the words or phrases appearing most frequently in an article's bibliography.
keywords plus are more fruitful for bibliometric analyses than author keywords, as they convey more general themes, methods and research techniques (10) . disciplinary clusters within the networks, illustrated by the colours in figures 3 to 5, are identified empirically applying the louvain clustering algorithm. louvain is a hierarchical clustering algorithm that attempts to maximise modularity, measured by the density of edges between nodes within communities and sparsity between nodes across communities. the nodes represent the aggregated citations of the academic journals and the edges, the line between two nodes, display the relation between the journals. the shorter the path between the nodes the stronger their relation. node size indicates "betweenness centrality" in the network, which is a measure of the number of shortest paths passing through each node (11) . betweenness centrality estimates the importance of a node on the flow of information through the network, based on the assumption that information generally flows through the most direct communicative pathways. for example, the one health publications in our sample relating to ebola have more than tripled after 2016. one might, therefore, expect to observe a similar spike in one health publications that study the covid-19 outbreak in 2020. while the use of the one health concept has increased, the co-citation network shows that the increase is mostly driven by the sectors of human and veterinary medicine, evidenced by their centrality in terms of information flows within the network. relations to other clusters. the area of parasitology is also mostly co-cited in its own area. here, most aggregated citations are rooted in the journal plos neglected tropical diseases. in these last two clusters, microbiology and parasitology, the journals cover topics mainly exclusively pertaining to medical or biological sciences. the most active one health scholars, publishing more than ten articles over the last 12 years, are from the field of veterinary research. of the top six researchers, five have a veterinary background (jakob zinsstag, jonathan rushton, esther schelling, barbara häsler and bassirou bonfoh). while degeling is the only researcher of the top six with an education in the social sciences, the remaining five veterinarian scholars do touch upon social science themes within their publications, relating to systemic or conceptual approaches, sociopolitical dimensions and knowledge integration (e.g. zinsstag and schelling (14) ; häsler (15) ; rushton (16) . five of the six most productive researchers work in europe and three of them are associated with the same institute, namely the swiss tropical and public health institute (zinsstag, schelling and bonfoh) (17) .there has been some cooperation across institutes and department as evidence by the coauthorships of zinsstag and häsler, häsler and rushton, rushton and zinsstag (e.g. (18) (19) (20) ). figure 4 illustrates the co-citation network of authors. four clusters of authors emerged in the network (green: zoonoses and epidemiology; blue: biodiversity and ecohealth; purple: animal health, public health; red: policy-related disciplines). academic scholars are mainly found in the green, blue and purple clusters, whereas the authors of the red clusters are mainly represented by organisations such as the who, cdc, perspectives from the environmental and ecological sector have been neglected within one health research (24, 25) . 
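the following sketch mimics, on toy data, the network-construction and clustering steps described above: it normalises co-citation counts with the jaccard index, builds a weighted journal network, detects clusters with the louvain algorithm, and computes betweenness centrality. the journal names and citation sets are invented, and louvain clustering is assumed to be available in the installed networkx version (it ships with recent releases); this is an illustration, not the bibliometrix workflow used by the authors.

```python
import networkx as nx
from networkx.algorithms import community

def jaccard(citing_a, citing_b):
    """jaccard index: journals citing both a and b, divided by the journals
    citing at least one of them."""
    a, b = set(citing_a), set(citing_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# toy data: for each cited journal, the set of journals that cite it.
cited_by = {
    "one health": {"lancet", "plos one", "vet record", "ecohealth", "plos ntd"},
    "vet record": {"lancet", "plos one", "vet record", "prev vet med"},
    "prev vet med": {"vet record", "prev vet med", "plos one"},
    "plos ntd": {"plos ntd", "plos one", "ecohealth"},
    "ecohealth": {"ecohealth", "plos one", "plos ntd"},
}

# weighted co-citation network with jaccard-normalised edge weights.
G = nx.Graph()
journals = list(cited_by)
for i, j1 in enumerate(journals):
    for j2 in journals[i + 1:]:
        w = jaccard(cited_by[j1], cited_by[j2])
        if w > 0:
            G.add_edge(j1, j2, weight=w)

# louvain clustering of the weighted network.
clusters = community.louvain_communities(G, weight="weight", seed=0)

# betweenness centrality (unweighted here) as a simple proxy for how much a
# journal sits on the shortest communicative pathways of the network.
btw = nx.betweenness_centrality(G)

for c in clusters:
    print(sorted(c))
print("highest betweenness:", max(btw, key=btw.get))
```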
further, the co-occurrence network of keywords illustrated that research into one health is mainly undertaken in the medical science cluster with the most connections to the other clusters. this indicates that a majority of articles is constructed around medical themes, and that there is most interdisciplinary research across areas in the medical science cluster. however, few keywords indicate research into administrative or anthropological approaches to examine the management of one health. making these thematic perspectives more central to the network could strengthen the one health approach regarding implementation and institutionalisation. one health initiatives and projects that specifically promote mixed methods studies and engage researchers with various expertise could facilitate implementing comprehensive initiatives. here, a gap in the one health research could be addressed, facilitating not only quantitative but a qualitative research to comprehensively approach the multifaceted issues implied in one health topics (26) . there is no shortage of existing outlets, frameworks and approaches that promote interdisciplinary research. already in 2008, a strategic framework was developed by the tripartite collaborators, as well as the un system influenza coordination, unicef and the world bank, outlining approaches for collaboration, to prevent crises, to govern disease control and surveillance programmes (27) . rüegg et al. developed a handbook to adapt, improve and optimise one health activities could also provide some guidance on how to strengthen future one health activities and evaluate already ongoing one health initiatives (18) . coker et al. produced a conceptual framework for one health, which can be used to develop a strong research the fao-oie-who collaboration -sharing responsibilities and coordinating global activities to address health risks at the animal-human-ecosystems interfaces -a tripartite concept note applied informetrics for digital libraries: an overview of foundations, problems and current approaches transdisciplinary and social-ecological health frameworks-novel approaches to emerging parasitic and vector-borne diseases posthumanist critique and human health: how nonhumans (could) figure in public health research citation index is not critically important to veterinary pathology on the normalization and visualization of author co-citation data: salton's cosine versus the jaccard index similarity measures in scientometric research: the jaccard index versus salton's cosine formula. information processing & management comparing keywords plus of wos and author keywords: a case study of patient adherence research fast unfolding of communities in large networks ebola outbreak distribution in west africa ebola virus disease) reporting and surveillance -zika virus from "one medicine" to "one health" and systemic approaches to health and well-being knowledge integration in one health policy formulation, implementation and evaluation towards a conceptual framework to support one-health research for policy on emerging zoonoses swiss tph -swiss tropical and public health institute integrated approaches to health: a handbook for the evaluation of one health a review of the metrics for one health benefits a blueprint to evaluate one health. front public health implementing a one health approach to emerging infectious disease: reflections on the socio-political, ethical and legal dimensions overcoming challenges for designing and a framework for one health research. 
one health the growth and strategic functioning of one health networks: a systematic analysis. the lancet planetary health qualitative research for one health: from methodological principles to impactful applications. front vet sci contributing to one world, one health* -a strategic framework for reducing risks of infectious diseases at the animal -human-ecosystems interface birds of a feather: homophily in social networks homophily in co-autorship networks key: cord-354783-2iqjjema authors: wang, wei; ma, yuanhui; wu, tao; dai, yang; chen, xingshu; braunstein, lidia a. title: containing misinformation spreading in temporal social networks date: 2019-04-24 journal: chaos doi: 10.1063/1.5114853 sha: doc_id: 354783 cord_uid: 2iqjjema many researchers from a variety of fields including computer science, network science and mathematics have focused on how to contain the outbreaks of internet misinformation that threaten social systems and undermine societal health. most research on this topic treats the connections among individuals as static, but these connections change in time, and thus social networks are also temporal networks. currently there is no theoretical approach to the problem of containing misinformation outbreaks in temporal networks. we thus propose a misinformation spreading model for temporal networks and describe it using a new theoretical approach. we propose a heuristic-containing (hc) strategy based on optimizing final outbreak size that outperforms simplified strategies such as those that are random-containing (rc) and targeted-containing (tc). we verify the effectiveness of our hc strategy on both artificial and real-world networks by performing extensive numerical simulations and theoretical analyses. we find that the hc strategy greatly increases the outbreak threshold and decreases the final outbreak threshold. many communications platforms, e.g., twitter, facebook, email, whatsapp, and mobile phones, allow numerous ways of sharing information [1] [2] [3] [4] [5] [6] . one task for researchers is developing ways to distinguish between true and false information, i.e., between "news" and "fake news" [7] . this task is important because access to true information is essential in the process of intelligent decisionmaking [8] [9] [10] . for example, when the severe acute respiratory syndrome (sars) spread across guangzhou, china in 2003, the chinese southern weekly published a newspaper article entitled "there is a fatal flu in guangzhou." this information was forwarded over 126 million times by tv news and in other newspapers [11, 12] . individuals receiving this true information could adopt simple, effective protective measures against being infected (e.g., by staying at home, washing hands, or wearing masks). misinformation, on the other hand, encourages irrational behavior and reckless decision-making, and its spread can undermine societal well-being and sway the outcome of elections [13] [14] [15] [16] . bovet and makse [17] analyzed 171 million tweets sent during the five months prior to the 2016 us presidential election and found that misinformation strongly affected the outcome of that election. to contain the spread of misinformation we must understand the dynamic information spreading mechanisms that facilitate it [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] . vosoughi et al. [1] examined true and fake information on twitter from 2006 to 2017 and found that misinformation spreads more quickly than true information. 
using the spreading mechanisms common in real-data analysis, researchers have proposed several mathematical models to describe the spreading dynamics of true and fake information [29] [30] [31] [32] [33] [34] . moreno et al. [29] developed mean-field equations to describe the spread of classical misinformation on static scale-free networks that enables a theoretical study not requiring extensive numerical simulations. borge-holthoefer and moreno [35] found that although there are no influential spreaders in the classical misinformation model presented in ref. [29] , nodes with high k-cores and ranking values are more likely to be the influential spreaders of true information and also of infectious diseases [36] [37] [38] [39] [40] . when we include the burst behavior of individuals in the misinformation model, hubs emerge as influential nodes [41] . using real-world data, researchers found that social networks evolve with time, and thus evolving temporal networks more accurately represent the topology of real-world networks than static networks [42] [43] [44] [45] [46] [47] [48] [49] [50] [51] [52] . researchers have found that the temporal nature of networks strongly affect their spreading dynamics. perra et al. [53] found that in susceptible-infected-susceptible (sis) epidemic spreading a temporal network behavior suppresses the spreading more effectively than a static integrated network. researchers have also found that the sis and susceptible-infected-recovered (sir) models on temporal networks exhibit the same outbreak threshold [54] [55] [56] . nadini et al. [49] found that tightly connected clusters in temporal networks inhibit sir processes, but accelerate sis spreading. pozzana et al. found that when node attractiveness-its role as a preferential target for interactions-in temporal networks is heterogeneous, the contagion process is altered [57] . karsai et al. [58] found that strong ties between individuals strongly inhibit the classical spreading dynamics of misinformation. several strategies for containing the spread of misinformation in temporal networks have been proposed [59] [60] [61] [62] . liu et al. [60] examined epidemic spreading on activity driven temporal networks and developed mean-field based theoretical approaches for three different control strategies, i.e., random, targeted, and egocentric. the egocentric strategy is most effective. it immunizes a randomly selected neighbor of a node in the observation window. other effective approaches using extensive numerical simulations have been proposed [63] [64] [65] . for example, holme and liljeros [63] take into consideration the time variation of nodes and edges and propose a strategy for containing the outbreak of an epidemic based on the birth and death of links. because there is still no theoretical approach to containing the spread of misinformation in temporal networks, we here systematically examine its spread in activity-driven networks. the rest of the paper is organized as follows. section ii describes the misinformation spreading dynamics on temporal networks and develops a theory to describe the spreading dynamics. section iv proposes three containment strategies. section v describes the results of our extensive numerical stochastic simulations, which show that our suggested theory agrees with the numerical simulations. section iv presents our conclusions. we here introduce our model for the spreading dynamics of misinformation in temporal networks. 
the widely used approaches mathematically describing temporal networks tend to be either eventbased or snapshot representations [47] . the event-based representation approach describes de-scribes temporal networks using ordered events {u i , v i , t i , ∆t i ; i = 1, 2, · · · }, where node u i and v i are connected at time t i in the time period ∆t i . the snapshot approach describes temporal networks using a discrete sequence of static networks g = {g(1), g(2), · · · , g(t max )}, where g(t) is the snapshot network at time t, and t max is the number of snapshots of the temporal network. each snapshot network g(t), contains n nodes, where n is fixed, and m t edges. thus the average temporal degree of snapshot network g(t) is k t = 2m t /n. using the adjacency matrix, we here adopt the snapshot approach to describe temporal networks. as the meaning of the adjacency matrix in static networks, a uv (t) = 1 when nodes u and v are connected at time t, for undirected temporal networks. the average degree of node u in the temporal network g is knowing the adjacency matrix a we obtain the eigenvalues of a. here λ 1 (a) λ 2 (a), · · · , and λ n (a) are the eigenvalues of a in decreasing order. the spectral radius is thus λ 1 (a), which quantifies the threshold outbreak of epidemics in temporal networks. we use the classical activity-driven network [53] to model a temporal network with n nodes. we build the activity-driven network using the following steps. 1. we assign to each node i an activity potential value x i according to a given probability density distribution f (x). the activity of node i is a i = ηx i , where η is a rescaling factor, i.e., at each time step, node i is active with probability a i . the higher the value of η, the higher the average degree of the temporal network. the higher the value of a i , the higher the degree of node i. we assume that f (x) follows a power-law function, i.e., f (x) ∼ x −γ , where γ is the potential exponent. we make this assumption in order to generate a heterogeneous degree distribution temporal network. after further calculations we find and ǫ is the minimum value of the activity potential x i . each active node has m edges, and each edge randomly links to a network node. an edge connects the same pair of nodes with probability m/n. note that in the thermodynamic limit of a sparse temporal network, there are no multiple edges between nodes and non-local loops 2. at the end of time step t, we delete all edges in network g(t). (2) and (3) until t max in order to generate temporal network g. we use an ignorant-spreader-refractory model to describe the spreading dynamics of misinformation [29] . here nodes are classified as either ignorant, spreader, or refractory. ignorant nodes are unaware that the information is false but are susceptible to adopting it. spreader nodes are aware that the information is false and are willing to transmit it to ignorant nodes. refractory nodes receive the misinformation but do not spread it. the misinformation spreading dynamics on temporal networks evolves as follows. we first randomly select a small fraction ρ 0 of spreader nodes to be seeds in network g(t 0 ), where 1 ≤ t 0 ≤ t max . we designate the remaining 1 − ρ 0 nodes to be ignorant. at time step t each spreader i transmits with probability λ the misinformation to ignorant neighbors in network g(t). 
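a minimal sketch of the activity-driven snapshot construction described above, assuming power-law activity potentials sampled by inverse transform on [ε, 1]; activation probabilities are clipped to 1, and the parameter names follow the text (n, t_max, η, m, γ, ε). this is an illustration of the construction, not the authors' code.

```python
import numpy as np

def activity_driven_snapshots(n=1000, t_max=20, eta=10, m=50, gamma=2.1,
                              eps=1e-3, rng=None):
    """generate a sequence of snapshot adjacency matrices for an
    activity-driven temporal network (sketch of the construction above)."""
    rng = np.random.default_rng() if rng is None else rng
    # activity potentials x_i drawn from f(x) ~ x^(-gamma) on [eps, 1]
    u = rng.random(n)
    x = (eps**(1 - gamma) + u * (1 - eps**(1 - gamma)))**(1 / (1 - gamma))
    a = np.clip(eta * x, 0.0, 1.0)      # per-step activation probability (clipped)
    snapshots = []
    for _ in range(t_max):
        adj = np.zeros((n, n), dtype=np.int8)
        active = np.where(rng.random(n) < a)[0]
        for i in active:
            # each active node draws m edges to randomly chosen nodes
            targets = rng.choice(n, size=m, replace=False)
            for j in targets:
                if j != i:
                    adj[i, j] = adj[j, i] = 1
        snapshots.append(adj)           # all edges are discarded at the end of the step
    return snapshots
```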
in addition, each spreader i becomes a refractory node with a probability where µ is the intrinsic recovery probability, and n(t) is the number of nodes in the spreader and refractory states of node i. the dynamics evolve until there are no spreader nodes. note that when t reaches t max , the misinformation spreads on g(1) in the next time step. figure 1 shows misinformation spreading on a temporal network. we here develop a generalized discrete markovian approach to describe the misinformation spreading dynamics on temporal networks [54] . we denote i i (t), s i (t), and r i (t) to be the fraction of nodes in the ignorant, spreader, and refractory states, respectively, at time t. because a node can only be in one of the three states, an ignorant node i, becomes a spreader with probability p i (t) at time t, where here is the probability that node i has not received any misinformation from neighbors at time t in g(t). at time step t + 1, node i remains ignorant with a probability the decrease of i i (t) is equal to the increase of s i (t), because an ignorant node will become a spreader when it obtains the information from neighbors in state s. in addition, spreader i becomes . thus the evolution of node i in the spreader state is the evolution of node i in the refractory state is using eqs. (3)-(5), we obtain the fraction of nodes at time t in each state, where h ∈ {i, s, r}. note that in the steady state there are no spreader nodes, only refractory and stifler nodes. the fraction of nodes that receive the misinformation in the final state is r(∞) = r. here r is the order parameter of a continuous phase transition with λ. if the misinformation transmission probability λ is larger than the critical threshold, i.e., λ > λ c , the size of the global misinformation is of the order of the system size. otherwise, the global misinformation r = 0 for λ ≤ λ c is in the thermodynamic limit. at shorter times, a vanishingly small fraction of nodes receive the misinformation, i.e., s i (t) ≈ 0 and r i (t) = 1 − s i (t) − i i (t) ≈ 0. the recovery probability in eq. (1) of a spreader node i is µ i (t) ≈ µ, since node i must connect to a spreader that supplies the misinformation, and there is a low probability that it will connect to other spreader or refractory neighbors. thus eq. (4) can be rewritten where δ ij is the kronecker delta function, i.e., δ ij = 1 if i = j, and zero otherwise. we define the transmission tensor m to be we mask the tensorial origin of the space through the map inserting eq. (9) into (7), we have where s(τ ) is the probability that a node is in the spreader state at each time step t during here s(τ ) increases exponentially if the largest eigenvalue of m, denoted λ 1 , is larger than 1. thus the misinformation spreads, and the threshold condition is [54] λ 1 (m) = 1, in an unweighted undirected network g, the largest eigenvalue λ 1 (m) of m is where p = tmax t=1 (1 − µ + λa(t)). because misinformation spreading on social networks can induce social instability, threaten political security, and endanger the economy, we propose three strategies-random, targeted, and heuristic-for containing the spread of misinformation in temporal networks using a given fraction of containing nodes f . we first immunize a fraction of f nodes using a static containment strategy. the misinformation then spreads on the residual temporal network. if node i is "immunized," it cannot be infected (transmit) by the misinformation received from neighbors (the misinformation to neighbors). 
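a monte-carlo sketch of the ignorant-spreader-refractory dynamics just described. the exact recovery expression of eq. (1) is not recoverable from the extracted text, so the rule below (one recovery trial per spreader or refractory neighbour, with at least one trial so the process terminates) is an assumption; snapshots wrap around after t_max as stated in the text.

```python
import numpy as np

def simulate_spreading(snapshots, lam, mu, rho0=0.01, rng=None):
    """one stochastic run of the ignorant-spreader-refractory model on a
    snapshot sequence; returns the final outbreak size r."""
    rng = np.random.default_rng() if rng is None else rng
    n = snapshots[0].shape[0]
    IGN, SPR, REF = 0, 1, 2
    state = np.full(n, IGN)
    seeds = rng.choice(n, size=max(1, int(rho0 * n)), replace=False)
    state[seeds] = SPR
    t = 0
    while (state == SPR).any():
        adj = snapshots[t % len(snapshots)]      # wrap around after t_max
        new_state = state.copy()
        for i in np.where(state == SPR)[0]:
            nbrs = np.where(adj[i] == 1)[0]
            # transmission: each ignorant neighbour is informed with prob lam
            for j in nbrs:
                if state[j] == IGN and rng.random() < lam:
                    new_state[j] = SPR
            # recovery driven by contacts with spreader/refractory neighbours
            # (assumed form; at least one trial so the run always terminates)
            n_sr = int(np.sum((state[nbrs] == SPR) | (state[nbrs] == REF)))
            if rng.random() < 1 - (1 - mu) ** max(n_sr, 1):
                new_state[i] = REF
        state = new_state
        t += 1
    return float(np.mean(state == REF))          # fraction that received the misinformation
```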
mathematically the immunized node set is v, and the number of immunized nodes equals the number of elements in v, i.e., |v| = ⌈f n⌉. we set v i = 1 if node i is immunized, otherwise v i = 0. after immunization, eqs. (2)-(4) can be written and respectively. in an effective containing strategy the misinformation spreading dynamics is suppressed for a given fixed fraction f of immunized nodes, i.e., the objective function is where the constraint conditions are eqs. • strategy i: random containment (rc). the most used strategy for containing the spread of misinformation is randomly immunizing a fraction of f nodes [66] . • strategy ii: targeted containment (tc). another intuitive way is to immunize the nodes with highest average degree k in the temporal network g. specifically, we first compute the average degree of each node i as k i = 1 tmax tmax t=1 n j=1 a ij (t). we then rank all nodes in descending order in the vector w according to the average degree of each node. finally we immunize the top ⌈f n⌉ nodes of w. • strategy iii: heuristic containment (hc). using the tc, we apply an hc strategy. because tc is much better than rc, we perform the hc strategy by replacing the immunization nodes. when the repeat time is very large, the immunized nodes reach an optimal value. (i): we initialize a vector w according to the descending order of the average degree of nodes. the first ⌈f n⌉ nodes of w are immunized, the final misinformation outbreak size is r o , and w 0 is denoted a set. the remaining nodes w 1 = w\w 0 are denoted a set. (ii): we randomly select nodes in w 0 and w 1 , denoted v 0 and v 1 , respectively. we switch their order in vector w and denote the new vector w n . we immunize the first ⌈f n⌉ nodes w n and compute the final misinformation outbreak size r n . (iii): when r n > r o , we update vector w, i.e., w → w n . otherwise, there is no change. (iv): we repeat steps (ii) and (iii) until 1 ts ts i=1 |w −w n | < ǫ ′ . in the simulations we set t s = 100 and ǫ ′ = n −1 . for the activity-driven network, we set n = 10 3 , t max = 20, η = 10, m = 50, γ = 2.1, and ǫ = 10 −3 . for real-world networks, we use the data collected by the sociopatterns group [? ] , which records the interactions among the participants at a conference. the time resolution of the signal is 20 sec. because the temporal network is sparse, it is difficult for the information to spread in the original network. we thus aggregate the temporal network using four windows, w = 30min, 60min, 120min, and 240min. we average all simulation results more than 1000 times. we use variability to locate the numerical network-sized dependent outbreak threshold [67, 68] , where r is the relative size of misinformation spreading at the steady state. at the outbreak threshold λ c , χ exhibits a peak. when λ ≤ λ c , the global misinformation does not break out, but when λ > λ c the global misinformation does break out. figure 2 shows the misinformation spreading on activity-driven networks. note that the final misinformation outbreak size r increases with λ. the larger the recovery probability µ, the lower the values of r because spreader nodes are less likely to transmit the misinformation to stifler neighbors [see fig. 2(a) ]. note that our theoretical and numerical predictions of the final misinformation outbreak size r agree. figure 2(b) shows the variability χ as a function of λ. there is a peak at the misinformation outbreak threshold λ c . figure 3 shows λ c versus µ in which λ c increases linearly with µ. 
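a sketch of the hc procedure described above. `outbreak_size` is a hypothetical callable assumed to return the final fraction r for a given immunized set (for example, by averaging runs of the simulate_spreading sketch with the immunized nodes held fixed in their initial state); a swap is kept when the simulated outbreak shrinks, which follows the stated objective of suppressing spreading, and a fixed iteration budget replaces the running-average stopping rule for brevity.

```python
import numpy as np

def average_degree(snapshots):
    """time-averaged degree of each node over all snapshots."""
    return np.mean([adj.sum(axis=1) for adj in snapshots], axis=0)

def heuristic_containment(snapshots, f, lam, mu, outbreak_size,
                          n_iter=500, rng=None):
    """hc sketch: start from the targeted (highest average degree) set,
    then repeatedly try single swaps and keep those that reduce r."""
    rng = np.random.default_rng() if rng is None else rng
    n = snapshots[0].shape[0]
    k_imm = int(np.ceil(f * n))
    order = np.argsort(-average_degree(snapshots))   # tc initialisation
    w0, w1 = list(order[:k_imm]), list(order[k_imm:])
    best_r = outbreak_size(snapshots, set(w0), lam, mu)
    for _ in range(n_iter):
        i0, i1 = rng.integers(len(w0)), rng.integers(len(w1))
        cand = set(w0)
        cand.discard(w0[i0])
        cand.add(w1[i1])
        r_new = outbreak_size(snapshots, cand, lam, mu)
        if r_new < best_r:                           # keep only improving swaps
            w0[i0], w1[i1] = w1[i1], w0[i0]
            best_r = r_new
    return set(w0), best_r
```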
the theoretical predictions of λ c obtained from eq. (11) agree with the stochastic simulations. figure 4 shows misinformation spreading in real-world temporal networks. as in fig. 2 , r increases with λ and decreases with µ. as in sir epidemic spreading [59] , the effective outbreak threshold (λ/µ) c is a constant value. in addition, when the value aggregating window w is small, there are fewer opportunities for spreaders to transmit the misinformation to stifler neighbors, thus the misinformation does not break out globally, i.e., there are smaller values of r for smaller w. once again our theoretical results agree with the numerical simulations. we next examine the performances of our proposed strategies for mitigating misinformation spreading on artificial and real-world temporal networks. figure 5 shows r versus λ for different values of the fraction of containing nodes f . note that r decreases with f because no more nodes receive the misinformation. note also that the tc strategy performs much better than the rc strategy because the higher degree nodes k are contained, and spreaders can no longer transmit the misinformation to stiflers. thus when we immunize the same fraction of containing nodes, the values of r for the tc strategy are smaller than those for the rc strategy. for example, when f = 0.5, using the rc strategy ≈ 30% of the nodes are informed by the misinformation, but using the tc strategy none are informed by the misinformation. in addition, the outbreak threshold λ c to contain the misinformation the fraction of nodes informed by the misinformation is finite, i.e., r ≈ 0.25. our theoretical results agree with the numerical simulation results. an effective containing strategy with a fraction of immunized node f and a small outbreak threshold λ c greatly decreases the final misinformation outbreak size r. figure 7 shows the effective outbreak threshold (λ/µ) c versus f on activity-driven networks for the rc, tc, and hc strategies. here (λ/µ) c increases with f , and (λ/µ) c is the largest using the hc strategy when f is fixed. when f is sufficiently large, no λ value can induce a global misinformation outbreak. we denote f c the critical probability that at least a fraction of f c nodes must be containing to halt misinformation speading in temporal networks. we find that the values of f c for the hc containing strategy are the smallest of all containing strategies. the f c value for the rc strategy is 5 times the f c value for the tc strategy, and the f c value for the tc strategy is 2.5 times the f c value for the lines and symbols are the theoretical and numerical predictions of (λ/µ) c , respectively. the vertical line represents the critical probability f c . hc strategy. thus the hc strategy is the most effective. we finally examine real-world networks to varify the effectiveness of our proposed three strategies. figure 8 compares the performances of the tc and hc strategies by examining r versus f for given values of λ. as in fig. 6 , the hc strategy most effectively contains the misinformation spreading on temporal networks irrespective of the values of λ. in addition, fig. 9 shows that the effective outbreak threshold (λ/µ) c is the smallest when using the hc strategy. thus our theory accurately predicts the numerical simulation results. we have systematically examined the dynamics of misinformation spreading on temporal networks. 
we use activity driven networks to describe temporal networks, and use a discrete markothe lines and symbols are the theoretical and numerical predictions of r, respectively. vian chain to describe the spreading dynamics. we find that the global misinformation outbreak threshold correlates with the topology of temporal networks. using extensive numerical simulations, we find that our theoretical predictions agree with numerical predictions in both artificial and real-world networks. to contain misinformation spreading on temporal networks, we propose three strategies, random containing (rc), targeted containing (tc), and heuristic containing (hc) strategies. we perform numerical simulations and a theoretical analysis on both artificial and four real-world networks and find that the hc strategy outperforms the other two strategies, maximizes the outbreak threshold, and minimizes the final outbreak size. our proposed containing strategy expands our understanding of how to contain public sentiment and maintain social stability. this work was partially supported by the china postdoctoral science foundation (grant no. 2018m631073), and fundamental research funds for the central universities. lab thanks unmdp and con-icet weekly releases 2013 ieee 13th international conference on data mining complex spreading phenomena in social systems temporal networks a guidance to temporal networks temporal network epidemiology journal of physics: conference series key: cord-352049-68op3d8t authors: wang, xingyuan; zhao, tianfang; qin, xiaomeng title: model of epidemic control based on quarantine and message delivery date: 2016-09-15 journal: physica a doi: 10.1016/j.physa.2016.04.009 sha: doc_id: 352049 cord_uid: 68op3d8t the model provides two novel strategies for the preventive control of epidemic diseases. one approach is related to the different isolating rates in latent period and invasion period. experiments show that the increasing of isolating rates in invasion period, as long as over 0.5, contributes little to the preventing of epidemic; the improvement of isolation rate in latent period is key to control the disease spreading. another is a specific mechanism of message delivering and forwarding. information quality and information accumulating process are also considered there. macroscopically, diseases are easy to control as long as the immune messages reach a certain quality. individually, the accumulating messages bring people with certain immunity to the disease. also, the model is performed on the classic complex networks like scale-free network and small-world network, and location-based social networks. results show that the proposed measures demonstrate superior performance and significantly reduce the negative impact of epidemic disease. though the medical conditions are improved significantly, epidemics have never been away from human world, especially in developing countries. of growing concerns are adverse synergistic interactions between the emerging diseases and other infectious. poor sanitation and lack of medical knowledge lead to the wide spreading of disease in developing countries, such as ebola and mers (middle east respiratory syndrome coronavirus). besides, some of traditional contagions are resistant to drug treatments now, such as malaria, tuberculosis, and bacterial pneumonia. these kinds of diseases are defined as new emerging infectious disease (eid), which have increased in the past 20 years and will keep growing in the near future [1] . 
there are several remarkable characteristics in the eid: unpredictable, not preventable, irremediable, high mortality, rapid transmission and wide scope of influence. the expansion of disease spreading can lead to social panic at some level. it is hence imperative to study effective control strategies to prevent the disease diffusion. many of researchers have tried to explore the prevention measures of eid. as a practical method, the information diffusion on epidemic dynamic has attracted much attention in recent years. a path-breaking work in this field was taken by funk et al., who proposed an epidemiological model that considers the spread of awareness about the disease [2, 3] . lima et al. proposed that the propagation of disease can be reduced by the spreading of immune information, which make individuals resistant to disease and then work against the epidemic propagating [4] . a model of competing epidemic spreading over completely overlapping networks was proposed by karrer and newman, revealing a coexistence regime in which both types of spreading can infect a substantial fraction of the network [5] . wang and tang distinguish two types of disease spreading and proposed the dynamic model of asymmetrically interacting and disease spreading. their research focuses on three problems: the different network structures and information spreading dynamics; the asymmetric effects of one type of spreading dynamics on another; the timing of the two types of spreading [6] . moreover, several researches are also meaningful to us [7] [8] [9] [10] [11] [12] . trpevski et al. explored the rumors propagation in multiple networks [7] . zhang et al. proposed a model considering time delay and stochastic fluctuations [8] . ababou et al. investigated the spreading of periodic diseases and synchronization phenomena on exponential networks [9] . rakowski et al. put forward an individual based model to study the effects of influenza epidemic in poland. a simple transportation rule is established to mimic individuals' travels in dynamic route-changing schemes, allowing for the infection spread during a journey [10] . all the above efforts are worth approving but the specific ways of message spreading are ignored. this paper enriches the research in this area, and the proposed methods are shown to have a significant effect in epidemic prevention. the paper proceeds as follows. in section 2, we first describe the characteristics of epidemic, then the siqm models are developed, which contain the disease prevention measures based on quarantine and message delivery. section 3 gives the sensitivity analysis of the model on classical complex networks. in section 4, the common regularity of human mobility and experiment results on location-based social network are given. the paper is concluded in section 5. normally, the process of infectious disease can be divided into three stages: susceptible period, infection period, recovery period [13] . the infection period is further divided into latent period and invasion period. the former is characterized by the tiny indisposition; the latter is featured by onset of clinical signs and symptoms. moreover, according to the research of lessler et al., the latent period is essential to the investigation and control of infectious disease [14] . 
in this period, the infectious individuals have obtained the ability to spread diseases, yet without obvious symptom to arise attention, e.g., the latent period of influenza remains only one to three days, yet the epidemic can sweep through a city in less than six weeks. furthermore, according to the health experts sartwell's research, the period of latent period varies between individuals in the same regular fashion as do other biological characteristics [15] . the distribution of days seems to follow the ''normal'' curve. based on their theory, we investigate several common infectious diseases and collect their incubation period in table 1 and then get the average length of latent period which approximates to 7 days. to sum up, two assumptions are proposed here as the basic precondition of model: (1) the infection period of epidemic is divided into latent stage and invasion stage in epidemiology. because the disease in former stage is hard to observe and diagnose, we preset a relative low value to its isolating rate which ranges from 0.01 to 0.5. the rate in invasion period ranges from 0.5 to 1. (2) to simplify the model, the average length of latent period is uniformly set to 7 days, which is important to the experiments in section 4. the value is deduced by results shown in table 1 . individual status in the model is divided into four types: susceptible, infected, in quarantine and in messaging. the susceptible means the people who are vulnerable to the disease. the infected denotes people who have already been infected consciously or unconsciously. when in quarantine, it means the infected ones have been isolated and will not spread disease anymore. when in ''messaging'' status, the specific persons, denoting the ones who have just been isolated or their directly connected neighbors, would deliver specific messages to their neighbors. timely isolation and message-delivery are the main prevention measures in our model, especially the latter. just imagine that a new infectious disease outbreaks at a certain area and the messages about disease are open to public a few months later, it would be a large cost to control its scale, duration and damage. consider the epidemic sir (susceptible-infectious-recovery) model, which is firstly proposed by kermack and mckendrick [24] . now we improve the model by modifying the ''recovery'' status to ''quarantine'' status then constructing an enclosed and one-way evolution system. infection rate is defined as u and isolation rate as v. finally, the modified partial differential equations are given by: with nodes (termed as v ) representing individuals and edges (termed as e) representing connections between individuals. each node i can be in one of three statuses: susceptible, infectious and isolated. the status is described as a status vector, containing a single 1 in the position corresponding to the current state, and 0 everywhere else. let the probability function of each status is set to the state transition processes is shown in fig. 1 . status 0 is the initial state, and status 2 is the final state. when connected to the node in status 1, nodes in status 0 may turn into status 1 at a certain rate. nodes in status 1 could keep their status or convert into status 2 on either latent period or invasion period. evolution of the model is given by the formulas where multi realize can be interpreted as a realization of random realization for the probability distribution prob i (t), or a mapping from probability vector to status vector. 
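the quarantine-modified compartment equations themselves are not visible in the extracted text; the sketch below assumes the closed, one-way s → i → q system implied by the description (infection rate u, isolation rate v), i.e. ds/dt = -u·s·i, di/dt = u·s·i - v·i, dq/dt = v·i, and integrates it with scipy. the functional form is an assumption consistent with the surrounding definitions, not a reproduction of the authors' equations.

```python
import numpy as np
from scipy.integrate import solve_ivp

def siq_ode(t, y, u, v):
    """assumed compartment-level model: susceptible -> infected -> quarantined."""
    s, i, q = y
    return [-u * s * i, u * s * i - v * i, v * i]

# usage: 1% initially infected, infection rate u = 0.5, isolation rate v = 0.2
sol = solve_ivp(siq_ode, (0, 60), [0.99, 0.01, 0.0], args=(0.5, 0.2),
                t_eval=np.linspace(0, 60, 200))
s_t, i_t, q_t = sol.y   # trajectories of s(t), i(t), q(t)
```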
for each node i, u i (t) and v i (t) represent its infecting rate and isolating rate respectively. in the study in section 2.1, the isolating rate of disease is defined as {v 1 |v 1 ∈ (0, 0.5)} when in latent period. when in invasion period, the isolating rate is defined as {v 2 |v 2 ∈ [0.5, 1)}. the infecting rate u i (t) is given by each node in status 1 tries to infect its neighbor node at time t. each try may be successful with a rate β. if j ∈ l i , set l ij = 1, and l ij = 0 otherwise. in formula parameter θ could take any of values in the set {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, which stand for the quality of information. it could be measured by the characteristics of messages like authority, accuracy, availability, etc. the higher value corresponds to the higher quality. moreover, the information quantity is defined by lima et al. have proposed that the public-facing information distribution may cause a considerable amount of neglect [4] . therefore, the directional message delivery is essential to warn the crowds who are under high risk of being infected. simultaneously, as the receiving and accumulating of messages, the newly received messages perform a gradually decreasing effect. accordingly, geometric series is adopted here to simulate the process. owing to its convergence properties, m i (t) will never be over 1. n 1 i and n 2 i , representing the total amount of direct and indirect messages respectively, are given by the messages, sent from one node (in status 2) to its direct-neighbors, are defined as direct messages, while the messages sent to the direct-neighbors' neighboring nodes are defined as indirect messages. a and b are the influence rates of two types of messages. for node i, n 1 i (t) represents the total number of its direct-neighbors who are isolated, and n 2 i (t) represents the total number of its neighbors's neighboring nodes who are isolated. represent the ratio of nodes in status 0, 1 and 2 respectively; let for the average value of infectious rate and average information quantity respectively. to facilitate understanding, a simple example is given in fig. 2 . initialize the global parameters with the length of latent period is defined as one time unit. each node in status 1 could spread disease to its neighbors and get isolated at a random probability of v 1 (in latent period) or v 2 (in invasive period). each node in status 2 sends messages to its neighbors and neighbors' neighboring nodes. those received messages will improve the nodes immunity and decrease infection rate. concretely, ''remark 1'' is calculated by all the infectors are finally isolated, and several lucky nodes become immune to the disease like node 7. this section presents the results of model behavior on complex networks. ba scale-free networks and ws small-world networks are contained, generated by algorithms. all the experiments are performed initially with two nodes in status 1 and others in status 0. if there are n nodes sorted in number in network, then the (n/3)th, (2n/3)th nodes are picked up as the initial nodes in status 1. first consider the scale-free network, which is generated by ba algorithm [25] . the network begins with m 0 nodes. every time adding a new node, we connect it to m (m < m 0 ) nodes which are already in the network. finally, the average degree of network ⟨k⟩ is equal to 2m, and degree distribution p k is approximate to k 3 . in fig. 3 , we present the results of model behavior under different conditions and situations. if without message diffusion ( fig. 
3(a) and (b) ), most of nodes become infected in the epidemic. under quarantine measures, there is a red and unimodal curve shown in fig. 3(b) , which represents the number of nodes in status 1. statistical results show that the speed of disease spreading is delayed fivefold, and the peak value is reduced by 20%. in fig. 3(c) , we present a more acceptable result, where over a third of population is preserved from the disease only if received several messages in a medium-degree quality, and the peak of infection is decreased by around 50%. let then we get three dash lines which are nearly overlapping the solid lines. it could be equivalently represented for s (t) = p 0 (t) , i (t) = p 1 (t) , q (t) = p 2 (t). moreover, a clear comparison can be seen in fig. 3(d) , where u (t), which denotes the average infection rate, is investigated under three different situations as mentioned above. another finding is shown in fig. 3 (e) and (f). as the value of v 1 , which denoted the isolation rate in latent period, grows from 0.01 to 0.5, the peak value of i (t) reduces from 0.95 to 0.05 (fig. 3(e) ); with the same parameters, we change the value of v 2 from 0.5 to 0.95, then obtain the peak value i max (t) ≈ 0.4 (fig. 3(f) ). there is a particular conclusion: the increasing of isolation rate in invasion period, as long as over 0.5, contributes little to the preventing of epidemic; the improvement of isolation rate in latent period is the key to control the spreading of disease. next in fig. 3(g) , the influences of message quality and isolation rate are measured. for the same parameters as before, the higher value of θ corresponds to the lower value of i (t). it can be predicted by i ∼ −1/θ 2 . that is another conclusion is obtained here: even though a higher value of θ corresponds to the lower infection rate, once θ > 7 the effect is changed slightly. it implies that the disease is easy to control as long as the messages achieve a certain quality. in fig. 3(h) and (i), we display the steady-status behavior of the model on ba networks with different value of m. similar results are obtained by the experiments on 1000-nodes network and 10,000-nodes network. the clustering coefficient in scale-free network [26] is get by it is obvious that there is not obvious characteristic of clustering in scale-free network. m (t ) represents the average information quantity on the stabilizing time t . q (t ) = t t=1 i(t) denotes the ultimate scale of population in quarantine. they are used to measure the population immunity and magnitude of disease. as m changes from 1 to 10, the average information quantity presents exponential distribution, which illustrates that the flow of information is highly improved as the increasing of average degree ⟨k⟩. the proportion of isolators presents a fluctuating distribution. for m = 1, both m (t ) and q (t ) are sustained at around 0.2. at this point, despite the weak immunity of population, the epidemic will not expand for the low connectivity of network. in this part, we consider the model behavior on small-world networks. the networks are generated by ws algorithm, which are firstly distributed in an irregular circular shape and contain n nodes. every node connects to k nodes nearby, with k /2 in the clockwise and k /2 in the anticlockwise and n >> k >> ln (n) >> 1. later rewire each edge at probability p. change the parameter p from 0 to 1 then the generating network is translated from regular network into random network. 
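the closed-form expressions for the accumulated information quantity m_i(t) and for the message-modified infection rate are not recoverable from the extracted text. the sketch below only illustrates the mechanism as described: a convergent geometric series, so that each additional direct or indirect message contributes less than the previous one and m_i(t) never exceeds 1, and an infection probability that shrinks with both message quantity and the quality parameter θ. the decay ratio, the weights a and b, and the θ-scaling are illustrative assumptions.

```python
def message_quantity(n_direct, n_indirect, a=0.6, b=0.3, ratio=0.5):
    """accumulated information m_i(t) in [0, 1); geometric-series form is
    assumed so that extra messages have a gradually decreasing effect."""
    m = sum(a * ratio ** (k + 1) for k in range(n_direct))      # direct messages
    m += sum(b * ratio ** (k + 1) for k in range(n_indirect))   # indirect messages
    return min(m, 0.999)

def effective_infection_rate(beta, theta, n_direct, n_indirect):
    """illustrative placeholder: base rate beta reduced by accumulated
    messages and by information quality theta (1..10)."""
    m = message_quantity(n_direct, n_indirect)
    return beta * (1.0 - m) / theta
```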
self-loops or multiple edges between nodes are not allowed. clustering coefficient in ws small-world network [27] is given by the steady-status in ws network achieves the same level (q (t ) ≈ 0.65) as in ba network by setting p = 0.01 and k = 6, which can be studied from fig. 4(a) and (b) . by increasing the value of p, the randomness of network is improved. the networks with high regularity and low degree are hard to spread disease, e.g. the networks in fig. 4(a) with p = 0.01, k = 4; the increasing k contributes to the disease spreading theoretically but finally is balanced by the increasing value of m (t ); in fig. 4(c) , we find that in a limited range, the more irregular the network is, the more easy the disease spreads; finally in fig. 4(d) , the small-world network is transformed into a completely random network by setting p = 1. there seems to be no remarkable differences between results in fig. 4(b) and (d). it illustrates that the network topologies contribute little to disease spreading, as the generating parameter p is over 0.1. fig. 4(e) and (f) give a comparison about the model behavior on scale-free networks and small-world networks. results show that the infectious disease spreads more rapid in scale-free network. epidemics in small-world networks appear three characteristics corresponding to the number of infectors: slower growth, slower convergence and lower peak. in most studies, the meta-population networks are fixed. yet in actual world, the positions of individuals change every time. thus it is meaningful to consider the regularity of human mobility when analyzing the disease spread in real world. cho, myers and leskovec have made a research about this issue [28] . as they put forward, ''short-ranged traveling is periodic both spatially and temporally and not effected by the social network structure, while long-distance travel is more influenced by social network ties''. they have proved that social relationships can cover about 10%-30% of all human movements, while periodic behavior covers 50%-70%. we conclude their achievement systematically below as the experimental basis: • human activity presents the periodic prosperity and recession. the period is on a weekly basis. from monday to friday, human activity is highly consistent, i.e., moving from home to work station then going back home. on weekends, the activities are scattered and inconclusive. • 100 km is the typical human radius of ''reach'' as it takes about 1-2 h to drive such distance. because of the geographically non-uniform distribution over the earth, human tend to cluster in cities and the homes of friends at around 100 km present a similar kink. • user home locations are not explicitly given yet can be inferred by defining the home location as the average position of check-ins within the scope of 25 km. this method to infer home locations is demonstrated with 85% accuracy. likewise, we say that user a ''visits'' her friend b if a checks-in within radius r of b's home (set r = 25 km). • long-distance travel of people is usually influence by friends. the relative influence of a friend who lives 1000 km away is 10 times greater than the influence of a friend who lives 40 km away. • there is a 30% possibility that a user who travels more than 100 km from her home will pay a visit to an existing friend's home. moreover, the probability remains constant after the 100 km mark though the number of friends decreases with the distance. • ''check-in'' data is with a spatial accuracy of about 3 km. 
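the check-in-based rules in the list above (home inference from check-ins, a 25 km "visit" radius) can be sketched as follows using the haversine distance; the home-inference step here (average of the check-ins within 25 km of the overall centroid) is a simplification of the described procedure rather than its exact form.

```python
import math

def haversine_km(p, q):
    """great-circle distance in km between two (lat, lon) points."""
    (lat1, lon1), (lat2, lon2) = p, q
    r = 6371.0
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def infer_home(checkins):
    """home = average position of the check-ins within 25 km of the overall
    centroid (simplified version of the described inference)."""
    lat_c = sum(lat for lat, _ in checkins) / len(checkins)
    lon_c = sum(lon for _, lon in checkins) / len(checkins)
    near = [c for c in checkins if haversine_km(c, (lat_c, lon_c)) <= 25] or checkins
    return (sum(lat for lat, _ in near) / len(near),
            sum(lon for _, lon in near) / len(near))

def is_visit(checkin, friend_home, r=25.0):
    """user a 'visits' friend b if a check-in falls within r km of b's home."""
    return haversine_km(checkin, friend_home) <= r
```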
above all, individuals can be geographically divided into: the intimates (with a largest probability to be infected) within 25 km; the fellows (with a larger probability to be infected) within the range of 25-100 km; the potential contacts (with a medium probability to be infected) beyond 100 km but within 1000 km; the distant contacts (with small probability to be infected) beyond 1000 km. four types could transform into each other as distance changes. table 2 . in fig. 5 we show the geographic position (left panel) and relational network (right panel), where nodes are positioned using the geographic locations of mar. 2009. initially, all the individuals are distributed around the world, mostly in north america, europe, japan and new zealand. it seems like that people who lived in america, europe and japan tend to keep a close contact with each other. especially, there are most lines scattered from america to other districts. it is supposed that americans tend to travel around the world and live in foreign countries, meanwhile keeping a close touch with their internal friends. the mechanism of disease spreading and preventing is as follows. for node i, which has been infected, the first week is defined as latent period, when it is hard to observe the obvious symptoms of disease. the infected node would keep its periodic activities in this period, and most of neighbors within scope of 25 km would be infected unconsciously. until the next week (invasion period), the node would be isolated at a high probability of v 2 . if isolated, the node would deliver immune messages to its direct and indirect neighboring nodes. if not, it would continue to spread disease within the scope of 100 km. the process is derived from eqs. (2.2) and (2.3). without any prevention measures, the disease would spread all over the world within three months (fig. 6 ). it can be attributed to the convenient and fast movement of human beings. in fig. 7(a) , the red line describes the diseases which are infectious in both latent period and invasion period, while the orange line depicts diseases which are infectious only in latent period. making a comparison, the former is over 7.14% than the latter. after analyzing, we summarize two reasons to express the weak distinction: firstly, most neighboring nodes have been infected and the space of infection becomes smaller in invasion period; secondly, the high isolation rate reduces the number of infectors. the green line in fig. 7(b) represents the number of infectors while taking preventive measures. compare to the previous red line, number of infectors decreases over 68.7%. we display the details of disease spreading process in fig. 8 . green points represent the susceptible people; red points stand for the infected people; blue points denote both the isolators and immune people (who collect enough messages). source of infection is located in north america. to simulate the free spreading process of disease, there is no prevention measure in the beginning. as we can see from the first four panels, the uncontrolled disease spreads from america to the europe and asia successively. in fact, the prevention measures are started at the fourth week but remain with a weak and limited power. we attribute it to the latent period of disease, which is set to a week basis. later in the fifth and sixth weeks, the regional infectious disease evolves into a global outbreak. most of patients are experiencing the invasion period clinically and may be isolated at a very high rate. 
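a small helper, reusing haversine_km from the previous sketch, that assigns a pair of users to the distance bands just described; the category names mirror the text, and the infection-probability ordering is indicated only in comments.

```python
def contact_category(home_a, home_b):
    """classify a pair of users by the distance between their inferred homes,
    using the 25 / 100 / 1000 km bands described above."""
    d = haversine_km(home_a, home_b)
    if d <= 25:
        return "intimate"            # largest probability of being infected
    if d <= 100:
        return "fellow"              # larger probability
    if d <= 1000:
        return "potential contact"   # medium probability
    return "distant contact"        # small probability
```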
the immune measures show their great effects right now. the covering square of blue points is expanding. after nine weeks, the epidemic is under control completely. in this paper we provide a novel model about the control of infectious diseases and perform it on different networks. the main conclusions are as follows. first, in the model, two kinds of disease prevention measures are proposed: quarantine and message delivery. under the premise of medical research, we assume the rates of patients who are isolated in latent period and invasion period are different. besides, the specific messages sent from the isolators to their direct and indirect neighbors are originally proposed. information quality and information accumulating effect are also considered in the model. second, several valuable results are obtained by simulating the model in scale-free and small-world networks: (i) the increasing of isolation rate in invasion period, as long as over 0.5, contributes little to the prevention of epidemic; the improvement of isolation rate in latent period is key to control the spread of disease. (ii) infectious diseases are easy to control as long as the propagating message achieves a certain quality. (iii) in the scale-free network with a very low degree (⟨k⟩ = 2), the epidemic will not diffuse on the network despite the weak immunity in population. (iv) compared to scale-free network, epidemics in small-world network appear three characteristics corresponding to the number of infectors: slow growth, slow convergence and lower peak. finally, in general, the mobility and communication of population can be time-varying. based on the specific research about human periodic activity, we constitute a social network which has a certain similarity with the real world, in which the activity of human presents periodic trend and certain randomness. experimental results show that the proposed strategies have great effect on the control of disease. risk factors for human disease emergence the spread of awareness and its impact on epidemic outbreaks endemic disease, awareness, and local behavioural response disease containment strategies based on mobility and information dissemination modeling the dynamical interaction between epidemics on overlay networks asymmetrically interacting spreading dynamics on complex layered networks model for rumor spreading over networks complex dynamics in a singular leslie-gower predator-prey bioeconomic model with time delay and stochastic fluctuations spreading of periodic diseases and synchronization phenomena on networks influenza epidemic spread simulation for poland-a large scale, individual based model study effects of epidemic threshold definition on disease spread statistics analysis of the impact of education rate on the rumor spreading mechanism the mathematical theory of infectious diseases and its applications incubation periods of acute respiratory viral infections: a systematic review the distribution of incubation periods of infectious diseases extending the sir epidemic model design, synthesis and biological evaluation of functionalized phthalimides: a new class of antimalarials and inhibitors of falcipain-2, a major hemoglobinase of malaria parasite fruit bats as reservoirs of ebola virus marburg hemorrhagic fever measles virus for cancer therapy effectiveness of precautions against droplets and contact in prevention of nosocomial transmission of severe acute respiratory syndrome (sars) are we there yet? 
the smallpox research agenda using variola virus hospital outbreak of middle east respiratory syndrome coronavirus a contribution to the mathematical theory of epidemics emergence of scaling in random networks mean-field theory for clustering coefficients in barabási-albert networks on the properties of small-world network models friendship and mobility: user movement in location-based social networks this research is supported by the national natural science foundation of china (nos: 61370145, and 61173183), and program for liaoning excellent talents in university (no: lr2012003). key: cord-328858-6xqyllsl authors: tajeddini, kayhan; martin, emma; ali, alisha title: enhancing hospitality business performance: the role of entrepreneurial orientation and networking ties in a dynamic environment date: 2020-07-15 journal: int j hosp manag doi: 10.1016/j.ijhm.2020.102605 sha: doc_id: 328858 cord_uid: 6xqyllsl utilizing a sample of 192 hospitality firms, this study investigates the moderating role of a dynamic environment, coupled with business and social networking ties and technology resources, on the relationship between entrepreneurial orientation and organizational performance in hospitality firms. this research is novel in that we adopt business network ties and social network ties as two moderating variables along with technology resources between entrepreneurial orientation and business performance, providing evidence on a topic which has received little attention to date. the results posit that in an uncertain, dynamic environment a higher level of risk and entrepreneurial orientation benefit business performance especially when coupled with strong business and social networks. traditionally the performance of hospitality organizations has been studied from the point of destination brand/reputation (e.g., image, perceived service quality, price) (e.g., gomez et al., 2013) , attractions (e.g., adventure, architecture, natural resources and endowments) (e.g., alhemoud and armstrong, 1996; mussalam and tajeddini, 2016) , infrastructure (e.g., efficiency of transportation, shopping, sports facilities) (molina-azorin et al., 2010) , and services (e.g., quality and variety of accommodation, food and wine) (e.g., henderson, 2011; roxas and chadee, 2013) . however, recently a number of hospitality scholars (fu et al., 2019; moghaddam et al., 2018; omerzel, 2016; taheri et al., 2019; vega-vázquez et al., 2016) have seen entrepreneurship and innovation as having a critical role in shaping this global industry and have advocated for further research. entrepreneurial activities have been examined as an antecedent of growth, competitive advantage and superior performance. despite the rising interest in entrepreneurial activities, the contemporary hospitality literature proposes that the field is lacking of strong theoretical frameworks related to this essential entrepreneurial orientation (eo) (fadda, 2018; hernández-perlines, 2016; tajeddini, 2015) . moreover, the current uncertain and highly competitive marketplace leads hospitality firms to face various economic, financial and sociocultural problems to deliver superior value to customers (o'cass and sok, 2015) . indeed, research indicates that hospitality firms have encountered greater levels of risk and competition than other industries (kwun and oh, 2004) due to the crowded and homogeneous marketplace (morgan et al., 2014) , low entry and high exit barriers (lee et al., 2016) and price conscious customers (singal, 2015) . 
in response hospitality firms have learnt to embrace an entrepreneurial spirit and introduce new products or services (anderson et al., 2015; lee et al., 2016) moving towards a more decentralized and organic organizational structures (tajeddini et al., 2017) . the inherent multi-experiential nature of hospitality firms require an entrepreneurial mindset to seek, and capture, the opportunities to offer unique experiences (e.g., new services, new package holidays) to travelers de factor global in scope, hospitality is one of the main economics antecedents of a great number of nations (french et al., 2017) and entail a multitude of services, facilities and attractions that create many entrepreneurial opportunities (fadda, 2018) . nevertheless, research about how these organizations utilize entrepreneurial capabilities and competencies to engage in and benefit from eo remains under developed (omerzel, 2016; roxas and chadee, 2013) . in particular, tajeddini (2010) argues that research capturing eo in the field is largely phenomenological and underrepresented. subsequently, this is a contradiction in the hospitality field; on the one hand, the eo of these businesses is taken for granted, while on the other hand, empirical research that exploring eo in hospitality firms is lacking (taheri et al., 2019; vega-vázquez et al., 2016) . the heightened dynamic environment (achrol, 1991) calls for cooperation, partnership and strategic alliances (jiang et al., 2016) . resource sharing to create value (jiang et al., 2016) and the formation of partnerships to promote innovation and enhance financial return are emerging (xie et al., 2010) as can be seen with formalized overbooking collaborations. this bond of mutual trust with business partners and stakeholders is essential for the sustainable success and development of hospitality businesses. cooperative entrepreneurial firms' relationships develop over time wherein mutual trust and commitment are established between partners (adjei et al., 2009) . it becomes more evident in the hospitality industry where a sizable number of firms are micro, geographically fragmented and interdependent in nature, thus with constrained resources under uncertainties (ying et al., 2016) . the cooperation between these firms can reinforce decision quality, overcome impasse, strengthen bonds between stakeholders and offer a platform for developing formal and informal inter-organizational collaboration and partnerships, thereby they are characterized as a networked system (adongo and kim, 2018; gao et al., 2017; ying et al., 2016) . the nature of service business demands that organizations interact with their consumers and company partners utilizing cooperative networks to deliver safe, reliable, and professional services at the highest possible value across international borders (kandampully, 2002; o'cass & sok, 2015) . lechner et al. (2006) stress that networks in the service industry are predominantly vital for the expansion and accomplishment of innovative organizations and supply a noteworthy source of sustainable competitive advantage. the arrival of emergent players in the lodging services has put even more pressure on the tourism and travel industry (priporas et al., 2017) . this industry is rapidly changing. such dynamic and uncertain environments require hospitality firms to strengthen their abilities to be innovative, proactive and risk taking (priporas et al., 2017) . 
while previous studies (e.g., boso et al., 2013; hoang and yi, 2015; jiang et al., 2018) have all stressed the need for more empirical research in comprehending the relationship between eo and networking in today's volatile environment, this is even more important. continuing proliferation of technology and communication, and the ongoing emergence of new players in the market promise a more dynamic competitive environment (achrol, 1991) . hospitality firms are globalized through technology and communication. in today's digital era, entrepreneurial hospitality businesses have widely employed automated modern information technology and communication systems to promote for security (cyber-crimes) (bharwani and mathews, 2012) , free exchange of ideas, data and best practices (yeniyurt et al., 2005) , better education and skills development (azadegan et al., 2019 (azadegan et al., , 2020 and faster and easier business contacts without business trips (holjevac, 2003) . determining the factors to enhance business performance is fundamental particularly exploring the role the dynamic environment plays on the outcomes of expecting high performing behaviors (rashidirad et al., 2013) . what impact does networking have on the relationship between entrepreneurial strategy-making and firms' outcome? do firms need strong ties to other organizations, and social networks? this research scrutinizes the moderating role of the dynamic environment and network ties on the relationship between entrepreneurial strategy-making, long term growth and short term financial return. the paper aims to bring together these previously researched areas to explore the relationship between the component parts. in doing so highlighting when these factors have most impact on the hospitality industry. we do so due to the divergent views currently existing on the relationship with eo and a firm's performance. previous work ranges from positive (covin and miller, 2014; lomberg et al., 2016) to insignificant (covin and slevin, 1989; george et al., 2001) and thus comprehending and understanding the complexity of eo is pertinent. this paper proposes that in a dynamic environment the level of networking hospitality businesses undertake is an important component to eo and long and short term business performance. the work goes further as it also explores the relationship and influence of both business and social networks, empirically studying different combinations of high/low business and social network ties and the interplay of these network ties and the dynamic environment. the findings support hospitality managers in taking greater risks in dynamic environments and leveraging networks to realize enhanced performance. for this study, we define and measure business "performance" in two respects: growth and financial return. utilizing data gathered from 192 japanese hospitality firms, this research offers and examines plausible assumptions concerning the interactive impacts of eo, dynamic environment and networking on service company growth and financial return. entrepreneurial orientation, often referred to as entrepreneurial strategy-making, has been characterized as an attribute of management style that favors change and supports activities related to exploiting different forms of innovation, new product/service development and the creation of superior customer value (tajeddini and trueman, 2016) . 
when embedded within strategic decision-making, eo plays an important role in firms' developing, commercializing and aggressively pursuing new products and service development and anticipating and responding to contingencies (hernández-perlines, 2016; rauch et al., 2009) . eo is underpinned by distinct strategic orientation which collectively enhances business outcomes by creating new knowledge required for establishing new capabilities and re-energizing existing resources and capabilities, fostering an innovative mindset within the firm under differing turbulent and competitive environments (cavusgil and knight, 2015; jalilvand et al., 2018; martin and javalgi, 2016; taheri et al., 2019) . arguably, a business with a strong eo focuses on gaining a superior performance by building a value-creating strategy which other competitors are unable to duplicate the benefits, or find it too costly to imitate. therefore, entrepreneurship represents an organizational strategic orientation by foregoing profits in the short term and investing in higher risk opportunities for longer-run benefits and value creation. as a result, such firms proactively produce novel and innovative products or services, creatively outperforming rivals (hernández-perlines, 2016; martin and javalgi, 2016; miller, 1983 ) and earning an above industry-average compensation (entrepreneurial surplus) (mishra, 2017) . conversely, more risk-averse businesses appear more likely to pursue an incremental process putting greater emphasis on short-term success and financial gains determined by profitability and productivity. these non-entrepreneurial organizations prefer to imitate products and services rather than innovating themselves. their risk aversion is high and more likely they are market followers rather than leaders. the relationship between a firm's performance and eo has been well established in a variety of fields such as banking (e.g., niemand et al., 2017) , international businesses (e.g., balabanis and katsikea, 2003) , travel agency (taheri et al., 2019) and hospitality (hernández-perlines, 2016; tajeddini, 2014 tajeddini, , 2015 . all of these studies indicate that eo plays a positive role in an organization's overall performance. nevertheless, lumpkin and dess (1996) argue eo should be regarded as a contextspecific. more recently fu et al. (2019) stress that eo should be investigated and analyzed in different settings due to the variances in industry characteristics such as life cycle and dynamism. scholars have stressed the idiosyncratic nature of hospitality offerings and internationalization strategies which require particular industry-based attention (dorado and ventresca, 2013; fu et al., 2019) . the hospitality industry differs from the other industries where they focus more on lifestyle, personality, and culture (liu and mattila, 2017) . often hospitality firms are regarded as archetypal entrepreneurial industries (morrison et al., 2004) embracing qualities such as risk-tolerance, resource mobilization competences (narangajavana et al., 2017) , innovativeness and proactiveness (demary, 2017) , economic growth (webster and ivanov, 2014) , job creation, change, development and innovation (ball, 2005) ; new venture creation (andringa et al., 2016) . hospitality businesses are not protected by tradition, destinations are not able to stay still as demand for new experiences and repeat visits continually pushes for new experiences and continued drive to develop product and services. 
entrepreneurial activities may foster a firm's innovative capability with a consequential impact on financial return and business outcomes (hallak et al., 2013; schuckert et al., 2015). previous studies have indeed shown that higher levels of eo are positively associated with improved hospitality and tourism firm performance (e.g., roxas and chadee, 2013). in practice, the opportunity for hospitality service companies to take financial risks and proactively deploy resources to new opportunities can vary widely (de clercq et al., 2010; taheri et al., 2019). this study posits that eo has a positive impact on hospitality and tourism organizations' performance in terms of long-term growth and short term financial return. hence, this research expects that: h1. eo has a positive impact on hospitality business performance in terms of (a) growth and (b) financial return. dynamism of the environment can be conceptualized as the rate of change and the magnitude of unpredictability, e.g. alteration in technologies, variations in consumer preferences and market demands (tajeddini and mueller, 2019). the impact of a firm's resources, capabilities and competencies on its behavior and operations is reliant on these environmental dynamism cues (koberg et al., 1996). as competition increases and customer preferences change, a faster pace evolves and the environment becomes more dynamic. as a result, the development stage from product introduction to withdrawal becomes shorter; the introduction of new tangible and intangible goods is more frequent, and information becomes outdated more quickly (tajeddini and mueller, 2019). thus, it is more complicated and difficult for firms to assimilate and predict environmental conditions, to discover the possible effects of innovative technological changes on consumer needs and behavior, and to translate them into specific and relevant activities (kabadayi et al., 2007). consider, for example, the arrival of airbnb, a fast-paced business that is gradually disrupting the hospitality industry and creating a dynamic and competitive new operating environment. such competitive pressure, coupled with fluctuating tourism demand and low levels of service differentiation among lodging firms, has intensified the pressure on management to devise novel approaches for beating rivals (priporas et al., 2017). organizations operating effectively in a dynamic setting are more likely to succeed where the expenditure and level of risk related to innovation can be recouped by capturing new market niches (koberg et al., 1996). in such an environment, companies are required to monitor marketing practices, leverage their relationship quality with customers, and undertake high levels of service/product innovation and augmentation. this stimulates substantial tangible and intangible investments in innovation-related activities to improve existing products, or develop new products (adjei et al., 2009; nandakumar et al., 2010). uncertainty about competitive products is high, market requirements shift quickly and new product development is more complex (hult et al., 2007). thus, we assume: h2. a dynamic environment strengthens the connection between eo and tourism business performance. network ties can be defined as the extent to which individuals, firms, managers or entrepreneurs within the same network are tied to each other. these ties can be either strong or weak. strong-tie relationships are characterized by frequent interaction between individuals, entrepreneurs and firms with similar interests.
this tends to reinforce and develop insights and new ideas. weak-tie relationships, however, are characterized by infrequent interactions between casual acquaintances (barringer and ireland, 2016). arguably, strong network ties facilitate communication, cooperation, frequent exchanges of information and greater dissemination of knowledge across the organization. this results in significant reductions in total costs (kraatz, 1998), stable and balanced business operations (gavirneni, 2002) and improved innovation (goes and park, 1997). according to lee et al. (2009), strong-tie relationships are triggered by a sense of companionship, comfort and security, and are part of a multilayered strategic relationship. network ties can be viewed from the point of business or social relationships. entrepreneurial firms utilize business networking as a vehicle to connect with their counterparts and enhance operational efficiency (barringer and ireland, 2016). business network ties are recognized as key facilitators of the effectiveness and efficiency of entrepreneurial strategic orientation activities in capturing emerging business opportunities (barringer and ireland, 2016; li and zhou, 2010). as menor and roth (2008) stress, world-class organizations are able to create dynamic processes through strong network ties that foster accelerated information flows along with other capabilities, providing a sustainable competitive position in the global market. there has been research on the role of eo in a dynamic environment (covin and slevin, 1989; wiklund and shepherd, 2005), the mediating effects of network resource acquisition (jiang et al., 2018), information orientation (keh et al., 2007), and technology action (choi and williams, 2016). recently, using network theory, jiang et al. (2018) find that the acquisition of resources from a firm's networks (business and government) is a mechanism by which eo enhances a firm's performance. these scholars focus on the mediating role of the network in resource acquisition. rather than treating networks as a mediating variable, whether their influence in the external environment increases in a dynamic environment is key to this study, as the speed of change hospitality firms are experiencing is unprecedented. entrepreneurial firms have a propensity to be more innovative where information flows establish an arrangement and relationship for quick interactions between agents (tajeddini and trueman, 2014, 2016). business networking has been well documented, particularly the contribution of networks to destination development (cf. heidari et al., 2018; kelliher et al., 2009; kim and shim, 2018; tinsley and lynch, 2001; welch and wilkinson, 2002), and knowledge exchange within networks is seen as creating learning communities. indeed, as kelliher et al. (2009) point out, tourism development agencies have specifically facilitated business networks among small hospitality firms to proactively help these entrepreneurs learn and regions develop. arguably, the actor bonds, resource ties and activity links within entrepreneurial hospitality firms may evolve from a single dyadic relationship, connecting to a wider web of actors, activity patterns and resource constellations that influence the business network (gao et al., 2017; tinsley and lynch, 2001). thus, we propose: h3. business network ties strengthen the connection between eo and tourism business outcomes. as lien and cao (2014) report, leveraging social contacts also enables organizations to enhance business performance.
social networks are, hence, also critical sources of information for individuals and organizations. according to statista (2017), the number of social network users has increased substantially from 2.14 billion in 2015 to 2.46 billion in 2017, and is anticipated to reach 3.02 billion by 2020. social network relationships can be defined as the individual's social ties with other social actors such as friends, family, colleagues, customers, clients, or managers with similar interests. the emergence of informal social networks has led to an unprecedented level of information sharing (kivinen and tumennasan, 2019). social media applications are effective for the diffusion of information and for interpersonal ties that provide sociability, support, information, a sense of belonging, shared personal opinions, thoughts, experiences and social identity (alalwan et al., 2017; dickinson et al., 2017; golzardi et al., 2019). the critical role of social networking has been broadly discussed in marketing practice and entrepreneurial ventures because of its capacity for sharing information and operational interactions (engel et al., 2017). these firms have utilized social networks as a vehicle to create more effective promotional strategies and aspirational branding, where customers' interactivity, involvement and relationships are stimulated and their experiences are shared. entrepreneurs build social capital through social networks to benefit from social participation. social media applications are represented in virtual platforms (e.g., instagram, flickr, linkedin, youtube, digg, google reader, facebook, twitter) and have become an indispensable means for the tourism and hospitality industry by transforming travelers from passive to active co-producers of experiences about peer-to-peer accommodation and tourism service recommendations (ge and gretzel, 2018; xiang and gretzel, 2010), thereby influencing business performance (chung et al., 2017). thus, we hypothesize: h4. social network ties strengthen the connection between eo and service business outcomes. technology resources in organizations refer to process-specific informatics technologies that are employed to support particular processes (ray et al., 2004). successful firms require not only capabilities in the areas of corporate, business and functional planning and strategy, comprehensive financial projections, and resource allocation, but also information technology resources and capabilities (e.g., technical it skills, knowledge, infrastructure) for operational processes. the technology resources employed in the service industry include networks with representatives, web-enabled customer interaction and computer-telephone integration (cti), among others (cohen and olsen, 2013). although technology resources are valuable, they can be duplicated easily at insignificant cost and thus may not directly influence business performance (hadjimanolis and dickson, 2001). given the possible benefits of technology resources alongside organizations' other heterogeneous and valuable resource portfolios (e.g., skills, knowledge, capital), it is somewhat surprising that research has not extensively explored the moderating impact of technology resources on the possible link between eo and service business performance (cohen and olsen, 2013). in an entrepreneurial firm, however, they might affect the association between eo and service performance (fig. 1). hence: h5.
technology resources strengthen the positive association between eo and service business performance. japan was selected as the location for the data collection due to the substantial growth rate of its tourism industry. tourism is increasingly becoming vital for japan's economy (honma and hu, 2012), with a continuously rising trend in average daily rates since 2012 (sawayanagi et al., 2014). according to a published mckinsey report on japan's travel industry, from 2011 to 2015 japan's inbound tourism grew by 33% a year (andonian et al., 2016). in 2016, the gross domestic product (gdp) contribution of tourism in japan was usd110.5bn (2.4% of gdp), generating 4,474,000 jobs (6.9% of total employment, with the total employment contribution estimated at 7%) (oecd, 2017). according to the world travel and tourism council (2018), japan invested around jpy3,739.6bn (usd34.4bn) in travel and tourism in 2016, 3.5 per cent of total investment, forecast to sustain 4,854,000 jobs (7.6 per cent of total employment), an increase of 1.0 per cent per annum over the period, while visitor exports generated jpy3,521.7bn (usd32.4bn), 4.4 per cent of total exports in 2016. this growth was driven by various factors such as abenomics (i.e. reform in economic and financial policies coupled with governmental reforms), depreciation of the japanese yen (cf. tajeddini et al., 2020) and the japanese government's programs to sustain the tourism and hospitality industries, which include the formation of the japan tourism agency (jta) in 2008, the launch of the visit japan campaign, the relaxation of visas for tourists and increased advertising (andonian et al., 2016). as a result of these efforts, the country enjoyed a fifth straight record year of arrivals and tourism spending in 2017. japan's policy makers have set an aspirational target to increase inbound tourism to 40 million visitors in 2020 and to triple the annual number of visitor nights in non-metropolitan areas between 2015 and 2020 (andonian et al., 2016). the government has supported up to 50% of construction costs for new hotels and has encouraged the state-run japan finance corp, along with other lenders, to provide suitable financial support to such accommodations to help them through the renovation process smoothly, in line with the state's plan (reubi, 2017). the inflow of prominent foreign hotel brands has also contributed substantially to the high growth rate of hotels in japan (jetro, 2009). our knowledge of the roles of entrepreneurial activities and networking in gaining and sustaining competitive advantage has been gained predominantly through a traditional hypothetico-deductive approach (cf. tajeddini and mueller, 2012) utilizing survey questionnaires or brief interviews. however, to avoid drawing inconsistent conclusions (creswell and plano clark, 2007) and to better understand the nuances of the concepts and their relationships, we commenced with a qualitative pilot study, which was then followed by a quantitative survey of hospitality firms. this pilot study prior to the survey was important for three reasons. first, it helped in refining the items/activities included in the questionnaire (avlonitis and papastathopoulou, 2001; kim, 2010; sampson, 2004; yin, 2014). secondly, it pre-empted possible challenges that might have occurred in the data collection and analysis process (arain et al., 2010; kim, 2010) by providing a deeper understanding of the japanese context. lastly, it revealed that building relationships through networking is crucial.
the interviews explored the prevalence of entrepreneurial orientation, network ties, and environmental dynamism among tourism and hospitality service firms. thus, a series of six semi-structured face-to-face interviews was carried out with higher-ranking tourism executives on service firm premises in tokyo. each semi-structured interview was informal, starting with general questions about experience and professional background and gradually elaborating with respondents on specific aspects of entrepreneurial strategic orientation and networking ties. rather than simply rolling out the planned conceptual relationships, these interviews explored what these tourism service managers perceive about the meaning and domain of our constructs from the literature and served to assess the face validity of the measurement scales used in the survey. some typical questions were: 'how do you seek opportunities?', 'how do you make decisions on investments that might involve risk?', 'do you regularly seek to introduce new products?', 'do you change your marketing practices and strategy rarely or extremely frequently?', 'do you have any good connections with your business partners (i.e. customers, competitors)?' and 'do you use social media to disseminate information and to engage with influential people in your industry?' all interviews lasted approximately 60 min and were transcribed verbatim for maximum comparability. transcripts were broken down into separate parts, and meaning units were used to discover the important segments of the text (cf. nowell et al., 2017). during the analysis, emerging themes were identified and compared. overarching themes were extracted through keywords, seeking to encompass a broad array of sub-themes that fall under the topics and to capture repetitions (braun and clarke, 2006). three key themes emerged from the pilot interviews: entrepreneurial dimensions, networks and the environment. respondents indicated a lack of, or weak tendency towards, proactivity and risk taking, with one interviewee stating "our limited budget does not allow us to be the first mover and often times we assess the market to adopt something new to offer". networking was frequently emphasized by respondents, who indicated that sociality and interactions with customers, suppliers, and competitors were important to their business. this is affirmed by the following quote: "it is unimaginable not to use social networking to understand our position in the market, our customers' perceptions, their needs and desires". for example, in japan it is very common to attend nomikai (drinking parties) for professional networking purposes with colleagues and clients. during nomikai, people express themselves and use the occasion as a vehicle to talk about their personal and business lives and to exchange business cards to announce their identities, signaling a sense of belonging. regarding social networking, the respondents stated that it is very important for them to attach to their communities through joining different group activities and community services. such undertakings help them build trust and cement relationships. during the interviews, several informants established the significance of the two networking ties (business and social). these interviews also confirmed the importance of the business environment, as evidenced by the following quote: "we regularly observe our competitors and update our promotion methods accordingly".
appendix a provides an overview of other key quotes for these three key themes. this element of the study was effective in refining the themes by revisiting the concepts and ratifying interpretations, and it provided a better understanding of the possible relationships between the variables. the sampling frame for the quantitative study consisted of 500 japanese hospitality firms (e.g., hotels, resorts) randomly selected from different regions in tokyo. back translation was performed to ensure a rigorous verification process for translation validity and conceptual consistency. after pre-testing the scale items with two japanese academics, a pretest was conducted with 60 managers from 30 service firms (2 managers each) in tokyo. this procedure ensured the clarity of the japanese version of the survey items, certified the quality of the research design and minimized any difficulties with the questions for the respondents. to address potential social desirability bias and to amplify the enthusiasm of key respondents, an explicit promise of anonymity and confidentiality was given. over a period of five months, surveys were sent by postal mail to the remaining 470 hospitality and tourism service organizations (two surveys per firm, 940 questionnaires), accompanied by self-addressed, stamped return envelopes. two survey questionnaires per organization were used to mitigate common method bias and to diminish single-source bias. top managers, owners and chief executive officers were asked to introduce one or two additional informants from their firms, those most experienced and knowledgeable about the firm's operations, service innovation process and performance, to fill in the survey questionnaires. the first informant (recognized as manager or owner) evaluated business performance, entrepreneurial orientation and environmental dynamism. the second informant assessed networking ties and technology resources as well as the firm's environmental dynamism and entrepreneurial orientation (having both respondents rate entrepreneurial orientation and environmental dynamism allowed for investigation of consistency within the firm). the two sources of information were then merged as one dataset per tourism firm (i.e., the unit of analysis). to enhance the motivation of respondents, explicit assurance was given that no individual responses would be revealed by the research team and that individual or organizational identifying factors would be removed. no explicit incentive was offered and a total of three reminders were utilized. twenty-one firms responded with only one survey and, since we required multiple responses per company, we removed these organizations' surveys from our final assessments. a series of 28 phone calls was made to respondents to assure key informant quality. of the 227 responses (i.e., firms), 14 completed surveys were eliminated because of extreme missing data, which, together with the removal of the 21 single-response firms, left a final sample of 192 firms (40.85% response rate). t-tests comparing early and late respondents were employed to check for potential non-response error. variables such as firm age, firm size, proactiveness, risk taking and innovativeness were compared; the resulting t-values were between 0.18 and 0.53, demonstrating no substantial differences between the two groups (p > .05), so the possibility of non-response error was negligible.
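a minimal sketch of the early versus late respondent comparison described above is shown below; the file name, the "wave" indicator and the column labels are hypothetical, since the study does not report its variable names.

```python
# Sketch: non-response bias check via independent-samples t-tests
# comparing early vs. late respondents (assumed column names).
import pandas as pd
from scipy import stats

df = pd.read_csv("hospitality_survey.csv")  # hypothetical merged firm-level dataset

early = df[df["wave"] == "early"]
late = df[df["wave"] == "late"]

for var in ["firm_age", "firm_size", "proactiveness", "risk_taking", "innovativeness"]:
    t, p = stats.ttest_ind(early[var], late[var], equal_var=False, nan_policy="omit")
    # non-significant p-values (p > .05) would suggest negligible non-response bias
    print(f"{var}: t = {t:.2f}, p = {p:.3f}")
```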
to measure eo, the nine-item scale was adopted from covin and slevin (1989), entailing three elements of strategic posture: innovation, risk-taking and proactiveness. past studies operationally delineated entrepreneurial strategy-making as an aggregate scale of these three components (miller, 1983; van doorn et al., 2017). these multidimensional variables reveal top management's behavior in making strategic decisions to shape a company's future and involve generating long term goals and plans to achieve them (see table 1a). environmental dynamism mirrors the rate of any alteration or transformation in organizational design and structure, consumer inclinations, equipment, competitors' actions, rules, policies and regulations, as well as surrounding environmental factors (tajeddini and trueman, 2016). five established bipolar items taken from khandwalla (1977) were used to assess firms' environmental conditions. these items capture the level of change in management practice, the rate of product obsolescence, and the predictability of rivals' actions, consumer preferences and production modes. participants were requested to indicate whether their organizational external environment was stable vs. dynamic and predictable vs. unpredictable (see table 1b). to measure social network ties, we used the three-item scale suggested by shane and cable (2002), reflecting executives' social ties and professional relationships with other agents. finally, to assess business network ties, we borrowed the four-item scale suggested by lau and bruton (2011), which assesses the degree to which organizations cooperate with business counterparts, including suppliers, consumers, distributors and rivals (table 1b). while objective performance measures are far more desirable than subjective performance measures, we were unable to access hard financial information, partly because managers were not willing to reveal it. despite these challenges, prior research has documented a strong correlation between subjective responses and objective measures (jaworski and kohli, 1993). thus, we used a six-item scale to evaluate a company's growth (as a proxy for long-term performance) and financial return (as a proxy for short-term performance). informants were requested to assess these facets over the last three years relative to their main rivals. to measure technology resources in customer service, a six-item scale was adopted from ray et al. (2004) to evaluate the range of technology resources/applications deployed to support the customer service process. fornell and larcker (1981) recommended that two criteria should be met for convergent validity: first, the factor loading of each scale item should be significant and above .7; second, to minimize measurement error, the average variance extracted (ave) should be above .5. as tables 1a and 1b show, each factor loading is significant at the 5 per cent level, ranging from .71 to .96 and thus above the recommended level of .7, while the lowest ave of the constructs is .63, indicating convergent validity. composite reliabilities (cr) were computed to estimate the extent to which the items were free from random error. the cr and ave of all constructs are above the suggested cut-offs (ave > .5 and cr > .7) (tables 1a and 1b).
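as an illustration of the fornell and larcker (1981) criteria referred to above, the following computes composite reliability and average variance extracted from a set of standardized loadings; the loadings shown are placeholders, not the study's estimates.

```python
# Sketch: composite reliability (CR) and average variance extracted (AVE)
# from standardized factor loadings (placeholder values).
import numpy as np

def cr_and_ave(loadings):
    loadings = np.asarray(loadings, dtype=float)
    errors = 1.0 - loadings ** 2                      # item error variances
    cr = loadings.sum() ** 2 / (loadings.sum() ** 2 + errors.sum())
    ave = (loadings ** 2).mean()
    return cr, ave

cr, ave = cr_and_ave([0.78, 0.82, 0.91])              # hypothetical three-item construct
print(f"CR = {cr:.2f} (cut-off .7), AVE = {ave:.2f} (cut-off .5)")
```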
chi-square difference tests were conducted for all 21 pairs of major constructs (7 × 6/2 = 21; i.e., eo, dynamic environment, social and business network ties, technology resources, financial return and growth) to verify whether a constrained model was significantly worse than the unconstrained model. the chi-square difference tests supported the unconstrained model for each pair of constructs (e.g., for dynamic environment and social network ties, δχ2(1) = 12.41, p < .001, exceeding the critical value of δχ2 > 3.84). heterotrait-monotrait (htmt) measures (voorhees et al., 2016) were also computed. all htmt values were below the 0.85 threshold (e.g., the htmt between business network ties and environmental dynamism was .72), further supporting discriminant validity. seven distinct controls were adopted for this study to account for their impact on the dependent variable. log-transformations of firm age, firm size, and respondents' experience were used in the analysis. the number of staff was used to evaluate the size of the organization; a firm's age was assessed as the number of years since its establishment; and a participant's experience was assessed as the number of years he or she had been in a similar business. dummy variables were included to evaluate industry type (1 = service, 0 = other industries), firm type (1 = hotels and resorts, 0 = other service enterprises), firm ownership (1 = international business, 0 = domestic business) and the participant's background (1 = other, 0 = tourism/hospitality). following the recommendation of venkatraman (1989), two distinct sets of calculations were employed to confirm unidimensionality and convergent validity. first, we computed the estimated correlation between each item and the latent construct it represented; the z-values of all coefficient estimates were significant at the 5 per cent level (z-values > ±1.96, p < .05). second, we examined the measurement model to evaluate its fit to the data using a chi-square (χ2) test and adjunct fit indexes. table 2 indicates that all factor loadings of the latent variables were significant (t > 2.0). furthermore, the shared variances between all pairs of variables showed that the aves were higher than the related shared variance in all cases, further supporting convergent validity. (tables 1a and 1b notes: loadings fixed to 1 for identification purposes; scale: 1 = strongly disagree, 7 = strongly agree. (2) model summary statistics: χ2(13) = 19.982, χ2/df = 1.537, p-value = 0.096, robust cfi = 0.986, rmsea = 0.053, delta2 = 0.987; loadings fixed to 1 for identification purposes; cr = composite reliabilities, ave = average variance extracted; scale: 1 = not at all to 7 = to a large extent. (3) model summary statistics: χ2(9) = 28.499, χ2/df = 3.167, p-value = 0.001, robust cfi = 0.976, rmsea = 0.097, delta2 = 0.976; loadings fixed to 1 for identification purposes; cr = composite reliabilities, ave = average variance extracted; scale: 0 = don't intend to implement, 1 = not yet begun, 3 = standard/common implementation, 5 = highly advanced implementation.) to examine the likely effect of multicollinearity among the interaction effects, each measurement scale was mean-centered and the interaction terms were generated by multiplying the relevant mean-centered scales (the moderators with the independent variable). multicollinearity between the variables could influence the results; thus, two multicollinearity tests were performed: variance inflation factors (vifs) and condition indices (cis).
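the mean-centering and vif checks described above can be sketched as follows; the dataset name, the variable names and the use of statsmodels are assumptions for illustration only.

```python
# Sketch: mean-centering before building interaction terms, then VIF checks.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("hospitality_survey.csv")            # hypothetical dataset

for col in ["eo", "env_dyn", "business_ties", "social_ties", "tech_resources"]:
    df[col + "_c"] = df[col] - df[col].mean()         # mean-center each scale

df["eo_x_ed"] = df["eo_c"] * df["env_dyn_c"]          # example interaction term

X = sm.add_constant(df[["eo_c", "env_dyn_c", "eo_x_ed"]])
vifs = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]
print(dict(zip(X.columns[1:], vifs)))                 # values below 10 suggest multicollinearity is not a concern
```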
the largest vif emerged from the interaction between eo and environmental dynamism, with a value of 2.371, below the benchmark of 10. condition indices (cis) were examined utilizing the square roots of the ratios of the largest eigenvalue to each successive eigenvalue. the maximum condition index extracted showed that all were less than 9.816, lower than both the stringent (15.0) and lax (30.0) threshold values (see belsley, 1991); multicollinearity was therefore unlikely to be a problem in our empirical data. to diminish any common method variance (cmv), the scale items were carefully checked to ensure that they were straightforward, explicit and short. a post hoc harman's one-factor test was run to provide further verification for cmv. the factor analysis showed that nine factors had eigenvalues greater than 1.0, together explaining 71.705% of the total variance. since factor 1 explained 25.340% of the variance (less than the majority of the total variance), cmv did not appear to be an issue (podsakoff and organ, 1986; tsai and yang, 2014). furthermore, a one-factor model was computed and compared with the measurement model; the results showed a chi-square of 734.15 with 198 df, indicating a poor fit for the single-factor model and thus no pervasive issues with cmv. cmv was also assessed by incorporating a marker variable (mv) in the survey questionnaire between the independent variable and the other independent variables (see lindell and whitney, 2001). a six-item socialization scale measuring 'socially desirable responding' was adopted from strahan and gerbasi (1972) to serve as a proxy for the mv. this scale was selected because it had no theoretical association with any of the items employed in the current research. the 'socially desirable responding' scale indicated satisfactory reliability (cronbach's alpha = .81). its items (1 = strongly disagree to 7 = strongly agree) include: (1) there have been occasions when i took advantage of someone; (2) i sometimes try to get even rather than forgive and forget; (3) at times i have really insisted on having things my own way; (4) i like to gossip at times; (5) i have never deliberately said something that hurt someone's feelings; and (6) i'm always willing to admit when i make a mistake. lindell and whitney's (2001) recommendation was followed to construct the adjusted correlation and avoid capitalizing on chance. in doing so, the second lowest positive correlation (r_m = .02) between 'socially desirable responding' and one of the other variables was selected. to examine the adjusted correlations and their statistical significance, the equations proposed by yannopoulos et al. (2013) were used, where: r_ij = the original (i.e., the pre-adjustment) correlation between constructs i and j; r_m = the marker variable adjustment (i.e., the second lowest positive correlation between the marker variable and one of the other variables); r_jm = the adjusted correlation; and t_(α/2, n-3) = the t-value of the adjusted correlation.
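the equations themselves did not survive extraction; the following is a reconstruction in the standard lindell and whitney (2001) form that the definitions above describe, offered as an assumption about the formulas reported by yannopoulos et al. (2013) rather than a verbatim reproduction.

```latex
% CMV-adjusted correlation (standard Lindell & Whitney marker-variable form; reconstruction)
r_{ij}^{A} = \frac{r_{ij} - r_{M}}{1 - r_{M}}
% t-statistic for the adjusted correlation, with n - 3 degrees of freedom
t_{\alpha/2,\, n-3} = \frac{r_{ij}^{A}}{\sqrt{\dfrac{1 - \left(r_{ij}^{A}\right)^{2}}{n - 3}}}
```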
the marker variable adjustment did not alter the sign or significance level of any correlation coefficient, suggesting that the intercorrelations presented in the framework are unlikely to be inflated by cmv. additionally, socially desirable responding was incorporated as a control variable in the hierarchical moderated regression analysis to reduce any cmv concerns. the details of the intercorrelations between the pre-adjustment and post-adjustment constructs can be found in table 2. the data revealed that a number of control variables were correlated, whilst other variables showed only modest levels of correlation (table 2). for instance, the negative correlation between size and eo (r = −.184, p < .05) suggests that smaller hotels and resorts are more likely to be innovative, proactive and willing to take risks. in addition, the positive and significant correlation between firm type and business network ties (r = .24, p < .01) suggests that hotels and resorts maintain stronger business network ties. the positive and strong relationship between technology resources and a firm's size (r = .259, p < .01) indicates that larger firms make greater use of technology resources in customer service. residuals were inspected for linearity and homoscedasticity after each step of the analysis and no violations of these assumptions were found. the proposed framework comprises interaction terms between eo and environmental dynamism, networking ties and technology resources. a moderated regression analysis served to examine the proposed assumptions (tabachnick and fidell, 1989). a stepwise regression was carried out to estimate the explanatory power of each group of variables. we performed two separate series of seven successive regression models, which evaluated the changes in the amount of variance explained (δr2) to observe the interaction effects, and established overall and incremental f-value analyses of statistical significance (see table 3). model 1 incorporates the control variables, whilst model 2 adds the direct effects of entrepreneurial orientation, environmental dynamism, business network ties, technology resources and social network ties. models 3-6 enter the four interaction effects one at a time, to diminish multicollinearity issues, uncover the distinct interaction effects and increase the interpretability of the regression coefficients (cohen et al., 2003), as in earlier eo research examining multiple interactions. models 7 and 14 comprise the four interaction terms concurrently along with the marker variable. table 3 shows the control variables accounting for 3.3% and 6.3% of the total variance in growth performance (f-value = .904, ns) and financial return (f-value = 1.754, ns) respectively. incorporating the main independent and moderator variables increased the r2 values for growth and financial return, reflected in a substantial improvement in fit statistics compared to models 1 and 8 respectively. h1 posits that the degree of eo positively affects business growth and financial return. as models 2 and 9 of table 3 show, the degree of eo was positively related to growth (β = .288, p < .001, model 2) and financial return (β = .212, p < .01, model 9), implying that h1a and h1b are supported.
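the hierarchical design described above (controls, then direct effects, then one interaction at a time, tracking the change in r2) can be sketched as follows; the formulas, variable names and file name are illustrative assumptions, not the authors' actual model specification.

```python
# Sketch: hierarchical moderated OLS regressions with incremental R-squared.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("hospitality_survey.csv")            # hypothetical firm-level dataset

controls = "firm_age + firm_size + experience + industry + firm_type + ownership + background"
main = "eo_c + env_dyn_c + business_ties_c + social_ties_c + tech_resources_c"

m1 = smf.ols(f"growth ~ {controls}", data=df).fit()                            # model 1: controls only
m2 = smf.ols(f"growth ~ {controls} + {main}", data=df).fit()                   # model 2: + direct effects
m3 = smf.ols(f"growth ~ {controls} + {main} + eo_c:env_dyn_c", data=df).fit()  # model 3: + EO x ED

for name, model in [("m1", m1), ("m2", m2), ("m3", m3)]:
    print(name, round(model.rsquared, 3))
print("delta R2 (m3 - m2):", round(m3.rsquared - m2.rsquared, 3))              # incremental variance explained
```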
although not hypothesized, model 2 in table 3 shows that environmental dynamism was positively related to business growth (β = .271, p < .001, model 2) and financial return (β = .332, p < .001, model 9). business network ties were also found to have a positive effect on growth (β = .280, p < .01, model 2) and financial return (β = .418, p < .001, model 9). however, as seen in table 3, there is no indication of a direct effect of social networking ties on growth performance (β = 0.85, ns, model 2), and we likewise observe no indication of a direct impact of social network ties on financial return performance (β = 0.027, ns, model 9). in line with h2 through h5, we included the interaction terms. (table 3 notes: unstandardized regression coefficients are reported; *p < .05, **p < .01, ***p < .001, two-tailed test; δr2 denotes the increase in r2 over the previous model.) as shown in table 3, the positive entrepreneurial orientation × environmental dynamism (i.e. eo × ed) interaction term (growth: β = 0.065, p < .001, model 3; financial return: β = 0.025, p < .05, model 10) indicates that the positive relationship between eo and performance is strengthened at higher levels of environmental dynamism, supporting h2a and h2b respectively. to clarify the nature of these interaction terms, we followed the recommendations of aiken and west (1991) and plotted the relationship between eo and performance (growth and financial return) at high and low levels of a dynamic environment (fig. 2, panels a and b, respectively), coupled with a simple slopes examination for each. fig. 2, panel a illustrates that the positive relationship between eo and growth performance is significant at high (simple slope = +.31, t-value = 4.02, p < .001) versus low (simple slope = +.32, t-value = 4.01, p < .001) levels of a dynamic environment. fig. 2, panel b depicts that the positive association between eo and financial return is significant at high (simple slope = +.29, t-value = 3.49, p < .01) versus low (simple slope = −.28, t-value = −3.97, p < .01) levels of a dynamic environment. we found support for a positive eo × business network ties interaction term (i.e. eo × bt) on growth (β = 0.090, p < .001, model 4), but the interaction of eo and business network ties with financial return is not significant (β = −0.009, p > .05 (ns), model 11). thus, we affirm h3a but not h3b. fig. 3, panel a demonstrates that the positive association between eo and growth is significant at high (simple slope = +.28, t-value = 2.82, p < .01) versus low (simple slope = +.21, t-value = 2.18, p < .05) levels of business network ties. fig. 3, panel b depicts that the positive relationship between eo and financial return is significant at high (simple slope = +.29, t-value = 3.49, p < .01) versus low (simple slope = −.28, t-value = −3.97, p < .01) levels of business network ties. h4 postulates that business growth and financial return performance will increase when eo is complemented by social network ties (sn). as shown in table 3, the positive entrepreneurial orientation × social network ties (i.e. eo × sn) interaction (growth: β = .045, p < .01, model 5; financial return: β = .062, p < .01, model 13) indicates that the positive relationship between eo and performance is strengthened at higher levels of social network ties, supporting h4a and h4b respectively.
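the aiken and west (1991) simple-slopes procedure referred to above can be sketched as evaluating the eo slope at one standard deviation above and below the moderator's mean; the variable names and file name below are illustrative assumptions.

```python
# Sketch: simple slopes at +/- 1 SD of the moderator (environmental dynamism).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("hospitality_survey.csv")            # hypothetical dataset
model = smf.ols("growth ~ eo_c * env_dyn_c", data=df).fit()

b_eo = model.params["eo_c"]                           # slope of EO when moderator = 0 (its mean)
b_int = model.params["eo_c:env_dyn_c"]                # interaction coefficient
sd_mod = df["env_dyn_c"].std()

slope_high = b_eo + b_int * sd_mod                    # simple slope at +1 SD of dynamism
slope_low = b_eo - b_int * sd_mod                     # simple slope at -1 SD of dynamism
print(f"slope at high dynamism = {slope_high:.2f}, at low dynamism = {slope_low:.2f}")
```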
fig. 4, panel a reveals that the positive relationship between eo and growth is significant at high (simple slope = +.20, t-value = 4.73, p < .001) versus low (simple slope = +.03, t-value = 0.44, p > .05 (ns)) levels of social network ties. fig. 4, panel b exhibits that the positive association between eo and financial return is significant at high (simple slope = +.24, t-value = 5.70, p < .001) versus low (simple slope = +.10, t-value = 0.44, p > .05 (ns)) levels of social network ties. h5 posits that growth and financial return performance will increase when technology resources (tr) complement eo. the interaction between eo and technology resources was found to be statistically insignificant for business growth (β = .014, p > .05 (ns), model 6), but significant for financial return (β = .062, p < .01, model 13). therefore, h5a is not supported while h5b is supported. fig. 5, panel b shows that the positive relationship between eo and financial return is significant at high (simple slope = +.34, t-value = 4.82, p < .001) versus low (simple slope = +.23, t-value = 3.31, p < .001) levels of technology resources. previous studies (e.g., zahra and hayton, 2008) suggest that the simultaneous inclusion of various interaction effects that share common variables might create methodological issues such as multicollinearity. to avoid this issue on the one hand, and to identify the exact effect of the true moderating terms on the interrelationships of the variables and constructs on the other, the interaction terms were included separately, step by step. we also carried out a series of post hoc analyses to examine the robustness of the findings and explore possible alternatives. in doing so, all four interaction terms were simultaneously included along with the mv (models 7 and 14). the results indicate that the interaction terms on growth performance (eo × business network ties (bt); eo × social network ties (sn)) were positive and significant, as expected, whereas the interaction terms eo × dynamic environment (ed) and eo × technology resources (tr) were insignificant. similarly, the interaction terms eo × ed, eo × sn and eo × tr on financial return performance were positive and significant, as expected, and only the eo × bt interaction was insignificant (p > .05). the results indicate consistency and stability in the signs of the interaction effects across both the comprehensive models (i.e., models 7 and 14) and the models containing the interaction effects separately, further supporting the robustness of the findings (covin et al., 2006). furthermore, we examined the possibility of curvilinear effects of business and social networks, in line with arguments that organizations with high business network ties benefit from possible cost reductions, economies of scale and effective targeting of marketing strategies and tactics, which may smooth the progress of business success (boso et al., 2013). a regression analysis was conducted incorporating the corresponding quadratic effects along with the two-way interaction effects. the results show that the curvilinear terms were insignificant, which increases our confidence that the observed significant interaction terms strictly reflect the proposed theoretical moderating effects.
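a minimal sketch of the post hoc curvilinear check described above is given below, adding quadratic terms for the mean-centered network-tie variables alongside the two-way interactions; the variable names and file name are assumptions for illustration only.

```python
# Sketch: testing quadratic (curvilinear) network effects alongside the interactions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("hospitality_survey.csv")            # hypothetical dataset
df["business_ties_sq"] = df["business_ties_c"] ** 2
df["social_ties_sq"] = df["social_ties_c"] ** 2

m = smf.ols("growth ~ eo_c * business_ties_c + eo_c * social_ties_c"
            " + business_ties_sq + social_ties_sq", data=df).fit()
print(m.summary().tables[1])                          # non-significant quadratic terms would support the linear moderation model
```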
the research examined the role of a dynamic environment, networking and technology resources in the relationship between eo and organizational performance. utilizing the data gathered from japanese hospitality firms, the findings clearly identified that in uncertain, dynamic environments, higher levels of risk taking and entrepreneurial orientation benefited business performance, especially when coupled with strong business and social networks. the findings also suggest that smaller hospitality businesses have a higher eo, as they are more inclined towards innovation and risk-taking. networking ties also have a positive effect on growth. this research is timely for the hospitality industry because it developed and tested an empirical model for explaining the relationship between dynamic environment, networking, technology resources, entrepreneurial orientation and organizational performance. previous research has examined only elements of these relationships (cf. ghantous and alnawas, 2020; jogaratnam and tse, 2006; kallmuenzer and peters, 2018; majid et al., 2019; rotondo and fadda, 2019; teixeira et al., 2019). building on the data from japanese hospitality firms, this research has made innovative contributions by extending knowledge on eo in the hospitality industry in a number of ways. first, the results of this research address the inconsistencies in existing empirical investigations of eo and business performance by confirming that eo positively influences short term financial return and long term business growth in creating and building value for hospitality firms (kallmuenzer et al., 2019), validating that eo theory is relevant for the hospitality industry. previous research on eo in hospitality (oktavio et al., 2019; vega-vázquez et al., 2016) highlights that eo tends to display negative or insignificant influences on business performance. these findings are significant because they assert that if hospitality businesses experiment with alternative offers, are more creative, take risks and are receptive to exploring novel products and new customers, they are more likely to succeed. to improve the performance of their business and for the longevity of the industry, growth-oriented hospitality firms should display higher levels of eo (roxas and chadee, 2013; hernández-perlines, 2016). as hypothesized, the findings revealed that the effect of eo on business outcomes is enhanced when operating in a situation with strong business and social network ties. this finding is novel, as there is little empirical research on the relationship between networking and eo (jiang et al., 2018), and in hospitality there is also little literature on this topic despite the significance of collaboration for innovation in the industry (cf. marasco et al., 2018). business networking ties are significant in helping hospitality firms to understand intra-firm and inter-firm collaboration, the institutional contexts and the provision of timely and accurate business information. these networks can improve financial performance by enabling smaller and independent businesses to compete more effectively (rotondo and fadda, 2019), as networking ties, along with information sharing and communication, are a key avenue of competitive advantage (achrol and kotler, 1999; french et al., 2017; strobl and kronenberg, 2016). networking is also important for hotels located on the periphery, as it enables them to learn and develop new knowledge, and facilitates lowering the risk of dealing with change and pursuing new opportunities.
the period of writing this paper coincided with the global covid-19 pandemic, which has already had detrimental impacts on the hospitality industry. the positive relationship between business networking ties and financial performance indicates that furthering these collaborations can be an opportunity to weather such challenging situations (brass et al., 2004; rotondo and fadda, 2019). thus, proactive networking is a benefit for hospitality businesses, especially in changing times. these results are reminiscent of the notion that interaction with external sources of knowledge and information (i.e. business network ties) can support tourism businesses' process-related activities in finding novel and effective solutions for their operations. while the quantitative research does not confirm a direct impact of social networking on short and long term performance, the findings from our pilot qualitative work showed otherwise (see appendix a). respondents agreed that social networking ties are important for the success of the business. indeed, with the exception of some shortcomings (e.g., lazzarotti and manzini, 2009) in using socio-metric (i.e. social network) items to measure collaboration in the organization, the findings reinforce prior hospitality (rotondo and fadda, 2019; strobl and kronenberg, 2016; teixeira et al., 2019) and marketing and management studies (cambra-fierro et al., 2011; lechner et al., 2006) showing that social network ties help organizations enhance performance while developing their market orientation. our findings deepen the existing research on eo in hospitality (hernández-perlines, 2016; jogaratnam, 2017; jogaratnam and tse, 2006; kallmunzer and peters, 2019; vega-vázquez et al., 2016) and in management (covin et al., 2006; lumpkin and dess, 1996) by confirming that eo is a multidimensional concept. this finding is significant for hospitality businesses because it allows them to take a more strategic approach to resource distribution, targeting where it leads to business benefits rather than allocating resources to each of the dimensions of eo. additionally, these results corroborate existing literature holding that eo must be understood from the contextual characteristics within which it is studied (lumpkin and dess, 1996; wales et al., 2019). this gap in knowledge is also noted in the hospitality literature (fu et al., 2019; njoroge et al., 2020). the positive relationship between eo and financial performance may be explained by the japanese context and the critical importance of networking as part of its business culture. the earlier varying results may have arisen because the indicators of eo were developed for western business contexts (njoroge et al., 2020). our findings are significant because they affirm that, for researchers to provide a developed understanding of eo in the hospitality industry, the context must be considered, as eo may vary accordingly. the findings also revealed that the more dynamic the environment, the better the return from eo; that smaller hospitality firms are more likely to be entrepreneurial and less risk-averse than larger firms; that business networks are highly prevalent in the tourism industry; and that social networks help the organization when it is growing fast. in hospitality, risks are always present (williams and baláž, 2014).
given the dominance of small businesses in hospitality, the findings are significant in encouraging small businesses to be less risk-averse in pursuit of improved performance. in hospitality, a higher risk tolerance leads to better outcomes for smes (martinez-roman et al., 2015) and to better performance in a dynamic environment (kreiser and david, 2012). these outcomes are noteworthy given that previous studies of small businesses in tourism reported inconsistent returns on risk taking or did not find an inclination for risk taking (kallmunzer and peters, 2018; memili et al., 2010). this higher level of risk inclination may be explained by the dynamism of the environment within which these japanese firms operate, as miller and friesen (1983) indicated that the level of risk is environment dependent. lastly, our empirical results suggest that various technological resources, through their possible effect on operational processes and strategy development, have increased the opportunities for entrepreneurial firms to expand their revenues and short term financial gains, but they are inadequate for attaining success in the long term. a plausible reason is that technological resources, while valuable, might be substitutable or duplicable over the course of time (cf. cohen and olsen, 2013; hadjimanolis and dickson, 2001). some of these technology resources require a specific level of financial stability to implement within organizations. entrepreneurial firms with scarce financial resources can harness some technology resources by employing technologies that are inexpensively available, but these provide only a temporary competitive advantage over rivals (cf. ray et al., 2004). over and above the theoretical contributions, this study has fundamental managerial implications for tourism companies. as predicted in this research, the impact of eo is dependent on the level of environmental dynamism: the more dynamic the environment, the better the return from eo. this implies that hotels should understand their business environment in order to improve their performance. however, not all businesses have the knowledge and/or capabilities for this to be realized. here, local hotel associations or government bodies can play a role by providing training and policies to encourage innovation and growth in dynamic environments. for example, in highly dynamic environments, hotels may need more access to financial resources to support them in being more innovative and taking risks. in conclusion, the moderating impact of networking ties on the connection between eo and organizational performance is supported in this research. strong social and business networking increases growth when aligned with an eo, and when the organization is growing fast social networks really come into their own. hence, with eo, firms can leverage their financial return and growth by establishing strong network relationships. managers can benefit from strong network ties to leverage intangible knowledge (e.g. domestic and overseas connections) and reduce the costs associated with searching for potential buyers, suppliers and competitors. indeed, entrepreneurs and entrepreneurial firms are not only interested in creating and developing new ventures and bearing more of the risks, but are at the same time concerned with profit-making opportunities.
through network ties and social interaction, new ideas can emerge which pave the path to exploring profitable opportunities, which in particular benefit smaller firms. for hospitality businesses, this research provides empirical support for the importance of networking in managing their expectations. hospitality managers are advised to consider forming partnership alliances, dynamic interactions and networking, both socially and in business communities, as connections may be utilized for mutual benefit and success. these notions suggest that, to achieve superior business performance, firms should detect possible valuable partnering opportunities and undertake pre-emptive activities in response. investing in enhanced technology will not bring firms these types of returns. finally, it is vital to state that academics and practitioners should be careful when generalizing the results to various cultural environments. this research was undertaken in a particular environment of hospitality firms in japan; nonetheless, the function of dynamism and networks is pertinent to other industries. moreover, in this study, performance was evaluated by growth and financial returns, while there is evidence that performance is multidimensional. fu et al. (2019) suggested that entrepreneurial research in hospitality and tourism should fuse both qualitative and quantitative research methods to provide richer insights. the qualitative and subsequent quantitative approach of this study sheds new light on the existing eo literature as it explores an avenue of growing importance: the role of the dynamic environment and network ties in the relationship between entrepreneurial strategy-making and long term growth and short term financial return. while the qualitative, in-depth approach used a limited number of respondents, it helped us better understand the nuances of business and social networking ties along with managerial decision making and entrepreneurship. throughout the discussion, we have attempted to show the complementarities and contradictions between theory and practice in tourism service businesses. a subsequent quantitative approach was adopted to collect and analyze data to establish credibility in the field. future research might utilize objective measurement scales for organizational performance to fortify the study design. another possibility for future study would be to examine the appropriate levels of strong networking ties required for a firm's success. luu and ngo (2018) have reported that strong business ties in collectivist cultures can limit a firm's eo; however, their research is underdeveloped on what the appropriate levels of networking ties are. additional research might scrutinize the impact of international business networks on business performance. since our measures of eo, dynamism, the two network-ties constructs, financial return and growth are from self-reports, we cannot obviate the concern that some of our findings may be affected by measurement error or discrepancies in the level of measurement (tellis et al., 2009). using a longitudinal study and incorporating more variables may advance our understanding of the direction of causality between the variables.
appendix a
entrepreneurial dimensions: 'demand for new service is high but we cannot afford to go for new services simply because of our low budget'… and 'low priority of the available budget'… nevertheless, other informants put more emphasis on a proclivity towards innovation: 'we have begun to use service automation and the results are satisfactory'; 'our experience shows that mobile service and self-service have enabled us to reduce our costs'; 'our genuine culture is to make every possible effort to enhance customer loyalty and satisfaction… and we do our best to pursue perfection in the details of our products and services'… one informant stated how innovation is key to the success of traditional japanese firms: 'we have adopted self-service check-in kiosks and our customers are pleased with the easy check-in and check-out'. sociality and interactions with customers, suppliers, and competitors were frequently emphasized in our interviews: 'interactions with our stakeholders is unavoidable'; 'some of the comments we get through our social networks are bitter, but we do our best to fix the problem as soon as possible… of course we cannot satisfy everybody, but we do our best'; 'we regularly observe the comments that our customers write about us, oftentimes we discuss with our colleagues the comments that we receive through social media'; 'our service is for people, and we have a good connection with our customers, travel agencies and trusted partners'; 'connection with our business partners and customers is vital for us and i guess social networking is a key to success for us'. 'if we do not use social media and networking, we can hardly survive; we are extremely dependent on social and business connections, as customers choose and evaluate us through social media, and we take seriously the need to make them happy to receive good feedback'. 'although many people believe that our industry is slow to change, we have adopted different and new technologies, and the results are very effective and satisfactory'; 'due to the nature of our business, we have to change our marketing strategy often'.
references
evolution of the marketing organization: new forms for turbulent environments
marketing in the network economy
when do relationships pay off for small retailers? exploring targets and contexts to understand the value of relationship marketing
the ties that bind: stakeholder collaboration and networking in local festivals
multiple regression: testing and interpreting interactions
social media in marketing: a review and analysis of the existing literature
image of tourism attractions in kuwait
reconceptualizing entrepreneurial orientation
the future of japan's tourism: path for sustainable growth towards 2020
hospitality entrepreneurship: a journey, not a destination
what is a pilot or feasibility study? a review of current practice and editorial policy
the development activities of innovative and non-innovative new retail financial products: implications for success
learning from near-miss events: an organizational learning perspective on supply chain disruption response
supply chain involvement in business continuity management: effects on reputational and operational damage containment from supply chain disruptions. supply chain manag.
being an entrepreneurial exporter: does it pay? int.
int the importance of entrepreneurship to hospitality, leisure, sport and tourism entrepreneurship: successfully launching new ventures conditioning diagnostics: collinearity and weak data in regression risk identification and analysis in the hospitality industry: practitioners' perspectives from india entrepreneurial orientation, market orientation, network ties, and performance: study of entrepreneurial firms in a developing economy taking stock of networks and organizations: a multilevel perspective using thematic analysis in psychology inter-firm market orientation as antecedent of knowledge transfer, innovation and value creation in networks the born global firm: an entrepreneurial and capabilities perspective on early and rapid internationalization entrepreneurial orientation and performance: mediating effects of technology and marketing action across industry types social support and commitment within social networking site in tourism experience the impacts of complementary information technology resources on the service-profit chain and competitive performance of south african hospitality firms applied multiple regression/ correlation analysis for the behavioral sciences international entrepreneurial orientation: conceptual considerations, research themes, measurement issues, and future research directions strategic management of small firms in hostile and benign environments strategic process effects on the entrepreneurial orientation-sales growth rate relationship designing and conducting mixed methods research the moderating impact of internal social exchange processes on the entrepreneurial orientation-performance relationship stepping up the game: the role of innovation in the sharing economy tourism communities and social ties: the role of online and offline tourist social networks in building social capital and sustainable practice crescive entrepreneurship in complex social problems: institutional conditions for entrepreneurial engagement toward a dynamic process model of entrepreneurial networking under uncertainty the effects of entrepreneurial orientation dimensions on performance in the tourism sector evaluating structural equation models with unobservable variables and measurement error toward a holistic understanding of continued use of social networking tourism: a mixed-methods approach the entrepreneurship research in hospitality and tourism how does market learning affect radical innovation? the moderation roles of horizontal ties and vertical ties information flows in capacitated supply chains with fixed ordering costs social media-based visual strategies in tourism marketing networking strategy of boards: implications for small and medium-sized enterprises the differential and synergistic effects of market orientation and entrepreneurial orientation on hotel ambidexterity interorganizational links and innovation: the case of hospital services detection of trust links on social networks using dynamic features what are the main factors attracting visitors to wineries? 
a pls multi-group comparison development of national innovation policy in small developing countries: the case of cyprus tourism entrepreneurship performance: the effects of place identity, self-efficacy, and gender a systematic mapping study on tourism business networks tourism development and politics in the philippines entrepreneurial orientation in hotel industry: multi-group analysis of quality certification network-based research in entrepreneurship: a decade in review a vision of tourism and the hotel industry in the 21st century analyzing japanese hotel efficiency strategic supply chain management: improving performance through a culture of competitiveness and knowledge development total quality management, corporate social responsibility and entrepreneurial orientation in the hotel industry market orientation: antecedents and consequences japan's hotel industry entrepreneurial orientation, strategic alliances, and firm performance: inside the black box entrepreneurial orientation, network resource acquisition, and firm performance: a network approach the effect of market orientation, entrepreneurial orientation and human capital on positional advantage: evidence from the restaurant industry the performance implications of designing multiple channels to fit with strategy and environment entrepreneurial behaviour, firm size and financial performance: the case of rural tourism family firms innovation as the core competency of a service organisation: the role of technology, knowledge and networks the effects of entrepreneurial orientation and marketing information on the performance of smes facilitating small firm learning networks in the irish tourism sector the design of organizations the pilot study in qualitative inquiry: identifying issues and learning lessons for culturally competent research social capital, knowledge sharing and innovation of small-and medium-sized enterprises in a tourism cluster consensus in social networks: revisited facilitators of organizational innovation: the role of life-cycle stage learning by association? 
interorganizational networks and adaptation to environmental change entrepreneurial orientation and firm performance: the unique impact of innovativeness, proactiveness, and risk-taking effects of brand, price, and risk on customers' value perceptions and behavioral intentions in the restaurant industry strategic orientations and strategies of high technology ventures in two transition economies different modes of open innovation: a theoretical framework and an empirical study firm networks and firm development: the role of the relational mix the impact of network and environmental factors on service innovativeness innovation, entrepreneurship, and restaurant performance: a higher-order structural model how foreign firms achieve competitive advantage in the chinese emerging economy: managerial ties and market orientation examining wechat users' motivations, trust, attitudes, and positive word-of-mouth: evidence from china accounting for common method variance in crosssectional research designs airbnb: online targeted advertising, sense of power, and consumer decision entrepreneurial orientation: the dimensions' shared effects in explaining firm performance clarifying the entrepreneurial orientation construct and linking it to performance entrepreneurial orientation and social ties in transitional economies role of network capability, structural flexibility and management commitment in defining strategic performance in hospitality industry collaborative innovation in tourism and hospitality: a systematic review of the literature entrepreneurial orientation, marketing capabilities and performance: the moderating role of competitive intensity on latin american international new ventures innovativeness and business performances in tourism smes the critical path to family firm success through entrepreneurial risk taking and image new service development competence and performance: an empirical investigation in retail banking the correlates of entrepreneurship in three types of firms strategy-making and environment: the third link entrepreneurial orientation transnational entrepreneurship, social networks, and institutional distance: toward a theoretical framework the importance of the firm and destination effects to explain firm performance a marketing culture to service climate: the influence of employee control and flexibility international tourism networks tourism in switzerland: how perceptions of place attributes for short and long holiday can influence destination choice business-level strategy and performance: the moderating effects of environment and structure he influence of social media in creating expectations. an empirical study for a tourist destination entrepreneurial orientation and digitalization in the financial service industry: a contingency approach entrepreneurial orientation in the hospitality industry: evidence from tanzania thematic analysis: striving to meet the trustworthiness criteria an exploratory study into managing value creation in tourism service firms: understanding value creation phases at the intersection of the tourism service firm and their customers oecd tourism trends and policies, 2017. 
oecd tourism trends and policies learning orientation, entrepreneurial orientation, innovation and their impacts on new hotel performance: evidence from surabaya the impact of entrepreneurial characteristics and organisational culture on innovativeness in tourism firms self-reports in organizational research: problems and prospects service quality, satisfaction, and customer loyalty in airbnb accommodation in thailand strategic alignment between competitive strategy and dynamic capability: conceptual framework and hypothesis development entrepreneurial orientation and business performance: an assessment of past research and suggestions for the future capabilities, business processes, and competitive advantage: choosing the dependent variable in empirical tests of the resource-based view olympics in tokyo: government backs hotel construction the influence of being part of a tourist network on hotels' financial performance effects of formal institutions on the performance of the tourism sector in the philippines: the mediating role of entrepreneurial orientation navigating the waves: the usefulness of a pilot in qualitative research tokyo 2020 olympics: expectations for the hotel industry. jil hotels & hospitality group hospitality and tourism online reviews: recent trends and future directions network ties, reputation, and the financing of new ventures how is the hospitality and tourism industry different? an empirical test of some structural characteristics short, homogeneous versions of the marlowe-crowne social desirability scale entrepreneurial networks across the business life cycle: the case of alpine hospitality entrepreneurs using multitvariate statistics investigating the influence of performance measurement on learning, entrepreneurial orientation and performance in turbulent markets effect of customer orientation and entrepreneurial orientation on innovativeness: evidence from the hotel industry in switzerland the effect of organizational structure and hoteliers' risk proclivity on innovativeness exploring the antecedents of effectiveness and efficiency the importance of human-related factors on service innovation and performance corporate entrepreneurship in switzerland: evidence from a case study of swiss watch manufacturers moderating effect of environmental dynamism on the relationship between entrepreneurial orientation and firm performance perceptions of innovativeness among iranian hotel managers environment-strategy and alignment in a restricted, transitional economy: empirical research on its application to iranian state-owned enterprises service innovativeness and the structuring of organizations: the moderating roles of learning orientation and inter-functional coordination how do hospitality entrepreneurs use their social networks to access resources? evidence from the lifecycle of small hospitality enterprises radical innovation across nations: the preeminence of corporate culture small tourism business networks and destination development the contingent value of firm innovativeness for business performance under environmental turbulence enhancing entrepreneurial orientation in dynamic environments: the interplay between top management team advice-seeking and absorptive capacity entrepreneurial orientation-hotel performance: has market orientation anything to say? 
strategic orientation of business enterprises: the construct, dimensionality, and measurement discriminant validity testing in marketing: an analysis, causes for concern, and proposed remedies entrepreneurial orientation: international, global and cross-cultural research transforming competitiveness into economic benefits: does tourism stimulate economic growth in more competitive destinations? idea logics and network theory in business marketing entrepreneurial orientation and small business performance: a configurational approach tourism risk and uncertainty travel & tourism economic impact role of social media in online travel information search overcoming barriers to innovation in smes in china: a perspective based cooperation network achieving fit between learning and market orientation: implications for new product performance a global market advantage framework: the role of global market knowledge competencies. int case study research design and methods online networking in the tourism industry: a webometrics and hyperlink network analysis the effect of international venturing on firm performance: the moderating influence of absorptive capacity
key: cord-027719-98tjnry7 authors: said, abd mlak; yahyaoui, aymen; yaakoubi, faicel; abdellatif, takoua title: machine learning based rank attack detection for smart hospital infrastructure date: 2020-05-31 journal: the impact of digital technologies on public health in developed and developing countries doi: 10.1007/978-3-030-51517-1_3 sha: doc_id: 27719 cord_uid: 98tjnry7
in recent years, many technologies have been racing to deliver the best service for human beings. emerging internet of things (iot) technologies gave rise to the notion of smart infrastructures such as smart grids, smart factories or smart hospitals. these infrastructures rely on interconnected smart devices collecting real-time data in order to improve existing procedures and systems capabilities. a critical issue in smart infrastructures is the protection of information, which may be more valuable than physical assets. therefore, it is extremely important to detect and deter any attack or breach of the network system aimed at information theft. one of these attacks is the rank attack, which is carried out by an intruder node in order to attract legitimate traffic to it and then steal the personal data of different persons (both patients and staff in hospitals). in this paper, we propose an anomaly-based rank attack detection system for an iot network using support vector machines. as a use case, we are interested in the healthcare sector and in particular in smart hospitals, which are multifaceted with many challenges such as service resilience, asset interoperability and sensitive information protection. the proposed intrusion detection system (ids) is implemented and evaluated using the contiki cooja simulator. results show a high detection accuracy and low false positive rates. nowadays, the deployment of the internet of things (iot), where many objects are connected to internet cloud services, has become highly recommended in many applications in various sectors. a highly important concept in the iot is wireless sensor networks (wsns), where end nodes rely on sensors that can collect data from the environment to ensure tasks such as surveillance or monitoring of wide areas [7]. this capability gave rise to the notion of smart infrastructures such as smart metering systems, smart grids or smart hospitals.
in such infrastructures, end devices collecting data are connected to intermediate nodes that forward the data towards border routers using routing protocols. these end nodes are in general limited in terms of computational resources, battery and memory capacity; moreover, their number is growing exponentially. therefore, new protocols have been proposed under the iot paradigm to optimize energy consumption and computation. two of these protocols are considered the de facto protocols for the internet of things (iot): rpl (routing protocol for low-power and lossy networks) and 6lowpan (ipv6 over low-power wireless personal area networks). these protocols are designed for constrained devices in recent iot applications. routing is a key part of the ipv6 stack that remains to be specified for 6lowpan networks [6]. rpl provides a mechanism whereby multipoint-to-point traffic from devices inside the low-power and lossy networks (llns) towards a central control point, as well as point-to-multipoint traffic from the central control point to the devices inside the lln, are supported [8, 9]. rpl involves many concepts that make it a flexible protocol, but also rather complex [10]:
• dodag (destination oriented directed acyclic graph): a tree-like topology that optimizes routes between the sink and the other nodes for both data collection and data distribution traffic. each node within the network has an assigned rank, which increases as nodes move away from the root node. nodes forward packets using the lowest rank as the route selection criterion.
• dis (dodag information solicitation): used to solicit a dodag information object from rpl nodes.
• dio (dodag information object): used to construct and maintain the dodag and to periodically refresh the nodes' information on the topology of the network.
• dao (destination advertisement object): used by nodes to propagate destination information upward along the dodag in order to update the information of their parents.
with the enormous number of devices that are now connected to the internet, a new solution was proposed: 6lowpan, a lightweight protocol that defines how to run ip version 6 (ipv6) over low data rate, low power, small footprint radio networks as typified by the ieee 802.15.4 radio [11]. in smart infrastructures, the huge amount of sensitive data exchanged among these modules and over radio interfaces needs to be protected. therefore, detecting any network or device breach becomes a high-priority challenge for researchers, due to the resource constraints of the devices (low processing power, battery power and memory size). the rank attack is one of the best-known rpl attacks, in which the attacker attracts other nodes to establish routes through it by advertising a false rank. this way, intruders collect all the data that pass through the network [12]. for this reason, developing specific security solutions for the iot is essential to let users seize all the opportunities it offers. one of the defense lines designed for detecting attackers is intrusion detection systems (ids) [13]. in this paper, we propose a centralized anomaly-based ids for smart infrastructures. we chose the o-svm (one-class support vector machine) algorithm for its low energy consumption compared to other machine learning algorithms for wireless sensor networks (wsn) [20]. as a use case, we are interested in smart hospital infrastructures.
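as a toy illustration of the dodag rank mechanism and of why a falsified rank is attractive to an intruder (both described above), the following python sketch builds a small dodag by letting every mote adopt the lowest-rank neighbour as its preferred parent, and then shows how a mote advertising an illegitimately low rank pulls its neighbours' upward routes towards itself. the topology, mote labels and the unit rank increment are assumptions made for this example; real rpl uses objective functions and metric containers.

```python
# Toy DODAG: ranks flood outward from the border router, each mote keeps the
# lowest-rank neighbor as preferred parent.  A malicious mote advertising a
# falsely low rank then attracts its neighbors' upward routes.
# Topology and unit rank increment are assumptions for illustration only.
from collections import deque

LINKS = {
    "BR": ["A", "B"],
    "A": ["BR", "C", "D"],
    "B": ["BR", "D"],
    "C": ["A", "E", "M"],
    "D": ["A", "B", "M"],
    "E": ["C", "D", "M"],
    "M": ["C", "D", "E"],          # M will later act as the intruder
}

def flood_ranks(root="BR"):
    """Breadth-first DIO-like flooding assigning rank = hop distance to root."""
    rank, queue = {root: 0}, deque([root])
    while queue:
        node = queue.popleft()
        for nb in LINKS[node]:
            if nb not in rank:
                rank[nb] = rank[node] + 1
                queue.append(nb)
    return rank

def preferred_parent(node, advertised_rank):
    """Each mote selects the neighbor advertising the lowest rank."""
    return min(LINKS[node], key=lambda nb: (advertised_rank[nb], nb))

honest = flood_ranks()
# rank attack: M advertises the best possible rank instead of its true one
forged = dict(honest, M=0)

for mote in ("C", "D", "E"):
    print(f"mote {mote}: parent {preferred_parent(mote, honest)}"
          f" -> {preferred_parent(mote, forged)} under the false rank")
```

in a real deployment the falsified advertisements would be observed only indirectly, through their effect on traffic and energy consumption, which is what motivates the anomaly-based detection developed for the smart hospital use case below.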
such hospitals have a wide range of resources that are essential to maintaining the safety of their operations, patients, employees and the building itself [1, 2], such as the following:
• remote care assets: medical equipment for tele-monitoring and tele-diagnosis.
• networked medical devices: wearable mobile devices (heartbeat bracelets, wireless temperature counters, glucose measuring devices...) or equipment installed to collect health-service-related data.
• networking equipment: standard equipment providing connectivity between different devices (transmission medium, router, gateway...).
• data: both clinical and patient data and staff data, which are considered the most critical asset, stored in huge datasets or private clouds.
• building and facilities: sensors distributed in the hospital building that monitor patient safety (temperature sensors for patient rooms and the operating theater and gas sensors are among the sensors used).
we target a common iot architecture that can be considered for smart hospitals. in such an architecture, there are mainly three types of components. sensing nodes: composed of remote care assets, networked medical devices and different sensors. these sensors send different types of data and information (patient and staff data, medical equipment status...); they are linked to microcontrollers and radio modules to transmit these data to the processing unit [3]. edge router: an edge router or border router is a specialized router residing at the edge or boundary of a network. this node ensures the connectivity of its network with external networks, such as a wide area network or the internet. an edge router uses an external border gateway protocol, which is used extensively over the internet to provide connectivity with remote networks. instead of providing communication with an internal network, which the core router already manages, a gateway may provide communication with different networks and autonomous systems [4]. interface module and database: this module is the terminal of the network; it contains all the data collected from the different nodes of the network and analyzes this information in order to ensure patient safety and improve the healthcare system. figure 1 [5] presents the typical iot e-health architecture, where sensors are distributed (medical equipment, room sensors and others) and send data to the iot gateway. on the one hand, this gives the medical supervisor the opportunity to monitor the patient's health status; on the other hand, these data are saved into databases for further analysis. the rest of the paper is structured as follows. section 2 presents the related work. section 3 presents the rank attack scenario. section 4 presents our proposed approach. section 5 presents our main results, and section 6 concludes the paper and presents its perspectives. rpl protocol security, especially in the healthcare domain, is a crucial aspect of preserving personal data. the node rank is an important parameter of an rpl network: it can be used for route optimization, loop prevention, and topology maintenance. in fact, the rank attack can decrease the network performance in terms of packet delivery rate (pdr) to almost 60% [23]. different solutions have been proposed to detect and mitigate rpl attacks, such as the rank authentication mechanism proposed in [24] to avoid falsely announced ranks by using a cryptographic technique. however, this technique is not very efficient because of its high computational cost and energy consumption.
the authors in [25] propose a monitoring node (mn) based scheme, but it is also not efficient because using a large network of mns causes communication overhead. in [26], the authors propose an ids called "svelte", which can only be used for the detection of simple rank attacks and has a high false alarm rate. a host-based ids was proposed in [27]; it uses a probabilistic scheme, but this is discouraged by rfc 6550 for resource-constrained networks. routing choice ("rc") was proposed by zhang et al. [28]; it is not directly related to the rank attack but is based on false preferred parent selection, and it has a high communication overhead in rpl networks. a trusted platform module (tpm) approach was proposed by seeber et al. [29]; it introduces an overlay network of tpm nodes for the detection of network attacks. the securerpl (srpl) [30] technique protects the rpl network from the rank attack; however, it is characterized by high energy consumption. therefore, anomaly-based solutions using machine learning permit a more efficient detection. the authors of [22] compared several unsupervised machine learning approaches based on the local outlier factor, nearest neighbors, the mahalanobis distance and svms for intrusion detection. their experiments showed that o-svm is the most appropriate technique to detect selective forwarding and jamming attacks; we rely on these results in our choice of o-svm. the rank attack is one of the well-known attacks against the routing protocol for low-power and lossy networks (rpl) in the network layer of the internet of things. the rank in the rpl protocol, as shown in fig. 2, is the physical position of a node with respect to the border router and its neighbor nodes [12]. since our network is dynamic due to the mobility of its nodes (sensors moving with patients...), the rpl protocol periodically reformulates the dodag. as shown in fig. 3, an attacker may insert a malicious mote into the network to attract other nodes to establish routes through it by advertising false ranks while the reformulation of the dodag is carried out [14]. by default, rpl has security mechanisms to mitigate external attacks, but it cannot mitigate internal attacks efficiently. in that case, the rank attack is considered one of the most dangerous attacks in dynamic iot networks, since the attacker either controls an existing node in the dodag (an internal attack that can affect rpl) or can probe the network and insert their own malicious node, which then acts as the attack node, as shown in fig. 4. the key features required for our solution are to be adaptive, lightweight, and able to learn from the past. we design an iot ids and we implement and evaluate it as the authors did in [18, 20]. placement choice: one of the important decisions in intrusion detection is the placement of the ids in the network. we use a centralized approach by installing the ids at the border router; therefore, it can analyze all the packets that pass through it. the centralized ids was chosen to avoid placing ids modules in constrained devices, which would require more storage and processing capabilities [15, 16] than these devices have. detection method choice: an intrusion detection system (ids) is a tool or mechanism to detect attacks against a system or a network by analyzing the activity in the network or in the system itself. once an attack is detected, an ids may log information about it and/or report an alarm [15, 16].
broadly speaking, we choose an anomaly-based detection mechanism: it tries to detect anomalies in the system by determining the ordinary behavior and using it as a baseline. any deviation from that baseline is considered an anomaly. this technique has the ability to detect almost any attack and to adapt to new environments. we chose support vector machines (svm) as the anomaly-based machine learning technique. an svm is a discriminating classifier formally defined by a separating hyperplane: given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. in a two-dimensional space this hyperplane is a line dividing the plane in two parts, with each class lying on either side. it uses a mathematical function named the kernel to reformulate the data; after these transformations, it defines an optimal borderline between the labels. mainly, it performs some rather complex data transformations to find out how to separate the data based on the labels or outputs defined. the concept of the svm learning approach is based on the definition of the optimal separating hyperplane (fig. 5) [21], which maximizes the margin of the training data [17, 18]. the choice of this machine learning algorithm comes down to one important point: compared to other algorithms, it works well with structured data such as tables of values. we implement the ids in the smart iot gateway shown in fig. 1. to investigate the effectiveness of our proposed ids, we implement three scenarios of the rank attack using the contiki-cooja simulator and assess how our ids module detects them. we present next the simulation setup and evaluation metrics, and we discuss the results achieved. our simulation scenario consists of a total of 11 motes spread across an area of 200 × 200 m (simulating a hospital area where different sensors are placed in every zone to monitor the patient rooms). the topology is shown in fig. 6 for the four scenarios. there is one sink (mote id:0, green dot) and 10 senders (yellow motes, id:1 to id:10). every mote sends packets to the sink at a rate of 1 packet every 1 min. we implement the centralized anomaly-based ids at the root mote (the sink), where we collect and analyze network data; table 1 summarizes the simulation parameters used. we run four simulation scenarios for 1 h (fig. 6):
• scenario 1: iot network without malicious motes.
• scenario 2: iot network with 1 randomly placed malicious mote.
• scenario 3: iot network with 2 randomly placed malicious motes.
• scenario 4: iot network with 4 randomly placed malicious motes.
to evaluate the accuracy of the proposed ids, we rely on the energy consumption parameter. we collect power tracking data per mote in terms of radio-on energy, radio transmission (tx) energy, radio reception (rx) energy and radio interference (int) energy. in order to calculate these metrics we used the formula of [31] (eq. 1, table 2): energy (mj) = (transmit × 19.5 ma + listen × 21.8 ma + …). we used data containing 1000 instances of consumed energy values for each node in the network. figure 7 depicts the evolution of the power tracking of each node in the four scenarios:
• scenario 1: under normal network behavior, all the sensors show regular energy consumption in terms of receiving (node 0) and sending (nodes 1 to 10). we use this simulation to collect the training data for the proposed ids.
• scenarios 2, 3 and 4: in these scenarios, the malicious motes show high sending values.
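the one-class svm step applied to such per-mote energy features can be sketched as follows with scikit-learn. this is a minimal illustration rather than the authors' implementation: the feature layout (per-window transmit and listen energy per mote), the rbf kernel and the nu and gamma settings are assumptions, and the synthetic samples merely stand in for the power-tracking traces collected in the attack-free scenario 1 and in the attack scenarios.

```python
# Minimal one-class SVM anomaly detector trained on attack-free energy traces.
# Feature layout, kernel and hyperparameters are assumptions for illustration.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# stand-in for scenario 1 (no attack): per-mote [tx_energy, rx_energy] samples
normal = rng.normal(loc=[2.0, 5.0], scale=[0.3, 0.5], size=(1000, 2))

# stand-in for an attack scenario: a malicious mote relays extra traffic,
# so both its transmit and receive energy are inflated
attack = rng.normal(loc=[6.0, 12.0], scale=[0.8, 1.0], size=(200, 2))

scaler = StandardScaler().fit(normal)
detector = OneClassSVM(kernel="rbf", nu=0.02, gamma="scale")
detector.fit(scaler.transform(normal))

def anomaly_ratio(samples):
    """Fraction of samples flagged as anomalous (-1) by the detector."""
    labels = detector.predict(scaler.transform(samples))
    return float(np.mean(labels == -1))

print("false-positive ratio on held-out normal data:",
      anomaly_ratio(rng.normal([2.0, 5.0], [0.3, 0.5], (500, 2))))
print("detection ratio on attack-like data:", anomaly_ratio(attack))
```

in this sketch the reported detection ratio plays the role of the per-scenario anomaly detection rate; with real powertrace features the detector would be trained only on scenario 1, exactly as described above.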
these elevated values are explained by the fact that when a malicious mote joins the network, it asks the other motes to recreate the dodag tree and also to send it the data that they hold, in order to steal as much data as it can; this is also why the malicious mote shows high receiving values. the other motes cannot tell that this is a malicious mote, therefore they recreate the dodag tree and send their information through the malicious node. we used the first simulation scenario, describing the normal behavior of the network, as the training dataset for our ids. this 1 h of data was enough to detect the malicious activities of the rank attack. meanwhile, each time we add a malicious mote, the anomaly detection rate increases, as shown in fig. 8. for each simulation with malicious motes, the proposed ids reports the anomaly detection ratio, which increases each time another malicious mote is added; this serves to determine the impact of the number of malicious motes relative to the normal behavior of the system. in this paper, we propose an intrusion detection system (ids) for smart hospital infrastructure data protection. the chosen ids is centralized and anomaly-based, using the o-svm machine learning algorithm. simulation results show the efficiency of the approach, with a high detection accuracy that becomes more pronounced as the number of malicious nodes increases. as future work, we are interested in developing a machine learning based ids for the detection of more rpl attacks. furthermore, we aim to extend this solution to anomaly detection in iot systems composed not only of wsn networks but also of cloud-based services.
smart hospital based on internet of things smart hospital care system middleware challenges for wireless sensor networks study of the gateway of wireless sensor networks smart e-health gateway: bringing intelligence to internet-ofthings based ubiquitous healthcare systems rpl in a nutshell: a survey evaluating and analyzing the performance of rpl in contiki rpl: ipv6 routing protocol for lowpower and lossy networks security considerations in the ip-based internet of things an implementation and evaluation of the security features of rpl ipv6 over low-power wireless personal area networks (6lowpans): overview, assumptions, problem statement, and goals rank attack using objective function in rpl for low power and lossy networks intrusion detection systems for wireless sensor networks: a survey routing attacks and countermeasures in the rpl-based internet of things a survey of intrusion detection in internet of things active learning for wireless iot intrusion detection genetic algorithm to improve svm based network intrusion detection system proceedings the 2009 international workshop on information security and application iot emulation with cooja read: reliable event and anomaly detection system in wireless sensor networks rbf kernel based support vector machine with universal approximation and its application a comparative study of anomaly detection techniques for smart city wireless sensor networks the impacts of internal threats towards routing protocol for low power and lossy network performance vera-version number and rank authentication in rpl specification-based ids for securing rpl from topology attacks svelte: real-time intrusion detection in the internet of things secure parent node selection scheme in route construction to exclude attacking nodes from rpl network intrusion detection system for rpl from routing choice intrusion towards a trust computing architecture for rpl in cyber physical systems a secure routing protocol based on rpl for internet of things impact of rpl objective functions on energy consumption in ipv6 based wireless sensor networks key: cord-103418-deogedac authors: ochab, j. k.; g'ora, p. f. title: shift of percolation thresholds for epidemic spread between static and dynamic small-world networks date: 2010-11-12 journal: nan doi: 10.1140/epjb/e2011-10975-6 sha: doc_id: 103418 cord_uid: deogedac the aim of the study was to compare the epidemic spread on static and dynamic small-world networks. the network was constructed as a 2-dimensional watts-strogatz model (500x500 square lattice with additional shortcuts), and the dynamics involved rewiring shortcuts in every time step of the epidemic spread. the model of the epidemic is sir with latency time of 3 time steps. the behaviour of the epidemic was checked over the range of shortcut probability per underlying bond 0-0.5. the quantity of interest was percolation threshold for the epidemic spread, for which numerical results were checked against an approximate analytical model. we find a significant lowering of percolation thresholds for the dynamic network in the parameter range given. the result shows that the behaviour of the epidemic on dynamic network is that of a static small world with the number of shortcuts increased by 20.7 +/1.4%, while the overall qualitative behaviour stays the same. we derive corrections to the analytical model which account for the effect. 
for both the dynamic and the static small-world networks we observe a suppression of the dependence of the average epidemic size on network size, in comparison with the finite-size scaling known for the regular lattice. we also study the effect of the dynamics for several rewiring rates relative to the latency time of the disease. epidemic modelling has become a significant and much-needed branch of complex systems research, as we have witnessed the recent epidemic threats and outbreaks of human diseases (h5n1 and h1n1 influenzas [10, 8] or severe acute respiratory syndrome [9, 2]) as well as animal (foot-and-mouth disease [6]) and plant diseases alike (e.g. dutch elm disease [14] or rhizomania [13]). there are two crucial characteristics of the epidemic spread that make it complicated to model on the one hand, and costly to prevent in reality on the other: firstly, a number of infectious diseases exhibit long-range transmissions of varied nature. 2 model: we adopt the watts-strogatz model of a small-world network [16]: first we take a 2-dimensional square lattice with n = l^2 nodes and 2n undirected edges. to avoid some finite-size effects we impose periodic boundary conditions on the lattice (i.e. we get a torus). then, we add a number of undirected edges between random pairs of nodes. the number of additional edges ('shortcuts') is set as 2φn, hence φ is the shortcut probability per underlying bond. a network with φ = 0 is just a regular lattice; for nonzero φ we call the network a static small-world. the third type of network is a dynamic small-world. one can construct it by randomly distributing shortcuts in every time step of the simulation. here, we choose 2φn nodes randomly and keep them fixed for the whole run of the epidemic; in every time step we randomly launch shortcuts anchored at these nodes, which means the dynamics consists in rewiring one end of these shortcuts. for the sake of simplicity we allow for multiple shortcuts being incident on the same node, for shortcuts leading to nearest neighbours, and for loops being formed. the construction of the source nodes launching shortcuts allows for an easier interpretation of the network: the fixed nodes could correspond to centres of activity that can be identified in real-world networks. the sir (susceptible-infectious-removed) model is adopted, where the disease is transmitted along the edges of the network in discrete time steps. the probability p of infecting a susceptible node by an infectious neighbour during one time step is set equal for short- and long-range links, both static and dynamic. the latency time l of the disease is measured in discrete time units (we take l = 3, 4). thus, an infectious node can transmit the disease to susceptible nodes with probability p every turn for a period of l turns, and after that time it is removed, i.e. it can neither infect nor be reinfected. every simulation starts with only one initially infected node, all others being susceptible, and it ends when no node in the infectious state is left. sample snapshots of the epidemic time development are presented in fig. 2. grassberger [4] related the probability of infection to the probability t in bond percolation through t = ∑_{t=1}^{l} p(1 − p)^{t−1} = 1 − (1 − p)^l, where t is the so-called transmissibility (the total probability of a node infecting one of its neighbours during the whole latency time). in the case of the 2-dimensional square lattice the bond percolation threshold is t_c = 0.5. numerical data: the linear lattice size used for most calculations is l = √n = 500.
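before turning to the parameter scan below, the construction just described and the transmissibility relation can be sketched in python as follows. the lattice size, shortcut density and random seed are toy assumptions rather than the 500 × 500 production setup, and for brevity the shortcuts are drawn between fully random node pairs instead of being anchored at fixed source nodes as in the dynamic variant.

```python
# Sketch: 2D Watts-Strogatz-type small world (square lattice + random
# shortcuts) and the transmissibility T = 1 - (1-p)^L.  Sizes and rates
# here are toy values, not the paper's production runs.
import random

def build_small_world(L=50, phi=0.1, seed=1):
    random.seed(seed)
    N = L * L
    def idx(x, y):
        return (x % L) * L + (y % L)          # periodic boundary conditions
    edges = set()
    for x in range(L):
        for y in range(L):
            edges.add(frozenset((idx(x, y), idx(x + 1, y))))   # 2N lattice bonds
            edges.add(frozenset((idx(x, y), idx(x, y + 1))))
    n_shortcuts = int(2 * phi * N)            # shortcut density per bond = phi
    shortcuts = [(random.randrange(N), random.randrange(N))
                 for _ in range(n_shortcuts)]  # loops/multi-edges allowed
    return edges, shortcuts

def transmissibility(p, latency):
    """Total probability that an infectious node infects a given neighbour."""
    return 1.0 - (1.0 - p) ** latency

edges, shortcuts = build_small_world()
print(len(edges), "lattice bonds,", len(shortcuts), "shortcuts")
print("T(p=0.2, L=3) =", round(transmissibility(0.2, 3), 4))
```

the production parameter values used in the paper follow.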
in section 5.2 we take sizes l = 50, 63, 79, 100, 126, 158, 199, 251, 315, 397, 500. the disease latency is set to l = 3 (for the faster simulations reported in sec. 5) or l = 4 (in sec. 5.3, in order to get a larger set of dynamic rates). the range of probability p scanned is p = 0.05 − 0.22 (depending on φ) with a resolution of 1/1024, which translates to around t = 0.15 − 0.5. for every p and φ the epidemic is run 1024 times, with random distributions of shortcuts each time. the fraction of shortcuts is φ = 0 − 0.5, in steps of 0.025. the simulations are performed for both the static and the dynamic small-world network. in the study of epidemic spread on networks, we stick to percolation theory as a reference point. in the theory, a percolation threshold would be the value of p that generates an epidemic cluster spanning between the boundaries of the whole system. alternatively, it is possible to define percolation as the point at which a cluster of macroscopic size forms (i.e. one that occupies a finite fraction of the system for n → ∞). we employ the latter and define the percolation threshold (numerically) as the point at which the average epidemic size divided by n rises above a certain value (here set to 0.00115). the average is taken over a number of reruns with different shortcut drawings. as we can perform simulations only for finite sizes, we take the results for a relatively large network of √n = 500. the threshold value is chosen so as to calibrate the results for the static network to the previously confirmed analytical result. as the theoretical model we take that of [7], where generating function and series expansion methods were used to find the approximate position of the bond percolation transition in a 2d small-world network, which corresponds to the epidemic spread on what we refer to as the static small-world. we can account for the change between the static and dynamic networks analytically using the model known for the static small-world network [7]. as the original theory has no time variable, it would be a hard task to introduce the dynamics explicitly. the solution, however, is astonishingly simple. one can estimate the average number of nodes infected through shortcuts during the latency time l: it is the number of shortcuts in the static network multiplied by the total probability of infecting a neighbouring node (this probability is the same for both regular links and shortcuts). the analogous expression for the dynamic network (eq. 2) is found easily; the sum appearing in it is the average number of infections transmitted by a single source of dynamic shortcuts for a given latency time. it comes from the fact that a dynamic shortcut can pass the infection on several times (contributing a factor of p for each time step), while in the static case a node can infect only once (since nodes cannot be reinfected in the sir model). this expression predicts a lowering of the percolation thresholds, although the numerical values of the shift are considerably smaller than the ones obtained from simulations. figures 3(a)-3(c) explain why the above expression is not yet correct: it is derived only for the source nodes passing the disease on, while it disregards the fact that a source node may itself become infected via a long-range link. since on the static network there is no difference between a shortcut's source and target nodes, we can attach the factor φn/2 to both infection graphs presented in fig. 3(a).
for the dynamic network, the graphs in fig. 3(b)-3(c), for infecting a source node through a regular link and through a dynamic link, give different counts of how many shortcuts were used. the former was given in eq. 2 as lp, and the latter actually utilises the same formula, but with the substitution l → l + 1. in total, we obtain a combined expression for n_dyn. we assume that n_dyn = n_stat if the epidemic on both networks has the same percolation threshold. thus, we can obtain the ratio of the two shortcut densities, eq. 4, where p is the probability of infection in one time step and l is the latency time of the disease. now, we can calculate t_c(rφ) numerically, just as we do for the fitted curve in fig. 4. the ratio in eq. 4 was used to plot the lower solid line in fig. 4. in figure 4 we plot the numerical and theoretical values of the percolation thresholds t_c for both the static and the dynamic small-worlds. the resulting t_c(φ) data points for the static small-world network agree with the analytical approximation [7], which confirms the validity of the calibration procedure. as the lower dataset marks the effect of the network dynamics, the difference between the two networks proves to be systematic and significant. the dashed line is a fit t_c[(1 + v)φ] of the analytical model for the static network, where the fitted parameter v may be interpreted as the virtual percentage of additional shortcuts needed to obtain the dynamic-network percolation thresholds. it follows from the fit that the percolation thresholds for the dynamic network are lowered as if the shortcut density were (1 + v)φ, where v = 0.207 ± 0.014 is the fitted parameter. nonetheless, qualitatively the epidemic on the dynamic small world behaves in the same way as on the static one for the given range of parameters (φ = 0.5 corresponds to every node in the network having on average two additional links). the analytical correction slightly exceeds the values of the simulation data points, but the overall agreement is satisfactory; the difference between the analytical solution and the observed behaviour does not exceed the shift between the static and dynamic networks obtained from simulations. the discrepancy might be due to the method of calculating percolation thresholds from the numerical data or due to the approximate nature of the correction. the primary motivation for checking finite-size scaling for the system was to utilise it to determine the percolation thresholds very accurately (as the shift of thresholds observed in fig. 4 is relatively small), and to arrive at the threshold value for infinite system size. yet, it is worth noting at this point that the knowledge of thresholds for infinite system sizes would not usually be appropriate for the evaluation of risks in a real epidemic, given the sizes of some real networks. to study the size of finite-size effects is thus vital in its own right. in figure 5(b) the convergence of the average epidemic size to the threshold behaviour can be observed, and the significant dependence on system size ranges up to an epidemic size of around 0.5n and an interval of transmissibility of length around 0.08 (the numbers are very rough estimates). the finite-size dependence is fitted with the form of eq. 5, as presented in fig. 5(a),
it appears that the dependence on system size for small-world networks (both static and dynamic) is dissimilar to the one of regular lattices, as can be seen in fig.5.2 (φ = 0.05) . it is suppressed to smaller values of the average epidemic size. for the shortcuts density φ = 0.5 the dependence on system size is already visible only below the epidemic size of 0.03. because the dependence of the epidemic size on size of the system becomes of the order of magnitude of statistical fluctuations (the quality of the data can already be seen in the fig.5. 2), any attempts to utilise finite-size scaling for determining percolation threshold are not viable. indeed, the errors do not allow us to check if the same form of finite-size dependence as in eq.5 holds. dependence on the rate of dynamics one can generalise the theoretical analysis for various rates of dynamics, given the formula in eq.2. to explain this, let us notice that there are two time scales in the model: the latency time l of the infection and the duration 1/d between consecutive rewirings of dynamic links (both measured in discrete time steps of the epidemic spread). as the choice of latency l only rescales the total probability of infection t = t (p, l), we can dispose of it, and the crucial parameter ld that accounts for the shift of percolation thresholds is defined as the number of shortcut movements during latency time. obviously, for a static network we get d = 0, while for all the above analysis of dynamic network we have ld = 3 (l = 3 and the rewiring was performed every turn, so d = 1). depending on the interpretation of the model, we could also consider d > 1. however, if p is to be the probability of infection during one time step it is reasonable that shortcuts rewiring faster than one time step would infect with appropriately smaller probability, and there would be no further shift of percolation thresholds. since the epidemic spreads with discrete time, which results in sums as in eq.2, we are interested in rational numbers d ∈ [0, 1]∩q, particularly of the form 1/i, i ∈ z. what we need is n dyn calculated in a similar way to that in eq.2. here, we take l = 4, d = 1, 1/2, . . . , 1/7, and we plot both the numerical and theoretical results for φ = 0.25 in fig.7 . theoretical derivation is to be found in the appendix. the theoretical approach gives slightly exceeding values (the scale should be noted), which is the same effect as discussed at the end of section 5.1. we have shown that introducing dynamics of the long-range links in a smallworld network significantly lowers an epidemic threshold in terms of probability of disease transmission, although the overall dependence on number of shortcuts stays the same. consequently, the risk of an epidemic outbreak is higher than in any calculations involving static models. the effect remains secondary to the influence that the introduction of additional of shortcuts has on the spread of the disease. it should be noted that the shift of percolation thresholds depends on the relative measure of dynamics of the network with respect to the process on the network (rewiring rate and latency time, respectively). any accurate analytical calculation or simulation should take this quantity as a significant parameter, to be estimated for a particular disease and type of the network. 
as in reality we consider only finite-size networks, and real epidemic sizes do not usually reach values of the order of even 10% of the system size, the information on finite-size effects seemed very much needed. the fact that the epidemic outbreak magnitude does not depend on the system size for small-world networks as much as it does for regular lattices means that we should not expect epidemic outbreaks below the transmissibility threshold value. thus, finite-size effects seem to become secondary as well. the usefulness of such a model for risk prediction still depends on our knowledge of the probability of transmission (p or t) of a given disease, which is not easy to obtain for diseases spreading outside of well controlled environments like hospitals. relatively good estimates, thanks to the nature of transmission, exist for syphilis. the transmissibility of the disease is reviewed in [3], where the authors give values ranging from 9.2% to 63% per partner, and decide on 60% as the lower boundary for the untreated disease. this seems to be well above the epidemic threshold, irrespective of the very different network topology for such diseases. however, this also shows that the errors in estimates of transmission probabilities exceed the effect of threshold shifting studied here. though the 2-dimensional network structure used here may be said to correspond mainly to that of plantations, it is worth noting its generality: nodes may be interpreted as plants, animals or humans, but also on a larger scale as farms, households, or cities and airports; in turn, long-range links could mean wind (on farms), disease vectors, occasional human contacts, or airline connections. still, it has some other fairly realistic characteristics: according to [11], who analysed the structure of human social interactions, 'the majority of encounters (76.70%; 75.26-78.07) occur with individuals never again encountered by the participant during the 14 days of the survey.' this may mean that the roughly 24% of repeated contacts correspond to our regular underlying lattice with z = 4 neighbours for each node, while the 76% correspond to around 3z dynamic contacts distributed over 14 days. this gives on average φ ≈ 0.20 for a simulation with daily time steps, which lies within the parameter range studied in this paper. appendix: below we present the way to calculate n_dyn for latency periods l = 4, 5 (in the simulation we set l = 4, but we need to take into account also the process from fig. 3(c), which in a sense increases the latency by 1). let us define quantities a_i(p, l), where we substitute t(1) for p on the right-hand sides and leave out the argument p in t(p, l) to simplify the notation; those quantities correspond to the average number of infections during one latency period, depending on when the rewiring takes place. one can present those diagrammatically (here for l = 5) as
a_0(p, 5) = · · · · ·
a_1(p, 5) = ·| · · · · + · · · · |· = 2 ·| · · · ·
a_2(p, 5) = · ·| · · · + · · ·| · · = 2 · ·| · · ·   (7)
a_11(p, 5) = ·| · · ·| ·
a_12(p, 5) = ·| · ·| · · + · ·| · ·|· = 2 ·| · ·| · ·
where the symbol '|' marks a rewiring and '·' one epidemic time step during the latency period. for instance, ·| · · would correspond to three turns with one rewiring, during which either 0, 1 or 2 infections are possible. the derivation involves only very easy combinatorics, but for longer latency periods one would need to repeat these calculations to obtain more terms and different prefactors. now, one can easily obtain expressions for n_dyn for any 1/d ∈ z; below we give only the general expression for 1/d ≥ l, where l = 4. the first term in the brackets corresponds to fig. 3(b) and the second to fig. 3(c). for greater numbers of rewirings per turn d, we need to consider the terms a_11 and a_12. the result is plotted against simulated data in fig. 7.
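the combinatorial bookkeeping behind these diagrams can be illustrated with a short enumeration; the snippet below simply lists all placements of one or two rewirings in a five-step latency window and groups mirror-symmetric diagrams, which reproduces the prefactors 2 appearing above. it is an illustration of the counting only: the paper uses just a subset of the two-rewiring diagrams for the rates it considers, so the extra groups printed here are merely part of the enumeration.

```python
# Enumerating the rewiring diagrams of the appendix for a latency window of
# five time steps: '.' is one time step, '|' a rewiring placed in one of the
# gaps between steps.  Mirror-symmetric diagrams are grouped, reproducing the
# multiplicity 2 in front of the a_1, a_2 and a_12 terms.  Illustration only.
from itertools import combinations

STEPS = 5
GAPS = STEPS - 1                      # possible positions of a rewiring

def diagram(bars):
    out = []
    for gap in range(STEPS):
        out.append(".")
        if gap in bars:
            out.append("|")
    return "".join(out)

def mirror(bars):
    return tuple(sorted(GAPS - 1 - b for b in bars))

for n_bars in (1, 2):
    groups = {}
    for bars in combinations(range(GAPS), n_bars):
        key = min(bars, mirror(bars))
        groups.setdefault(key, []).append(diagram(bars))
    print(f"{n_bars} rewiring(s):")
    for key, members in sorted(groups.items()):
        print(f"  {len(members)} x {members[0]}")
```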
modelling control of epidemics spreading by long-range interactions modeling the sars epidemic the natural history of syphilis: implications for the transmission dynamics and control of infection on the critical behavior of the general epidemic process and dynamical percolation epidemic dynamics on an adaptive network dynamics of the 2001 uk foot and mouth epidemic: stochastic dispersal in a heterogeneous landscape percolation and epidemics in a two-dimensional small world world health organization. avian influenza (h5n1) world health organization. severe acute respiratory syndrome (sars) world health organization. swine influenza (h1n1) dynamic social networks and the implications for the spread of infectious disease modelling development of epidemics with dynamic small-world networks a model for the invasion and spread of rhizomania in the united kingdom: implications for disease control strategies dutch elm disease and the future of the elm in the uk: a quantitative analysis susceptible-infected-recovered epidemics in dynamic contact networks collective dynamics of 'small-world' networks
this work is supported by the international phd projects programme of the foundation for polish science within the european regional development fund of the european union, agreement no. mpd/2009/6.
key: cord-220116-6i7kg4mj authors: mukhamadiarov, ruslan i.; deng, shengfeng; serrao, shannon r.; priyanka,; nandi, riya; yao, louie hong; tauber, uwe c. title: social distancing and epidemic resurgence in agent-based susceptible-infectious-recovered models date: 2020-06-03 journal: nan doi: nan sha: doc_id: 220116 cord_uid: 6i7kg4mj
once an epidemic outbreak has been effectively contained through non-pharmaceutical interventions, a safe protocol is required for the subsequent release of social distancing restrictions to prevent a disastrous resurgence of the infection. we report individual-based numerical simulations of stochastic susceptible-infectious-recovered model variants on four distinct spatially organized lattice and network architectures wherein contact and mobility constraints are implemented. we robustly find that the intensity and spatial spread of the epidemic recurrence wave can be limited to a manageable extent provided release of these restrictions is delayed sufficiently (for a duration of at least thrice the time until the peak of the unmitigated outbreak) and long-distance connections are maintained on a low level (limited to less than five percent of the overall connectivity). the covid-19 pandemic constitutes a severe global health crisis. many countries have implemented stringent non-pharmaceutical control measures that involve social distancing and mobility reduction in their populations. this has led to remarkably successful deceleration and significant "flattening of the curve" of the infection outbreaks, albeit at tremendous economic and financial costs.
at this point, societies are in dire need of designing a secure (partial) exit strategy wherein the inevitable recurrence of the infection among the significant non-immune fraction of the population can be thoroughly monitored with sufficient spatial resolution and reliable statistics, provided that dependable, frequent, and widespread virus testing capabilities are accessible and implemented. until an effective and safe vaccine is widely available, this would ideally allow the localized implementation of rigorous targeted disease control mechanisms that demonstrably protect people's health while the paralyzed branches of the economy are slowly rebooted. mathematical analysis and numerical simulations of infection spreading in generic epidemic models are crucial for testing the efficacy of proposed mitigation measures, and the timing and pace of their gradual secure removal. specifically, the employed mathematical models need to be (i) stochastic in nature in order to adequately account for randomly occurring or enforced disease extinction in small isolated communities, as well as for rare catastrophic infection boosts and (ii) spatially resolved such that they properly capture the significant emerging correlations among the susceptible and immune subpopulations. these distinguishing features are notably complementary to the more detailed and comprehensive computer models utilized by researchers at the university of washington, imperial college london, the virginia bioinformatics institute, and others: see, e.g., (1) (2) (3) (4) (5) (6) . we report a series of detailed individual-based kinetic monte carlo computer simulation studies for stochastic variants (7, 8) of the paradigmatic susceptible-infectious-recovered (sir) model (9, 10) for a community of about 100,000 individuals. to determine the robustness of our results and compare the influence of different contact characteristics, we ran our stochastic model on four distinct spatially structured architectures, namely i) regular two-dimensional square lattices, wherein individuals move slowly and with limited range, i.e., spread diffusively; ii) two-dimensional small-world networks that in addition incorporate substantial long-distance interactions and contaminations; and finally on iii) random as well as iv) scale-free social contact networks. for each setup, we investigated epidemic outbreaks with model parameters informed by the known covid-19 data (4). to allow for a direct comparison, we extracted the corresponding effective infection and recovery rates by fitting the peak height and the half-peak width of the infection growth curves with the associated classical deterministic sir rate equations that pertain to a wellmixed setting. we designed appropriate implementations of social distancing and contact reduction measures on each architecture by limiting or removing connections between individuals. this approach allowed us to generically assess the efficacy of non-pharmaceutical control measures. although each architecture entails varied implementations of social distancing measures, we find that they all robustly reproduce both the resulting reduced outbreak intensity and growth speed. as anticipated, a dramatic resurgence of the epidemic occurs when mobility and contact restrictions are released too early. yet if stringent and sufficiently long-lasting social distancing measures are imposed, the disease may go extinct in the majority of isolated small population groups. 
in our spatially extended lattice systems, disease spreading then becomes confined to the perimeters of a few larger outbreak regions, where it can be effectively localized and specifically targeted. for the small-world network architecture, it is however imperative that all long-range connections remain curtailed to a very low percentage for the control measures to remain effective. intriguingly, we observe that an infection outbreak spreading through a static scale-free network effectively randomizes its connectivity for the remaining susceptible nodes, whence the second wave encounters a very different structure. in the following sections, we briefly describe the methodology and algorithmic implementations as well as pertinent simulation results for each spatial or network structure; additional details are provided in the supplementary materials. we conclude with a comparison of our findings and a summary of their implications.
our first architecture is a regular two-dimensional square lattice with linear extension L = 448 subject to periodic boundary conditions (i.e., on a torus). initially, N = S(0) + I(0) + R(0) = 100,000 individuals with fixed density ρ = N/L² ≈ 0.5 are randomly placed on the lattice, with at most one individual allowed on each site. almost the entire population begins in the susceptible state S(0); we start with only 0.1 % infected individuals, I(0) = 100, and no recovered (immune) ones, R(0) = 0. all individuals may then move to neighboring empty lattice sites with diffusion rate d (here we set this hopping probability to 1). upon encounter, infectious individuals irreversibly change the state of neighboring susceptible ones with set rate r: S + I → I + I. any infected individual spontaneously recovers to an immune state with fixed rate a: I → R. (details of the simulation algorithm are presented in the supplementary materials.) for the recovery rate, we choose 1/a ≅ 6.667 days (a = 0.15 per monte carlo step, mcs), informed by known covid-19 characteristics (4). to determine the infection rate r, we run simulations for various values, fit the peak height and width of the ensuing epidemic curves with the corresponding sir rate equations to extract the associated basic reproduction ratio R_0 (as explained in the supplementary materials, see figure s1), and finally select the value of r for our individual-based monte carlo simulations that reproduces the R_0 ≈ 2.4 for covid-19 (4). we perform 100 independent simulation runs with these reaction rates, from which we obtain the averaged time tracks for I(t) and R(t), while of course S(t) = N − I(t) − R(t) and R(t) = a ∫_0^t I(t′) dt′.
the standard classical sir deterministic rate equations assume a well-mixed population and constitute a mean-field type of approximation wherein stochastic fluctuations and spatial as well as temporal correlations are neglected; see, e.g., (11, 12). near the peak of the epidemic outbreak, when many individuals are infected, this description is usually adequate, albeit with coarse-grained `renormalized' rate parameters that effectively incorporate fluctuation effects at short time and small length scales. however, the mean-field rate equations are qualitatively insufficient when the infectious fraction I(t)/N is small, whence both random number fluctuations and the underlying discreteness and associated internal demographic noise become crucial (11)-(13). already near the epidemic threshold, which constitutes a continuous dynamical phase transition far from thermal equilibrium (c.f. figure s3 in the supplementary materials),
the kinetics is dominated by strong critical point fluctuations. these are reflected in characteristic initial power laws rather than simple exponential growth of the I(t) and R(t) curves (14), as demonstrated in figure s1 (supplemental information). nor can the deterministic rate equations capture stochastic disease extinction events that may occur at random in regions where the infectious concentration has locally reached small values. the rate equations may be understood to pertain to a static and fully connected network; in contrast, the spreading dynamics in a spatial setting continually rewires any infectious links, keeping the epidemic active (6, 15). consequently, once the epidemic outbreak threshold is exceeded, the sir rate equations markedly underestimate the time-integrated outbreak extent reflected in the ultimate saturation level R_∞ = R(t → ∞), as is apparent in the comparison in figure s1 (supplemental information).
once the instantaneous infectious fraction of the population has reached the threshold of 10 %, I(t) = 0.1 N, we initiate stringent social distancing, which we implement through a strong repulsive interaction between any occupied lattice sites (with n_i = 1), irrespective of their states S, I, or R, and correspondingly an attractive force between filled and empty (n_i = 0) sites, namely the ising lattice gas potential energy V({n_i}) = K Σ_{<i,j>} (2 n_i − 1)(2 n_j − 1) with dimensionless strength K = 1, where the sum extends only over nearest-neighbor pairs on the square lattice. the transfer of any individual from an occupied to an adjacent empty site is subsequently determined through the ensuing energy change ΔV by the metropolis transition probability w = min{1, exp(−ΔV)} (16, 17), which replaces the unmitigated hopping rate d. as a result, both the mobility and any direct contact between individuals on the lattice are quickly and drastically reduced. for sufficiently small total density ρ = N/L², most of the individuals eventually become completely isolated from each other. for our ρ = 0.5, the disease will continue to spread for a short period, until the repulsive potential has induced sufficient spatial anti-correlations between the susceptible individuals. the social-distancing interaction is sustained for a time duration T, and then switched off again. with increasing mitigation duration T, the likelihood for the disease to locally go extinct in isolated population clusters grows markedly. as seen in the bottom row of the corresponding simulation snapshots, the prevalence and spreading of the infection thus become confined to the perimeters of a mere few remaining centers. hence we observe drastically improved mitigation effects for extended T: as shown in figure 2, the resurgence peak in the I(t) curve assumes markedly lower values and is reached after much longer times. in fact, the time τ(T) for the infection outbreak to reach its second maximum increases exponentially with the social-distancing duration, as evidenced in the inset of figure 2 (see also figure 6 below). we emphasize that localized disease extinction and spatial confinement of the prevailing disease clusters represent correlation effects that cannot be captured in the sir mean-field rate equation description.
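to make the social-distancing mechanism concrete, the following minimal python sketch shows how a single metropolis hop attempt with the repulsive lattice-gas interaction could be implemented; the reduced lattice size, the random seed, and the omission of the disease labels S, I, R are simplifications for illustration and do not reproduce the authors' own code.

```python
import numpy as np

rng = np.random.default_rng(1)

L = 64          # reduced lattice size for illustration (the study uses L = 448)
K = 1.0         # dimensionless strength of the repulsive lattice-gas potential

# occupation numbers n in {0, 1}; the S/I/R labels would be stored in a separate array
occ = (rng.random((L, L)) < 0.5).astype(int)

def bond_energy(occ, sites):
    """sum of K (2 n_a - 1)(2 n_b - 1) over all nearest-neighbor bonds touching 'sites'."""
    e, counted = 0.0, set()
    for (i, j) in sites:
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            a, b = (i, j), ((i + di) % L, (j + dj) % L)
            bond = frozenset((a, b))
            if bond not in counted:
                counted.add(bond)
                e += K * (2 * occ[a] - 1) * (2 * occ[b] - 1)
    return e

def distancing_hop(occ, i, j):
    """attempt to move the individual at (i, j) to a random adjacent empty site,
    accepting with the metropolis probability w = min{1, exp(-dV)}."""
    if occ[i, j] == 0:                        # no individual on this site
        return False
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))
    di, dj = moves[rng.integers(4)]
    i2, j2 = (i + di) % L, (j + dj) % L
    if occ[i2, j2] == 1:                      # target occupied: hop impossible
        return False
    sites = [(i, j), (i2, j2)]
    e_old = bond_energy(occ, sites)
    occ[i, j], occ[i2, j2] = 0, 1             # trial move
    dV = bond_energy(occ, sites) - e_old
    if rng.random() < min(1.0, np.exp(-dV)):
        return True                           # accept the hop
    occ[i, j], occ[i2, j2] = 1, 0             # reject: undo the move
    return False

accepted = sum(distancing_hop(occ, *rng.integers(0, L, size=2)) for _ in range(10_000))
print("accepted hops:", accepted)
```

because isolated individuals face a positive energy change for moving next to anyone else, repeated application of this update quickly suppresses both mobility and contacts, as described above.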
in modern human societies, individuals as well as communities feature long-distance connections that represent `express' routes for infectious disease spreading in addition to short-range links with their immediate neighbors. to represent this situation, we extend our regular lattice with diffusive propagation to a two-dimensional newman-watts small-world network (18), which was previously applied to the study of plant disease proliferation (19). in contrast to the watts-strogatz model (20), in which the small-world property is generated through rewiring bonds of a one-dimensional chain of sites, a newman-watts small-world network may be constructed as follows: for each nearest-neighbor bond, a long-distance link (or `short-cut') is added with probability φ between randomly chosen pairs of vertices. as illustrated in figure s2 (supplemental information), the resulting network features 2φL² long-distance links, with mean coordination number <k> = 4(1 + φ). again, each vertex may be in either of the states S, I, R, or empty, and each individual can hop to another site along any (nearest-neighbor or long-distance) link with a total diffusivity d. a typical snapshot of the sir model on this small-world architecture is shown in figure s2 (supplemental information). the unmitigated simulation parameters are: L = 1,000, N = 100,000, I(0) = 100, d = 1, and φ = 0.6. the presence of long-range links increases the mean connectivity, rendering the population more mixed, which in turn significantly facilitates epidemic outbreaks (see figure s4 in the supplemental information). we remark that for the sir dynamics, the newman-watts small-world network effectively interpolates between a regular two-dimensional lattice and a scale-free network dominated by massively connected hubs; moreover, as the hopping probability d → 0, the small-world network is effectively rendered static.
in the two-dimensional small-world network, we may introduce social-distancing measures through two distinct means: i) we can globally diminish mobility by adopting a reduced overall diffusivity d′ < 1; and/or ii) we can drastically reduce the probability of utilizing a long-distance connection to d_φ ≪ 1. we have found that the latter mitigation strategy, curtailing the infection short-cuts into distant regions, has a far superior effect. therefore, in figure 3 we display the resulting data for such a scenario where we set d_φ = 0.05, yet kept the diffusivity unaltered at d = 1; as before, this control was triggered once I(t) = 0.1 N had been reached in the course of the epidemic. the resurgence peak height and growth rate become even more stringently reduced with extended mitigation duration than for the (distinct) social distancing measures implemented on the regular lattice.
finally, we run the stochastic sir kinetics on two different static structures, namely i) randomly connected and ii) scale-free contact networks. each network node may be in either the S, I, or R configuration, subject to the sir reaction rules, but we do not allow movement among the network vertices. for the random network, we uniformly distribute 1,000,000 edges among N = 100,000 nodes; this yields a poisson distribution for the connectivity with preset mean (equal to the variance) <k> = (Δk)² = 20. for the scale-free network, we employ the barabasi-albert graph construction (21), where each new node is added successively with k = 4 edges, to yield a total of 799,980 edges.
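as a rough illustration of how these interaction structures could be generated (at much smaller sizes than in the study, and using the networkx library rather than the authors' implementation), consider the following sketch; the sizes and seeds are arbitrary choices for demonstration.

```python
import random
import networkx as nx

random.seed(2)

# i) regular two-dimensional square lattice with periodic boundaries (a torus)
L = 50                                            # the study uses L = 448 (lattice) and L = 1000 (small world)
lattice = nx.grid_2d_graph(L, L, periodic=True)

# ii) two-dimensional newman-watts small world: for every nearest-neighbor bond,
#     add a long-distance 'short-cut' between two random vertices with probability phi
phi = 0.6
nw = lattice.copy()
nodes = list(nw.nodes())
for _ in lattice.edges():
    if random.random() < phi:
        u, v = random.sample(nodes, 2)
        nw.add_edge(u, v)                         # on average 2 * phi * L**2 short-cuts

# iii) random contact network with a fixed number of uniformly placed edges (poissonian degrees)
n_nodes, n_edges = 10_000, 100_000                # the study uses 100,000 nodes and 1,000,000 edges
random_net = nx.gnm_random_graph(n_nodes, n_edges, seed=3)

# iv) scale-free contact network grown by barabasi-albert preferential attachment
ba_net = nx.barabasi_albert_graph(n_nodes, m=4, seed=4)   # each new node attaches with 4 edges

print("short-cuts added:", nw.number_of_edges() - lattice.number_of_edges())
print("mean degree of the random net:", 2 * random_net.number_of_edges() / n_nodes)
```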
the connectivity properties of these two quite distinct architectures are vastly different, since the scale-free networks feature prominent `hubs' through which many other nodes are linked. in the epidemic context, these hubs represent super-spreader centers through which a large fraction of the population may become infected (8, 22). to implement the stochastic sir dynamics on either contact network, we employ the efficient rejection-free gillespie dynamical monte carlo algorithm: each reaction occurs successively, but the corresponding time duration between subsequent events is computed from the associated probability function (23) (for details, see the supplemental information). the random social network may be considered an emulation of the well-connected mean-field model. indeed, we obtain excellent agreement for the temporal evolution of the sir kinetics in these two systems with a = 0.15 per mcs (for the scale-free network, a small adjustment to an effective mean-field recovery rate a ≈ 0.18 per mcs is required). we implement a `complete lockdown' mitigation strategy: once the threshold I(t) = 0.1 N has been reached, we immediately cut all links for a subsequent duration T; during that time interval, only spontaneous recovery I → R can occur. in figure 4, we discern a markedly stronger impact of this lockdown on the intensity of the epidemic resurgence in both of these static contact network architectures, see also figure 6a below. on the other hand, the mitigation duration influences the second infection wave less strongly, with the time until its peak has been reached growing only linearly with T: τ(T) ~ T, as is visible in figure 6b. there is however a sharp descent in the resurgent peak height beyond an apparent threshold T > 7/a for the random network, and T > 8/a for the scale-free network. for both the two-dimensional regular lattice and the small-world structure, a similar sudden drop in the total number of infected individuals (figure 6b) requires a considerably longer mitigation duration: in these dynamical networks, the repopulation of nodes with infective individuals facilitates disease spreading, thereby diminishing control efficacy. we remark that if a drastically reduced diffusivity d′ ≪ 1 is implemented, the small-world results closely resemble those for a randomly connected contact network (figure 6a). moreover, we have observed an unexpected and drastic effective structural change in the scale-free network topology as a consequence of the epidemic outbreak infecting its susceptible nodes. naturally, the highly connected hubs are quickly affected, and, through transitioning to the recovered state, become neutralized in further spreading the disease. as shown in figure 5, as the infection sweeps through the network (in the absence of any lockdown mitigation), the distribution of the remaining active susceptible-infectious (si) links remarkably changes from the initial scale-free power law with exponent −1/2 to a more uniform, almost randomized network structure. the disease resurgence wave thus encounters a very different network topology than the original outbreak.
in this study, we implemented social distancing control measures for simple stochastic sir epidemic models on regular square lattices with diffusive spreading, two-dimensional newman-watts small-world networks that include highly infective long-distance connections, and static contact networks with either random connectivity or scale-free topology. in these distinct architectures, all disease spreading mitigation measures, be it through reduced mobility and/or curtailed connectivity, must of course be implemented at an early outbreak stage, but also maintained for a sufficient duration to be effective.
in figure 6, we compare salient features of the inevitable epidemic resurgence subsequent to the elimination of social distancing restrictions, namely the asymptotic fraction R_∞/N of recovered individuals, i.e., the integrated number of infected individuals, and the time τ(T) that elapses between the release and the peak of the second infection wave, both as functions of the mitigation duration T. we find that the latter grows exponentially with T on both dynamical lattice architectures, but only linearly on the static networks (figure 6b). furthermore, as one would expect, the mean-field rate equations pertaining to a fully connected system describe the randomly connected network very well. in stark contrast to the mean-field results (indicated by the purple lines in figure 6), the data for the lattice and network architectures reveal marked correlation effects that emerge at sufficiently long mitigation durations T. for T > 8/a in the static networks, and T > 12/a in the lattice structures, the count of remaining infectious individuals I becomes quite low; importantly, these are also concentrated in the vicinity of a few persisting infection centers. this leads to a steep drop in R_∞/N, the total fraction of ever infected individuals, by a factor of about 4 in the static networks, and 3 in the dynamic lattice architectures. thus, in these instances, follow-up disease control measures driven by high-fidelity testing and efficient contact tracing should be capable of effectively eradicating the few isolated disease resurgence centers. however, to reach these favorable configurations for the implementation of localized and targeted epidemic control, it is imperative to maintain the original social-distancing restrictions for at least a factor of three (better four) longer than it would have taken the unmitigated outbreak to reach its peak (T ≈ 3/a … 6/a in our simulations); for covid-19 that would correspond to about two months. as is evident from our results for two-dimensional small-world networks, which perhaps best represent human interactions, it is also absolutely crucial to severely limit all far-ranging links between groups to less than 5 % of the overall connections during the disease outbreak.
figure caption: the graphs compare the outbreak data obtained without any mitigation (grey) and with social distancing measures implemented for different durations T, as indicated. in all cases, social distancing is turned on once I(t) reaches the set threshold of 10 % of the total population N. the resurgent outbreak is drastically reduced in both its intensity and growth rate as social distancing is maintained for longer time periods T. (the data for each curve were averaged over 100 independent realizations; the shading indicates statistical error estimates.) inset: time τ to reach the second peak following the end of the mitigation; the data indicate an exponential increase of τ with T.
square lattices with diffusive spreading
on our regular square lattice with L² sites set on a two-dimensional torus, we implement the stochastic susceptible-infectious-recovered (sir) epidemic model with the following individual-based monte carlo algorithm:
1. randomly distribute N individuals on the lattice, subject to the restriction that each site may contain at most one individual, and with periodic boundary conditions. some small fraction of the individuals will initially be infectious, while the remainder of the population will be susceptible to the infection.
2. perform random sequential updates L² times in one monte carlo step (mcs) by picking a lattice site at random, and then performing the following actions:
a. if the selected site contains a susceptible individual S or a recovered individual R, a hopping direction is picked randomly. if the adjacent lattice site in the hopping direction is empty, then the chosen individual is moved to that neighboring site with hopping probability d, which is related to a macroscopic diffusion rate.
b. if the chosen lattice site contains an infectious individual I, it will first try to infect each susceptible nearest neighbor S with a prescribed infection probability r. if this attempt is successful, the involved susceptible neighbor S immediately changes its state to infected I. after the originally selected infected individual has repeated its infection attempts with all neighboring susceptibles S, it may reach the immune state R with recovery probability a. finally, this particular individual, whether still infectious or recovered, tries to hop in a randomly picked direction with probability d, provided the chosen adjacent lattice site is empty.
3. repeat the procedures in item 2 for a preselected total number of monte carlo steps.
to determine the effective (coarse-grained) basic epidemic reproduction ratio R_0, we fit the infection curves to straightforward numerical integrations of the deterministic sir rate equations dS(t)/dt = −r S(t) I(t)/N, dI(t)/dt = r S(t) I(t)/N − a I(t), dR(t)/dt = a I(t), and adjust the lattice simulation infection probability r ≈ 1.0 and, to a lesser extent, the recovery probability a to finally match the targeted covid-19 value R_0 ≈ 2.4. we note that this slightly `renormalized' value for a is subsequently utilized to set the time axis scale in the figures. on the mean-field level, initially R_0 = (r/a) S(0)/N, since all nodes are mutually connected. in spatial settings, S(0)/N is to be replaced with the mean connectivity (i.e., the coordination number for a regular lattice) to susceptible individuals. the lattice simulation data are fitted to the mean-field result by matching two parameters: the maximum value and the half-peak width of the infectious population curve I(t), see figure s1. the lattice simulation curve digresses from the mean-field curves at low I(t) values, far away from the peak region. in the lattice simulations, the initial rise of the infectious population curve exhibits power-law growth, I(t) ~ t^{1.4±0.1} and R(t) ~ t^{2.3±0.1}, in clear contrast with the simple exponential rise of the mean-field sir curve obtained from integrating the mean-field rate equations. we note that these are the standard critical exponents θ and 1 + θ for the temporal growth of an active seed cluster near a continuous non-equilibrium phase transition to an absorbing extinction state (11).
figure s2 shows the dependence of the asymptotic number of recovered individuals R_∞ on the density ρ for various sets of hopping rates d and initial infectious population values I(0). these data indicate the existence of a well-defined epidemic threshold, i.e., a percolation-like sharp transition from a state in which only a tiny fraction of individuals is infected to the epidemic state wherein the infection spreads over the entire population (11). as one would expect, this critical point depends only on the ratio a/d of the recovery and hopping rates. varying the lattice simulation parameters just shifts the location of the epidemic threshold. once the model parameters are set in the epidemic spreading regime, the system's qualitative behavior is thus generic and robust, and depends only weakly on the precise parameter settings.
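the calibration step described above can be illustrated with a short python sketch that integrates the deterministic sir rate equations and extracts the peak height and half-peak width of I(t); the trial value r = 0.36 below is simply the mean-field rate that yields R_0 ≈ 2.4 with a = 0.15, whereas the lattice simulations themselves require the larger, renormalized value r ≈ 1.0 quoted above.

```python
import numpy as np
from scipy.integrate import solve_ivp

def sir_rates(t, y, r, a, N):
    """deterministic sir rate equations dS/dt = -r S I / N, dI/dt = r S I / N - a I, dR/dt = a I."""
    S, I, R = y
    return [-r * S * I / N, r * S * I / N - a * I, a * I]

N, I0, a = 100_000, 100, 0.15      # population size, initial infectious count, recovery rate (per mcs)
r = 0.36                           # trial mean-field infection rate, so that R_0 = (r / a) S(0) / N ~ 2.4

sol = solve_ivp(sir_rates, (0, 400), [N - I0, I0, 0.0], args=(r, a, N),
                dense_output=True, max_step=0.5)
t = np.linspace(0, 400, 4001)
I = sol.sol(t)[1]

peak = I.max()                                    # peak height of the I(t) curve
above_half = t[I >= 0.5 * peak]
half_width = above_half[-1] - above_half[0]       # half-peak width used in the fit

print(f"R_0 = {r / a * (N - I0) / N:.2f}, peak I = {peak:.0f}, half-peak width = {half_width:.1f} mcs")
```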
for our two-dimensional small-world network, whose construction is schematically depicted in figure s3, we employ a similar monte carlo algorithm as described above; the essential difference is that individuals may now move to adjacent nearest-neighbor as well as to distant lattice sites along the pre-set `short-cut' links. figure s4a demonstrates (for fixed diffusivity d = 1) that, as a function of the fraction φ of long-distance links in a two-dimensional small-world network, the epidemic threshold resides quite close to zero: the presence of a mere few `short-cuts' in the lattice already implies substantial population mixing. the inset, where the φ axis is scaled logarithmically, indicates that sizeable outbreaks begin for φ ≥ 0.05. figure s4b similarly shows the outbreak dependence on the diffusion rate d (here for φ = 0.6), with the threshold for epidemic spreading observed at d ≈ 0.3. evidently, prevention of disease outbreaks in this architecture requires that both mobility and the presence of far-ranging connections be stringently curtailed.
for both the randomly connected and scale-free contact networks, we employ the gillespie or dynamical monte carlo algorithm, which allows for efficient numerical simulations of markovian stochastic processes. it consists of the following subsequent steps:
1. initially, a few nodes are assigned to be infected I, while all other nodes are set in the susceptible state S. each susceptible node S is characterized by a certain number of active links that are connected to infected nodes I.
2. we then determine the rate at which each infected node I will recover, and at which each susceptible node S with a non-zero number of active links becomes infected. from these we infer the total event rate r_tot.
3. based on this total rate r_tot, we select the waiting time until the next event occurs from an exponential distribution with mean 1/r_tot.
4. we then select any permissible event with a probability proportional to its rate, update the status of each node, and repeat these processes for the desired total number of iterations.
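a minimal python sketch of such a rejection-free gillespie scheme for sir dynamics on a static contact network is given below; the network size, the per-link infection rate beta, and the naive recomputation of the active s-i links at every event are simplifications chosen for clarity rather than efficiency, and do not reproduce the authors' implementation.

```python
import random
import networkx as nx

random.seed(5)

def gillespie_sir(G, beta, a, i0=10, t_max=200.0):
    """rejection-free (gillespie) sir dynamics on a static network G:
    infection at rate beta per active s-i link, recovery at rate a per infected node."""
    state = {v: "S" for v in G}
    infected = set(random.sample(list(G.nodes()), i0))
    for v in infected:
        state[v] = "I"
    t, history = 0.0, []
    while infected and t < t_max:
        si_links = [(i, s) for i in infected for s in G.neighbors(i) if state[s] == "S"]
        rate_inf, rate_rec = beta * len(si_links), a * len(infected)
        r_tot = rate_inf + rate_rec                  # total event rate
        t += random.expovariate(r_tot)               # waiting time, exponential with mean 1 / r_tot
        if random.random() < rate_inf / r_tot:       # next event: infection along a random s-i link
            _, s = random.choice(si_links)
            state[s] = "I"
            infected.add(s)
        else:                                        # next event: recovery of a random infected node
            v = random.choice(list(infected))
            state[v] = "R"
            infected.remove(v)
        history.append((t, len(infected)))
    return history

G = nx.gnm_random_graph(2_000, 20_000, seed=6)       # small random contact net with mean degree 20
trajectory = gillespie_sir(G, beta=0.02, a=0.15)
print("recorded events:", len(trajectory), "final infected:", trajectory[-1][1])
```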
(1) strategies for mitigating an influenza pandemic; (2) modeling targeted layered containment of an influenza pandemic in the united states; (3) forecasting covid-19 impact on hospital bed-days, icu-days, ventilator-days and deaths by us state in the next 4 months; (4) impact of non-pharmaceutical interventions (npis) to reduce covid-19 mortality and healthcare demand; (5) evaluating the impact of international airline suspensions on the early global spread of covid-19; (6) the hidden geometry of complex, network-driven contagion phenomena; (7) a lattice model for influenza spreading; (8) networks and epidemic models; (9) a contribution to the mathematical theory of epidemics; (10) mathematical biology, vols. i + ii; (11) critical dynamics - a field theory approach to equilibrium and non-equilibrium scaling behavior; (12) chemical kinetics: beyond the textbook; (13) impact of non-pharmaceutical interventions (npis) to reduce covid-19 mortality and healthcare demand; (14) generalized logistic growth modeling of the covid-19 outbreak in 29 provinces in china and in the rest of the world; (15) controlling epidemic spread by social distancing: do it well or not at all; (16) statistical mechanics of driven diffusive systems; (17) nonequilibrium phase transitions in lattice models; (18) scaling and percolation in the small-world network model; (19) percolation and epidemics in a two-dimensional small world; (20) collective dynamics of 'small-world' networks; (21) topology of evolving networks: local events and universality; (22) reasoning about a highly connected world; (23) temporal gillespie algorithm: fast simulation of contagion processes on time-varying networks.
the full simulation movie files are available online. research was sponsored by the u.s. army research office and was accomplished under grant no. w911nf-17-1-0156. the views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the army research office or the u.s. government. the u.s. government is authorized to reproduce and distribute reprints for government purposes notwithstanding any copyright notation herein. s.d. gratefully acknowledges a fellowship from the china scholarship council, grant no. csc 201806770029.
key: cord-288342-i37v602u authors: wang, zhen; andrews, michael a.; wu, zhi-xi; wang, lin; bauch, chris t. title: coupled disease–behavior dynamics on complex networks: a review date: 2015-07-08 journal: phys life rev doi: 10.1016/j.plrev.2015.07.006 sha: doc_id: 288342 cord_uid: i37v602u it is increasingly recognized that a key component of successful infection control efforts is understanding the complex, two-way interaction between disease dynamics and human behavioral and social dynamics. human behavior such as contact precautions and social distancing clearly influences disease prevalence, but disease prevalence can in turn alter human behavior, forming a coupled, nonlinear system. moreover, in many cases, the spatial structure of the population cannot be ignored, such that social and behavioral processes and/or transmission of infection must be represented with complex networks. research on coupled disease–behavior dynamics in complex networks in particular is growing rapidly, and frequently makes use of analysis methods and concepts from statistical physics. here, we review some of the growing literature in this area. we contrast network-based approaches to homogeneous-mixing approaches, point out how their predictions differ, describe the rich and often surprising behavior of disease–behavior dynamics on complex networks and compare them to processes in statistical physics. we discuss how these models can capture the dynamics that characterize many real-world scenarios, thereby suggesting ways that policy makers can better design effective prevention strategies. we also describe the growing sources of digital data that are facilitating research in this area. finally, we suggest pitfalls which might be faced by researchers in the field, and we suggest several ways in which the field could move forward in the coming years. infectious diseases have long caused enormous morbidity and mortality in human populations.
one of the most devastating examples is the black death, which killed 75 to 200 million people in the medieval period [1]. currently, the rapid spread of infectious diseases still imposes a considerable burden [2]. to elucidate the transmission processes of infectious diseases, mathematical modeling has become a fruitful framework [3]. in the classical modeling framework, a homogeneously mixed population can be classified into several compartments according to disease status. in particular, the most common compartments are those that contain susceptible individuals (s), infectious (or infected) individuals (i), and recovered (and immune) individuals (r). using these states, systems of ordinary differential equations (odes) can be created to capture the evolution of diseases with different natural histories. for example, a disease with no immunity, where susceptible individuals who become infected return to the susceptible class after recovering (sis natural history, see fig. 1), can be described by
d[s]/dt = −β[s][i] + μ[i], d[i]/dt = β[s][i] − μ[i],
where [s] ([i]) represents the number of susceptible (infectious) individuals in the population, β is the transmission rate of the disease, and μ is the recovery rate of infected individuals. some diseases, however, may give immunity to individuals who have recovered from infection (sir natural history, see fig. 1), leading to
d[s]/dt = −β[s][i], d[i]/dt = β[s][i] − μ[i], d[r]/dt = μ[i],
where [r] is the number of recovered (and immune) individuals. in these ode models, a general measure of disease severity is the basic reproductive number r_0 = βn/μ, where n is the population size. in simple terms, r_0 is the mean number of secondary infections caused by a single infectious individual, during its entire infectious period, in an otherwise susceptible population [4]. if r_0 < 1, the disease will not survive in the population. however, if r_0 > 1, the disease may be able to persist. typically, parameters like the transmission rate and recovery rate are treated as fixed. however, new approaches to modeling have been developed in the past few decades to address some of the limitations of the classic differential equation framework that stem from its simplifying assumptions. for instance, the impact of behavioral changes in response to an epidemic is usually ignored in these formulations (e.g., the transmission rate is fixed), but in reality, individuals usually change their behavior during an outbreak according to the change in perceived infection risk, and their behavioral decisions can in turn impact the transmission of infection. another limitation of the classical compartmental models is the assumption of well-mixed populations (namely, that individuals interact with all others at the same contact rate), which thus neglects the heterogeneous spatial contact patterns that can arise in realistic populations. in this review we will describe how models of the past few decades have begun to address these limitations of the classic framework. traditionally, infectious disease models have treated human behavior as a fixed phenomenon that does not respond to disease dynamics or any other natural dynamics. for many research questions, this is a useful and acceptable simplification. however, in other cases, human behavior responds to disease dynamics, and in turn disease dynamics responds to human behavior. for example, the initiation of an epidemic may cause a flood of awareness in the population such that protective measures are adopted. this, in turn, reduces the transmission of the disease.
in such cases, it becomes possible to speak of a single, coupled "disease-behavior" system where a human subsystem and a disease transmission subsystem are coupled to one another (see fig. 2). fig. 2 schematically illustrates disease-behavior interactions as a negative feedback loop: the loop from disease dynamics to behavioral dynamics is positive (+), since an increase in disease prevalence will cause an increase in perceived risk and thus an increase in protective behaviors; the loop from behavioral dynamics back to disease dynamics is negative (−), since an increase in protective behaviors such as contact precautions and social distancing will generally suppress disease prevalence. moreover, because the human and natural subsystems are themselves typically nonlinear, the coupled system is therefore also typically nonlinear. this means that phenomena can emerge that cannot be predicted by considering each subsystem in isolation. for example, protective behavior on the part of humans may ebb and flow according to disease incidence and according to a characteristic timescale (as opposed to being constant over time, as would occur in the uncoupled subsystems). to explore strategic interactions between individual behaviors, game theory has become a key tool across many disciplines. it provides a unified framework for decision-making, where the participating players in a conflict must make strategy choices that potentially affect the interests of other players. game theory and its corresponding equilibrium concepts, such as the nash equilibrium, emerged in seminal works from the 1940s and 1950s [5, 6]. a nash equilibrium is a set of strategies such that no player has an incentive to unilaterally deviate from their present strategy. that is, the nash equilibrium makes strategies form best responses to one another, since every player, who has a consistent goal to maximize his own benefit or utility, is perfectly rational. game theory has been applied to fields such as economics, biology, mathematics, public health, ecology, traffic engineering, and computer science [7-12]. for example, in voluntary vaccination programs, the formal theory of games can be employed as a framework to analyze the vaccination equilibrium level in populations [9, 13, 14]. in the context of vaccination, the feedback between individual decisions about vaccination (or other prevention behaviors) and disease spreading is captured, hence these systems exemplify coupled disease-behavior systems. in spite of the great progress of game theory, the classical paradigm still shows its limitations in many scenarios. it thus becomes instructive to relax some key assumptions, such as by introducing bounded rationality. game theory has been extended into evolutionary biology, which has generated great insight into the evolution of strategies [15-19] under both biological and cultural evolution. for instance, the replicator equation, which consists of sets of differential equations describing how the strategies of a population evolve over time under selective pressures, has also been used to study learning in various scenarios [20]. beyond these temporal concepts, spatial interaction topology has also proved to be crucial in determining system equilibria (see also refs. [16, 17] for a comprehensive overview). evolutionary game theory has been extensively applied to behavioral epidemiology, whose details will be surveyed in the following sections.
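as a small worked example of the replicator framework in a vaccination setting, the toy python sketch below iterates the replicator equation for the vaccinated fraction x; the payoff structure and the herd-immunity-style risk function are illustrative assumptions and are not taken from any of the cited studies.

```python
def replicator_step(x, dt, c_v, risk):
    """one euler step of the replicator equation dx/dt = x (1 - x) (pi_v - pi_n),
    where x is the vaccinated fraction and payoffs are written as negative costs."""
    pi_v = -c_v          # vaccinators always pay the (relative) vaccination cost
    pi_n = -risk(x)      # non-vaccinators pay the infection cost (set to 1) times their infection risk
    return x + dt * x * (1.0 - x) * (pi_v - pi_n)

# assumed toy risk function: infection risk falls linearly with coverage, vanishing at 80 % coverage
risk = lambda x: max(0.0, 1.0 - x / 0.8)

x, dt = 0.05, 0.01
for _ in range(200_000):
    x = replicator_step(x, dt, c_v=0.3, risk=risk)

# the dynamics settle where pi_v = pi_n, i.e. risk(x*) = c_v, giving x* = 0.8 * (1 - 0.3) = 0.56
print(f"equilibrium vaccine coverage ~ {x:.2f}")
```

the fixed point illustrates the familiar free-riding outcome of such vaccination games: voluntary coverage settles at the point where the perceived infection risk just balances the vaccination cost, below the level that would remove infection risk entirely.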
several methods from statistical physics have become useful in the study of disease-behavior interactions on complex networks. most populations are spatially structured in the sense that individuals preferentially interact with those who share close geographic proximity. perhaps the simplest population structure is a regular lattice: all the agents are assigned specific locations on it, normally a two-dimensional square lattice, just like atoms at crystal lattice sites, and interact only with their nearest neighbors. in a regular lattice population, each individual meets the same people they interact with regularly, rather than being randomly reshuffled into a homogeneous mixture, as in well-mixed population models. in addition, another type of homogeneous network attracting great research interest is the erdös-rényi (er) graph [21], which is a graph where nodes are linked up randomly and which is often used in the rigorous analysis of graphs and networks. however, in reality, there is ubiquitous heterogeneity in the number of contacts per individual, and recent studies have shown that the distribution of contact numbers in some social networks is not homogeneous but appears to follow a power law [22]. moreover, social contact networks also display small-world properties (i.e., a short average path length between any two individuals and a strong local clustering tendency), which cannot be well described by regular lattices or random graphs [23]. with both motivations, two significant milestones were born in the late 90s: the theoretical models of small-world (sw) networks and scale-free (sf) networks [24, 25]. subsequently, more properties of social networks have been extensively investigated, such as community structure (a kind of assortative structure where individuals are divided into groups such that the members within each group are mostly connected with each other) [26], clusters [27], and the recent proposals of multilayer as well as time-varying frameworks [28-32]. due to the broad applicability of complex networks, network models have been widely employed in epidemiology to study the spread of infectious diseases [27]. in networks, a vertex represents an individual and an edge between two vertices represents a contact over which disease transmission may occur. an epidemic spreads through the network from infected to susceptible vertices. with the advent of various network algorithms, it becomes instructive to incorporate disease dynamics into such infrastructures to explore the impact of spatial contact patterns [33-38]. replacing the homogeneous mixing hypothesis that any individual can come into contact with any other agent, networked epidemic research assumes that each individual has a comparable number of contacts, denoted by its degree k. under this treatment, the most significant physics finding is that network topology directly determines the threshold of the epidemic outbreak and the associated phase transition. for example, in contrast with the finite epidemic threshold of random networks, pastor-satorras et al. found that a disease with sis dynamics and even a very small transmission rate can spread and persist in sf networks (i.e., there is an absence of a disease threshold) [39]. this point helps to explain why it is extremely difficult to eradicate viruses on the internet and the world wide web, and why those viruses have an unusually long lifetime.
but the absence of an epidemic threshold only holds for sf networks with a power-law degree distribution p(k) ∼ k^−γ with γ ∈ (2, 3]. if γ is extended to the range (3, 4), anomalous critical behavior takes place [39, 40]. to state the condition for disease spread, it is meaningful to define the relative spreading rate λ ≡ β/μ. the larger λ is, the more likely the disease will spread. generally, for a network with arbitrary degree distribution, the epidemic threshold is λ_c = ⟨k⟩/⟨k²⟩. in particular, for an sf network ⟨k²⟩ diverges in the n → ∞ limit, and so the epidemic threshold is expected to vanish. similarly, it is easy to derive the threshold of the sir model, which is likewise related to the average degree ⟨k⟩ and the second moment ⟨k²⟩ of the network. along these lines, further efforts have been devoted to the epidemic threshold of spatial networks with various properties, such as degree correlation [41, 42], sw topology [23], community structure [43], and k-core structure [44]. on the other hand, more analysis and prediction methods (such as the mean-field method and generating functions) have also been proposed to explain the transition of disease on realistic networks [27, 45], and immunization strategies for spatial networks have been largely identified [46]. to illustrate the meaning of studying disease-behavior dynamics on complex networks, it is instructive to first describe a simple example of such a system. consider a population of individuals who are aware of a spreading epidemic. the information each individual receives regarding the disease status of others is derived from the underlying social network of the population. these networks have been shown to display heterogeneous contact patterns, where the node degree distribution often follows a power law [47, 48]. it is possible to use these complex network patterns to model a realistic population that exhibits adaptive self-protective behavior in the presence of a disease. a common way to incorporate this self-protective behavior is to allow individuals to lower their susceptibility according to the proportion of their contacts that are infectious, as demonstrated by bagnoli et al. [49]. in this model, the authors reduce the susceptibility of an individual to a disease with a simple sis natural history by multiplying the transmission rate by a negative exponential function of the proportion of their neighbors who are infectious. specifically, the susceptibility is given by β i(ψ, k), where β is the per-contact transmission probability and i(ψ, k) is a negative exponential function of the infected fraction ψ/k that models the effect an individual's risk perception has on its susceptibility; here j and τ are constants that govern the level of precaution individuals take, ψ is the number of infectious contacts an individual has, and k is the total number of contacts an individual has. the authors show that the introduction of adaptive behavior has the potential not only to reduce the probability of new infections occurring in highly disease-concentrated areas, but can also cause epidemics to go extinct. specifically, when τ = 1, there is a value of j for which an epidemic can be stopped in regular lattices and sw networks [25]. however, for certain sf networks, there is no value of j that is able to stop the disease from spreading. in order to achieve disease extinction in these networks, hub nodes must adopt additional self-protective measures, which is accomplished by decreasing τ for these individuals.
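a rough python sketch of this kind of prevalence-responsive susceptibility reduction is given below; the particular perception factor exp(-j (ψ/k)^τ), the per-step infection probability, and the small watts-strogatz substrate are illustrative assumptions and may differ in detail from the model of ref. [49].

```python
import math
import random
import networkx as nx

random.seed(7)

def infection_probability(beta, psi, k, J=2.0, tau=1.0):
    """per-step infection probability of a susceptible node with psi infected neighbors
    out of k contacts, damped by an assumed perception factor exp(-J (psi / k) ** tau)."""
    return beta * (psi / k) * math.exp(-J * (psi / k) ** tau) if k else 0.0

def sis_step(G, state, beta, mu):
    """one synchronous sis update with risk-perception-reduced susceptibility."""
    new_state = dict(state)
    for v in G:
        if state[v] == "S":
            k = G.degree(v)
            psi = sum(1 for u in G.neighbors(v) if state[u] == "I")
            if psi and random.random() < infection_probability(beta, psi, k):
                new_state[v] = "I"
        elif random.random() < mu:                 # infected node recovers back to susceptible
            new_state[v] = "S"
    return new_state

G = nx.watts_strogatz_graph(2_000, 6, 0.05, seed=8)   # a small-world substrate for illustration
state = {v: "I" if random.random() < 0.01 else "S" for v in G}
for _ in range(300):
    state = sis_step(G, state, beta=0.9, mu=0.2)
print("infected fraction after 300 steps:", sum(s == "I" for s in state.values()) / len(state))
```

raising the assumed precaution strength j in this sketch plays the same qualitative role as in the model discussed above: beyond some value, the effective transmission falls low enough that the infection dies out on sufficiently homogeneous structures.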
the conclusions derived from this model highlight the significant impact different types of complex networks can have on health outcomes in a population, and how behavioral changes can dictate the course of an epidemic. the remainder of this review is organized as follows. in section 2, we will focus on the disease-behavior dynamics of homogeneously mixed populations, and discuss when the homogeneous mixing approximation is or is not valid. this provides a comprehensive prologue to the overview of the coupled systems on networks in section 3. within the latter, we separately review dynamics in different types of networked populations, which are frequently viewed through the lens of physical phenomena (such as phase transitions and pattern formation) and analyzed with physics-based methods (like monte carlo simulation and mean-field prediction). based on all these achievements, we can capture how coupled disease-behavior dynamics affects disease transmission and spatial contact patterns. section 4 will be devoted to empirical concerns, such as the types of data that can be used for these study systems, and how questionnaires and digital equipment can be used to collect data on relevant social and contact networks. in addition, it is meaningful to examine whether some social behaviors predicted by models really exist in vaccination experiments and surveys. finally, we will conclude with a summary and an outlook in section 5, describing the implications of the statistical physics of spatial disease-behavior dynamics and outlining viable directions for future research. throughout, we will generally focus on preventive measures other than vaccination (such as social distancing and hand washing), although we will also touch upon vaccination in a few places. a large body of literature addresses disease-behavior dynamics in populations that are assumed to be mixing homogeneously, and thus spatial structure can be neglected. incorporating adaptive behavior into a model of disease spread can provide important insight into population health outcomes, as the activation of social distancing and other non-pharmaceutical interventions (npis) has been observed to have the ability to alter the course of an epidemic [50-52].
(table 1: disease-behavior models applied to well-mixed populations, classified by infection type and whether economic-based or rule-based.)
when making decisions regarding self-protection from an infection, individuals must gather information relevant to the disease status of others in the population. prophylactic behavior can be driven by disease prevalence, imitation of others around them, or personal beliefs about probable health outcomes. in this section, we will survey the features and results of mathematical models that incorporate prophylactic decision-making behavior in homogeneously mixed populations. the approaches we consider can be classified into two separate categories: economic-based and rule-based. economic-based models (such as game-theoretical models) assume individuals seek to maximize their social utility, whereas rule-based models prescribe prevalence-based rules (not explicitly based on utility) according to which individuals and populations behave. both of these methods can also be used to study the dynamics of similar diseases (see table 1), and are discussed in detail below.
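before turning to specific studies, a compact python sketch of the rule-based flavor of model is shown below: the transmission rate is scaled down by a simple prevalence-dependent factor, which is a generic phenomenological stand-in rather than the specific mechanism of any one of the models reviewed in this section.

```python
from scipy.integrate import solve_ivp

def sir_with_awareness(t, y, beta, mu, alpha, N):
    """sir rate equations in which the effective transmission rate is reduced by a
    prevalence-dependent factor 1 / (1 + alpha * I / N); a generic behavioral rule."""
    S, I, R = y
    new_inf = beta / (1.0 + alpha * I / N) * S * I / N
    return [-new_inf, new_inf - mu * I, mu * I]

N, I0 = 1_000_000, 100
for alpha in (0.0, 50.0, 500.0):                  # larger alpha = stronger behavioral response
    sol = solve_ivp(sir_with_awareness, (0, 365), [N - I0, I0, 0.0],
                    args=(0.4, 0.1, alpha, N), max_step=0.5)
    print(f"alpha = {alpha:>5}: peak prevalence {sol.y[1].max() / N:.2%}, "
          f"final size {sol.y[2][-1] / N:.2%}")
```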
the discovery of human immunodeficiency virus (hiv)/acquired immune deficiency syndrome (aids) and its large economic impacts stimulated research into behaviorally based mathematical models of sexually transmitted diseases (stds). in disease-behavior models, a population often initiates a behavior change in response to an increasing prevalence of a disease. in the context of stds, this change in behavior may include safer sex practices, or a reduction in the number of partnerships individuals seek out. following this prevalence-based decision-making principle, researchers have used the concept of utility maximization to study the behavior dynamics of a population [53-57]. in these models, individuals seek to maximize their utility by solving dynamic optimization problems. utility is derived by members of the population when engaging in increased levels of social contact. however, this increased contact or partner change rate also increases the chance of becoming infected. one consequence of this dynamic is that higher levels of prevalence can result in increased prophylactic behavior, which in turn decreases the prevalence over time. as this occurs, self-protective measures used by the population will also fall, which may cause disease cycles [53, 56]. nonetheless, in the case of stds which share similar transmission pathways, a population protecting itself from one disease by reducing contact rates can also indirectly protect itself from another disease simultaneously [53]. in general, the lowering of contact rates in response to an epidemic can reduce its size, and also delay new infections [57]. however, this observed reduction of contact rates may not be uniform across the whole population. for example, an increase in prevalence may cause the activity rates of those with already low social interaction to fall even further, but this effect may not hold true for those with high activity rates [54]. in fact, the high-risk members of the population will gain a larger fraction of high-risk partners in this scenario, resulting from the low-risk members reducing their social interaction rates. this dynamic serves to increase the risk of infection of high-activity individuals even further. these utility-based economic models show us that when considering health outcomes, one must be acutely aware of the welfare costs associated with self-protective behavior or implementing disease mitigation policies [56]. a health policy, such as encouraging infectious individuals to self-quarantine, may actually cause a rise in disease prevalence due to susceptible individuals feeling less threatened by infection and subsequently abandoning their own self-protective behavior [56]. also, a population that is given a pessimistic outlook on an epidemic may in fact cause the disease to spread more rapidly [55]. recently, approaches using game theory have been applied to self-protective behavior and social distancing [58-60]. when an individual's risk of becoming infected only depends on their personal investment into social distancing, prophylactic behavior is not initiated until after an epidemic begins, and ceases before an epidemic ends. also, the basic reproductive number of a disease must exceed a certain threshold for individuals to feel self-protective behavior is worth the effort [58].
in scenarios where the contact rate of the population increases with the number of people out in public, a nash equilibrium exists, but the level of self-protective behavior in it is not socially optimal [59]. nonetheless, these models also show that the activation of social distancing can weaken an epidemic. some models of disease-behavior dynamics, rather than assuming humans are attempting to optimize a utility function, represent human behavior by specifying rules that humans follow under certain conditions. these could include both phenomenological rules describing responses to changes in prevalence, and more complex psychological mechanisms. rule-based compartmental models using systems of differential equations have also been used to study heterogeneous behavior and the use of npis by a population during an epidemic. a wide range of diseases are modeled using this approach, such as hiv [61-63], severe acute respiratory syndrome (sars) [64, 65], or influenza [66, 63]. these models often utilize additional compartments, which are populated according to specific rules. examples of such rules are to construct the compartments to hold a constant number of individuals associated with certain contact rates [61, 62, 67], or to add and remove individuals at a constant rate [64, 65, 63, 68], at a rate depending on prevalence [69-74], or according to a framework where behavior that is more successful is imitated by others [75, 66, 76]. extra compartments signify behavioral heterogeneities amongst members of a population, and the disease transmission rates associated with them also vary. reduction in transmission due to adaptive behavior is either modeled as a quarantine of cases [64, 65, 63], or as prophylactic behavior of susceptible individuals due to increased awareness of the disease [75, 66, 69-74, 77]. these models agree that early activation of isolation measures and self-protective behavior can weaken an epidemic. however, due to an early decrease in new infections, populations may see a subsequent decrease in npi use, causing multiple waves of infection [69, 75, 76, 71]. contrasting opinions on the impact behavioral changes have on the epidemic threshold also result from these models. for example, perra et al. [71] show that although infection size is reduced, prophylactic behavior does not alter the epidemic threshold. however, the models studied by poletti et al. [75] and sahneh et al. [70] show that the epidemic threshold can be altered by behavioral changes in a population. the classes of models presented in this section use homogeneous mixing patterns (i.e., well-mixed populations) to study the effects of adaptive behavior in response to epidemics and disease spread (see table 1 for a summary). often, populations are modeled as altering their behavior in reaction to changes in disease prevalence, or as optimizing their choices with respect to personal health outcomes. if possible, early activation of prophylactic behavior and npis by a population will be the most effective course of action to curb an epidemic. homogeneous mixing can be an appropriate approximation for the spread of an epidemic when the disease to be modeled is easily transmitted, such as measles and other infections that can be spread by fine aerosol particles that remain suspended for a long period. however, this mixing assumption does not always reflect real disease dynamics.
for example, human sexual contact patterns are believed to be heterogeneous [48] and can be represented as networks (or graphs), while other infections, such as sars, can only be spread by large droplets, making the homogeneous mixing assumption less valid. the literature surrounding epidemic models that address this limitation by incorporating heterogeneous contact patterns through networks is very rich, and is discussed in the following section. in section 2, we reviewed disease-behavior dynamics in well-mixed populations. however, in real populations, various types of complex networks are ubiquitous and their dynamics have been well studied. the transmission of many infectious diseases requires direct or close contact between individuals, suggesting that complex networks play a vital role in the diffusion of disease. it thus becomes of particular significance to review the development of behavioral epidemiology in networked populations. many of the dynamics exhibited by such systems have direct analogues to processes in statistical physics, such as how disease or behavior percolates through the network, or how a population can undergo a phase transition from one social state to another. perhaps the easiest way to begin studying disease-behavior dynamics in spatially distributed populations is by using lattices and static networks, which are relatively easy to analyze and which have attracted much attention in theoretical and empirical research. we organize the research by several themes under which it has been conducted, such as the role of spreading awareness, social distancing as protection, and the role of imitation, although we emphasize that the distinctions are not always "hard and fast". the role of individual awareness. the awareness of disease outbreaks may stimulate humans to change their behavior, for example by washing hands and wearing masks. such behavioral responses can reduce susceptibility to infection, which in turn can influence the epidemic course. in seminal work, funk and coworkers [78] formulated and analyzed a mathematical model for the spread of awareness in well-mixed and spatially structured populations to understand how the awareness of a disease, and its propagation, impact the spatial spread of the disease. in their model, both the disease and the information about it spread spontaneously in the population, by contact and by word of mouth, respectively. the classical epidemiological sir model is used for epidemic spreading, and the information dynamics is governed by both information transmission and information fading. the immediate consequence of awareness of the disease information is a decrease in the probability of acquiring the infection when a susceptible individual (who was aware of the epidemic) comes into contact with an infected one. in a well-mixed population, the authors found that the coupled spreading dynamics of the epidemic and the awareness of it can result in a smaller outbreak size, yet it does not affect the epidemic threshold. however, in a population located on a triangular lattice, the behavioral response can completely stop a disease from spreading, provided the infection rate is below a threshold. specifically, the authors showed that the impact of locally spreading awareness is amplified if the social network of potential infection events and the communication network over which individuals communicate overlap, especially so if the networks have a high level of clustering.
the finding that spatial structure can prevent an epidemic is echoed in an earlier model where the effects of awareness are limited to the immediate neighbors of infected nodes on a network [79]. in that model, individuals choose whether to accept ring vaccination depending on the perceived disease risk due to infected neighbors. by exploring a range of network structures, from the limit of homogeneous mixing to the limit of a static, random network with small neighborhood size, the authors show that it is easier to eradicate infections in spatially structured populations than in homogeneously mixing populations [79]. hence, free-riding on vaccine-generated herd immunity may be less of a problem for infectious diseases spreading in spatially structured populations, as would more closely describe the situation for close-contact infections. along similar lines of research, wu et al. explored the impact of three forms of awareness on epidemic spreading in a finite sf networked population [80]: contact awareness that increases with individual contact number; local awareness that increases with the fraction of infected contacts; and global awareness that increases with the overall disease prevalence. they found that global awareness cannot decrease the likelihood of an epidemic outbreak, while both local awareness and contact awareness can. generally, individual awareness of an epidemic contributes toward the inhibition of its transmission. the universality of such conclusions (i.e., that individual behavioral responses suppress epidemic spreading) is also supported by a recent model [81], in which the authors focused on an epidemic response model where individuals respond to the epidemic according to the number of infected neighbors in their local neighborhood, rather than the density of infected nodes. mathematically, the local behavioral response is cast into the reduction factor (1 − θ)^ψ in the contact rate of a susceptible node, where ψ is the number of infected neighbors and θ < 1 is a parameter characterizing the response strength of the individuals to the epidemic. by studying both sis and sir epidemiological models with this behavioral response rule in sf networks, they found that individual behavioral responses can in general suppress epidemic spreading, owing to the crucial role played by the hub nodes, which are more likely to adopt a protective response and thereby block the disease spreading paths. in a somewhat different framework, how the diffusion of individuals' crisis awareness affects epidemic spreading is investigated in ref. [82]. in this work, the epidemiological sir model is linked with an information transmission process whose diffusion dynamics is characterized by two parameters, namely the information creation rate ζ and the information sensitivity η. in particular, at each time step, ζn packets are generated and transferred in the network according to the shortest-path routing algorithm (where n denotes the network size). when a packet is routed by an infected individual, its state is marked as infected. each individual determines whether or not to accept the vaccine based on how many infected packets are received from immediate neighbors, and on how sensitive the individual's response to this information is, weighed by the parameter η.
the authors considered their "sir with information-driven vaccination" model on homogeneous er networks and heterogeneous sf networks, and found that epidemic spreading can be significantly suppressed in both cases provided that both ζ and η are relatively large. social distancing as a protection mechanism. infectious disease outbreaks may trigger various behavioral responses by individuals seeking to take preventive measures, one of which is social distancing. valdez and coworkers investigated the efficiency of social distancing in altering the epidemic dynamics and the disease transmission process on er networks, sf networks, and realistic social networks [83]. in their model, rather than the commonly used link-rewiring process, an intermittent social distancing strategy is adopted to disturb the epidemic spreading process. in particular, based on local information, a susceptible individual is allowed to interrupt the contact with an infected individual with probability σ and restore it after a fixed time t_b, so that the underlying interaction network of the individuals remains unchanged. using the framework of percolation theory, the authors found that there exists a cutoff threshold σ_c, whose value depends on the network topology (i.e., the extent of heterogeneity of the degree distribution), beyond which the epidemic phase disappears. the efficiency of the intermittent social distancing strategy in stopping the spread of disease is owed to an emergent "susceptible herd behavior" in the population that protects a large fraction of susceptible individuals. impact of behavior imitation on vaccination coverage. vaccination is widely employed as an infection control measure. to explore the role of individual imitation behavior and population structure in vaccination, recent seminal work integrated an epidemiological process into a simple agent-based model of adaptive learning, where individuals use anecdotal evidence to estimate the costs and benefits of vaccination [85]. under such a model, the disease-behavior dynamics is modeled as a two-stage process. the first stage is a public vaccination campaign, which occurs before any epidemic spreading. at this stage, each individual decides whether or not to vaccinate, and taking the vaccine incurs a cost c_v to the vaccinated individual. the vaccine is risk-free and offers perfect protection against infection. the second stage is the disease transmission process, where the classic sir compartmental model is adopted. during the epidemic, susceptible individuals who catch the disease incur an infection cost c_i, which is usually assumed to be larger than the vaccination cost c_v. unvaccinated individuals who remain healthy free-ride on the vaccination efforts of others (i.e., they incur no cost), being indirectly protected by herd immunity. for simplicity, the authors rescale these costs by setting c_i = 1 and defining the relative cost of vaccination c = c_v/c_i (0 < c < 1). after each epidemic season, every individual receives a payoff (equal to the negative of the corresponding cost) that depends on their vaccination strategy and on whether they were infected, and is then allowed to change or keep their strategy for the next season, depending on the current payoffs. the rule of thumb is that the strategy of a role model with a higher payoff is more likely to be imitated.
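restating the cost structure just described in one formula (with c = c_v/c_i and c_i set to 1), the season payoff of individual i can be written as:

    % payoff of individual $i$ after one epidemic season, $c = c_V/c_I$, $c_I \equiv 1$
    P_i =
    \begin{cases}
      -c, & \text{$i$ vaccinated (never infected)},\\
      -1, & \text{$i$ unvaccinated and infected},\\
      \;\;0, & \text{$i$ unvaccinated and not infected (free rider)}.
    \end{cases}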
to this end, each individual i randomly chooses another individual j from its neighborhood as a role model, and imitates the behavior of j with probability w(s_i ← s_j) = 1/{1 + exp[−β(p_j − p_i)]}, where p_i and p_j are, respectively, the payoffs of the two involved individuals, and β (0 < β < ∞) denotes the strength of selection. this form of decision rule is also known as the fermi law [16,86] in physics. a finite value of β accounts for the fact that better-performing individuals are readily imitated, although imitating an agent that performs worse is not impossible, for example due to imperfect information or errors in decision making. the authors studied their coupled "disease-behavior" model in well-mixed populations, in square-lattice populations, in random-network populations, and in sf-network populations, and found that population structure acts as a "double-edged sword" for public health: it can promote high levels of voluntary vaccination and herd immunity provided that the cost of vaccination is not too large, but small increases in the cost beyond a certain threshold cause vaccination to plummet, and infections to rise, more dramatically than in well-mixed populations. this research provides an example of how spatial structure does not always improve the chances of infection control in disease-behavior systems. (fig. 3 caption: symbols and lines correspond, respectively, to simulation results and mean-field predictions, whose analytical framework is shown in appendix a; the parameter α determines how seriously peer pressure is weighted in individuals' decisions to take the vaccine. the figure is reproduced from [84].) in a similar vein, peer pressure within the population was considered in order to clarify its impact on the decision-making process of vaccination, and thus on disease spreading [84]. in reality, whether or not to change behavior depends not only on the personal success of each individual, but also on the success and/or behavior of others. using this as motivation, the authors incorporated the impact of peer pressure into a susceptible-vaccinated-infected-recovered (svir) epidemiological model, where the propensity to adopt a particular vaccination strategy depends both on individual success and on the strategy configuration of the neighbors. specifically, the behavior imitation probability of individual i towards its immediate neighbor j (namely, eq. (6)) is modified so that, in addition to the payoff difference, it depends on the ratio n_i/k_i weighted by the parameter α, where n_i is the number of neighbors holding a vaccination strategy different from that of individual i, k_i is the degree of i, and α determines how seriously peer pressure is taken into account. under such a scenario, fig. 3 displays how vaccination and infection vary as a function of vaccine cost on an er random graph. it is clear that plugging in peer pressure also works as a "double-edged sword": on the one hand, it strongly promotes vaccine uptake in the population when the cost is below a critical value, but, on the other hand, it may also strongly impede uptake if the critical value is exceeded. the reason is that the presence of peer pressure facilitates cluster formation among individuals, whose behaviors tend to conform to the majority of their neighbors, similar to an earlier report on cooperative behavior [88]. such behavioral conformity is found to expedite the spread of disease when the relative cost of vaccination is high enough, while promoting vaccine coverage in the opposite case.
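a sketch of one imitation sweep under the fermi rule of eq. (6) follows. the optional peer-pressure weighting reflects only the qualitative description of [84] (a term that grows with n_i/k_i, weighted by α); its exact functional form here is an assumption made for illustration and is not the published formula.

    import math, random

    def fermi_prob(p_i, p_j, beta):
        """eq. (6): probability that i copies the strategy of role model j."""
        return 1.0 / (1.0 + math.exp(-beta * (p_j - p_i)))

    def imitation_update(strategy, payoff, neighbors_of, beta, alpha=0.0):
        """one synchronous imitation sweep. strategy[i] in {'V', 'N'}.
        if alpha > 0, an assumed peer-pressure weight (not the form of [84])
        increases the copying probability with the fraction n_i / k_i of
        neighbours holding a different strategy."""
        new_strategy = dict(strategy)
        for i, nbrs in neighbors_of.items():
            if not nbrs:
                continue
            j = random.choice(nbrs)
            w = fermi_prob(payoff[i], payoff[j], beta)
            if alpha > 0.0:
                n_i = sum(1 for k in nbrs if strategy[k] != strategy[i])
                w = min(1.0, w * math.exp(alpha * n_i / len(nbrs)))  # assumed weighting
            if random.random() < w:
                new_strategy[i] = strategy[j]
        return new_strategy

    # example: a non-vaccinator (payoff -1) facing a vaccinated role model (payoff -0.1)
    print(fermi_prob(-1.0, -0.1, beta=1.0))   # about 0.71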
self-motivated strategies related to vaccination. generally, it is not so much the actual risk of being infected as the perceived risk of infection that prompts humans to change their vaccination behavior. previous game-theoretic studies of vaccination behavior have typically assumed that all individuals react to disease incidence with the same responsive dynamics, i.e., the same formula for the perceived probability of infection. that may not actually be the case. liu et al. proposed that a few individuals will be "committed" to vaccination, perhaps because they have a low threshold for feeling at risk (or strongly held convictions), and will want to be immunized as soon as they hear that someone is infected [87]. they studied how the presence of committed vaccinators, a small fraction of individuals who consistently hold the vaccinating strategy and are immune to influence, impacts vaccination dynamics in well-mixed and spatially structured populations. the researchers showed that even a relatively small proportion of these agents (such as 5%) can significantly reduce the scale of an epidemic, as shown in fig. 4. the effect is much stronger when all individuals are uniformly distributed on a square lattice than in the well-mixed case. their results suggest that committed individuals can have a remarkable effect, acting as "steadfast role models" in the population that seed vaccine uptake in others while also disrupting the appearance of clusters of free-riders, which might otherwise seed the emergence of a global epidemic. one important message to take away from ref. [87] is that we might never guess what would happen by looking at the decision-making rules alone, in particular when our choices influence, and are influenced by, the choices of other people. another good example can be found in a recent work [89], in which zhang et al. proposed an evolutionary epidemic game where individuals can choose vaccination, self-protection or laissez faire as their strategy towards infectious diseases and adjust their strategies according to their neighbors' strategies and payoffs. the "disease-behavior" coupling is similar to the one implemented in ref. [85], where the sir epidemic spreading process and the strategy updating process alternate. by both stochastic simulations and theoretical analysis, the authors found the counter-intuitive phenomenon that a better condition (i.e., a larger success rate of self-protection) may unfortunately result in a lower system payoff. the reason is that, when the success rate of self-protection increases, people become more speculative and less interested in vaccination. since a vaccinated individual creates an "externality" for the system, in that the decision to vaccinate diminishes not only the individual's own risk of infection but also the risk for the people with whom the individual interacts, the reduction in vaccination can remarkably increase the risk of infection. the observed counter-intuitive phenomenon is reminiscent of the well-known braess's paradox in traffic, where adding roads may lead to more severe congestion [90]. this work provides another interesting example analogous to braess's paradox, namely, that a higher success rate of self-protection may eventually enlarge the epidemic size and thus diminish positive health outcomes. it also raises a challenge to public health agencies regarding how to protect the population during an epidemic.
the government should carefully consider how to distribute its resources and money between messages supporting vaccination, hospitalization, self-protection, and so on, since the outcome of a policy largely depends on the complex interplay among the type of incentive, individual behavioral responses, and the intrinsic epidemic dynamics. in further work [91], the authors investigated the effects of two types of incentive strategies on epidemic control: a partial-subsidy policy, in which a certain fraction of the cost of vaccination is offset, and a free-subsidy policy, in which randomly selected donees are vaccinated at no cost. through mean-field analysis and computations, they found that, under the partial-subsidy policy, vaccination coverage depends monotonically on the sensitivity of individuals to payoff differences, whereas the dependence is non-monotonic for the free-subsidy policy. because donees act as role models for relatively irrational individuals and keep their strategies unchanged when individuals are rational, the free-subsidy policy generally leads to higher vaccination coverage. these findings substantiate, once again, that any disease-control policy should be exercised with extreme care: its success depends on the complex interplay among the intrinsic mathematical rules of epidemic spreading, governmental policies, and the behavioral responses of individuals. as the above subsection shows, research on disease-behavior dynamics on networks has become one of the most fruitful realms of statistical physics and non-linear science, and has shed new light on how to predict the impact of individual behavior on disease spread and prevention [92-94,85,95-99,79]. however, in some scenarios the simple hypothesis that individuals are connected to each other within a single infrastructure (namely, the single-layer networks of section 3.1) may lead to overestimation or underestimation of the diffusion and prevention of disease, since agents can simultaneously be elements of more than one network in most, though not all, empirical systems [29,28,100]. in this sense, it is constructive to go beyond traditional single-layer network theory and adopt a new architecture that incorporates the multiple roles or connections of individuals into an integrated framework. multilayer networks, defined as combinations of networks interrelated in a nontrivial way (usually by sharing nodes), have recently become a fundamental tool to quantitatively describe the interactions within and between network layers. an example of a multilayer network is visualized in fig. 5 [101]. a social network layer supports the social dynamics related to individual behavior and the main prevention strategies (such as vaccination), while the biological layer provides a platform for the spreading of the biological disease. each individual is a node in both network layers. the coupled structure can generate more diverse outcomes than either isolated network, and can produce multiple (positive or negative) effects on the eradication of infection. because of the connections between layers, the dynamics of control measures affects the trajectory of the disease on the biological network, and vice versa.
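a minimal way to set up such a two-layer structure in code is sketched below, using networkx; the choice of er and sf layer topologies, the seed values, and the initial conditions are illustrative assumptions, not a prescription from the cited works.

    import networkx as nx
    import random

    # a minimal two-layer multiplex: the same node set appears in a "biological"
    # contact layer (disease spreads here) and a "social" layer (awareness or
    # vaccination decisions spread here).
    N = 1000
    contact_layer = nx.erdos_renyi_graph(N, 6.0 / N, seed=1)   # physical contacts
    social_layer = nx.barabasi_albert_graph(N, 3, seed=2)      # communication ties

    # node attributes shared across layers (one individual = one node id)
    disease_state = {i: 'S' for i in range(N)}
    aware = {i: False for i in range(N)}
    for i in random.sample(range(N), 5):
        disease_state[i] = 'I'
        aware[i] = True   # assumption: initially infected individuals start out aware

    # a coupled update would iterate over contact_layer.neighbors(i) for infection
    # and over social_layer.neighbors(i) for information, as described in the text.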
under such a framework, which is composed of at least two networks with different topologies over the same set of nodes, nodes not only exchange information with their counterparts in other layers via inter-layer connections, but also transmit infection to their neighbors through intra-layer connections. subsequently, further theoretical frameworks and models, such as interdependent networks, multiplex networks and interconnected networks, have been proposed [102-104]. the broad applicability of multilayer networks and their success in providing insight into the structure and dynamics of realistic systems have generated considerable excitement [105-109]. of course, the study of disease-behavior dynamics in this framework is a young and rapidly evolving research area, which is systematically surveyed in what follows. interplay between awareness and disease. as fig. 5 illustrates, different dynamical processes over the same set of nodes, each with its own connection topology, can be encapsulated in a multilayer structure (technically, these are referred to as multiplex networks [28,29]). aiming to explore the interrelation between social awareness and disease spreading, granell et al. recently incorporated information awareness into a disease model embedded in a multiplex network [110], where a physical contact layer supports the epidemic process and a virtual contact layer supports awareness diffusion. in analogy with the sis model (where an s node can be infected with transmission probability β, and an i node recovers with rate μ), the awareness dynamics, composed of aware (a) and unaware (u) states, assumes that a node in state a may lose its awareness with probability δ and re-obtain awareness with probability ν. the two processes are then coupled via the combinations of individual states: unaware-susceptible (us), aware-susceptible (as), unaware-infected (ui), and aware-infected (ai), whose switches are captured by the transition probability trees in fig. 6. (fig. 6 caption: transition probability trees of the combined states for the coupled awareness-disease dynamics at each time step in the multilayer network. an aware (a) node becomes unaware (u) with transition probability δ and regains awareness otherwise; for the disease, μ is the transition probability from infected (i) to susceptible (s). the four combined states ai, as, ui and us switch according to the probabilities r_i, q_i^a and q_i^u, which denote, respectively, the neighbor-driven transition probability from unaware to aware, the transition probability from susceptible to infected for an aware node, and the transition probability from susceptible to infected for an unaware node. we refer to [110], from which the figure is adapted, for further details.) using monte carlo simulations, the authors showed that the coupled dynamical processes change the onset of the epidemic and allow them to capture the evolution of the epidemic threshold (depending on the structure and on the interrelation with the awareness process), which can be accurately validated by a markov-chain approximation approach. more interestingly, they unveiled that an increase in the awareness transmission rate can lower the long-term disease incidence while raising the outbreak threshold of the epidemic.
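the following is a monte carlo sketch of one time step of such coupled awareness-disease dynamics in the spirit of [110]; it is not the markov-chain formulation used there, and the parameter names (lam, delta, beta_u, beta_a, mu) are an illustrative mapping with beta_a < beta_u. the rule that infection implies immediate awareness follows the assumption of [110] noted below.

    import random

    def uau_sis_step(aware, infected, social_nbrs, contact_nbrs,
                     lam, delta, beta_u, beta_a, mu):
        """one monte carlo step of coupled awareness (uau) and disease (sis)
        dynamics on a multiplex. lam: awareness transmission probability;
        delta: awareness loss; beta_u / beta_a: infection probability for
        unaware / aware susceptibles; mu: recovery probability."""
        new_aware, new_infected = dict(aware), dict(infected)
        for i in aware:
            # awareness layer: uau dynamics on the social (virtual) layer
            if aware[i]:
                if random.random() < delta:
                    new_aware[i] = False
            else:
                for j in social_nbrs[i]:
                    if aware[j] and random.random() < lam:
                        new_aware[i] = True
                        break
            # disease layer: sis dynamics on the physical contact layer
            if infected[i]:
                if random.random() < mu:
                    new_infected[i] = False
            else:
                beta = beta_a if aware[i] else beta_u
                for j in contact_nbrs[i]:
                    if infected[j] and random.random() < beta:
                        new_infected[i] = True
                        new_aware[i] = True   # infected nodes become aware immediately
                        break
        return new_aware, new_infected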
despite this progress, the above findings rest on two hypotheses: infected nodes become immediately aware, and aware individuals are completely immune to the infection. to capture more realistic scenarios, the authors relaxed both assumptions and introduced mass media that disseminate information to the entire system [111]. they found that the degree of immunity of aware individuals and the mass media affect the critical relation between the two competing processes. more importantly, the existence of mass media makes the metacritical point (where the critical onset of the epidemic starts) of ref. [110] disappear. the social dynamics were further extended to an awareness cascade model [112], in which agents exhibit herd-like behavior because they make decisions based on the actions of other individuals. interestingly, it is found that a local awareness ratio (governing whether unaware individuals become aware) of approximately 0.5 has a two-stage effect on the epidemic threshold (i.e., an abrupt transition of the epidemic threshold) and can lead to different epidemic sizes, irrespective of the network structure. that is to say, when the local awareness ratio is in the range [0, 0.5), the epidemic threshold takes a fixed, larger value; in the range [0.5, 1], it takes a fixed yet smaller value. as for the final epidemic size, it increases much more slowly when the local awareness ratio lies in [0, 0.5) than when it lies in [0.5, 1]. these findings suggest a new way of understanding realistic contagions and their prevention. besides obtaining awareness from aware neighbors, self-awareness induced by infected neighbors is another scenario that currently attracts research attention [113]; there it is found that coupling such a process with disease spreading can lower the density of infection, but does not increase the epidemic threshold, regardless of the information source. coupling between disease and preventive behaviors. thus far, many studies have shown that considering the simultaneous diffusion of disease and preventive measures on the same single-layer network is an effective way to evaluate the incidence and onset of disease [94,85,95-98,114,79]. however, if both processes are coupled through a multilayer infrastructure, how does this affect the spreading and prevention of disease? inspired by this question, ref. [115] suggested a conceptual framework in which two fully or partially coupled networks are employed to transmit disease (an infection layer) and to channel individual decisions about preventive behavior (a communication layer). the protective strategies considered include wearing facemasks, washing hands frequently, taking pharmaceutical drugs, and avoiding contact with sick people, which are the only means of control in situations where vaccines are not yet available. it is found that the structure of the infection network, rather than the communication network, has a dramatic influence on disease transmission and the uptake of protective measures. in particular, during an influenza epidemic, the coupled model can lead to lower infection rates, which indicates that single-layer models may overestimate disease transmission. in line with this finding, the author further extended the above setup into a triply coupled diffusion model (adding the information flow about the disease on a new layer) through metropolitan social networks [116].
during an epidemic, these three diffusion dynamics interact with each other and form negative and positive feedback loops. compared with empirical data, the proposed model reasonably replicates the realistic trends of influenza spread and information propagation. the author pointed out that this model has the potential to develop into a virtual platform for health decision makers to test the efficiency of disease control measures in real populations. much previous work shows that behavior and spatial structure can suppress epidemic spreading. in contrast, other recent research, using a multiplex network consisting of a disease transmission (dt) network and an information propagation (ip) network through which vaccination strategies and individual health-condition information can be communicated, finds that, compared with the traditional single-layer case (namely, symmetric interaction), the multiplex architecture suppresses vaccination coverage and leads to more infection [117]. this phenomenon is caused by a sharp decline in vaccination among small-degree nodes, which are usually the more numerous in heterogeneous networks. similarly, wang et al. considered an asymmetric interplay between disease spreading and information diffusion in multilayer networks [118]. it is assumed that different dynamics take place on the communication layer and the physical-contact layer, and that vaccination occurs only on the latter. more specifically, the vaccination decision of a node in the contact network is related not only to the states of its intra-layer neighbors, but also to its counterpart node in the communication layer. by means of extensive simulations and mean-field analysis, they found that, for an uncorrelated coupling architecture, a disease outbreak in the contact layer induces an outbreak of the disease in the communication layer, and information diffusion can effectively raise the epidemic threshold. the inclusion of inter-layer correlation, however, dramatically changes the onset of the disease, but not the information threshold. dynamical networks also play an important role in the incidence and onset of epidemics. along this line of research, the most commonly used approach is adaptive networks [119-122], where nodes frequently adjust their connections according to the environment or the states of neighboring nodes. time-varying networks (also named temporal networks) provide another framework, in which the connection topology changes in an activity-driven fashion [31,123,32]. here, we briefly review the progress of disease-behavior dynamics on adaptive and time-varying networks. contact switching as a potential protection strategy. from the adaptive viewpoint, the most straightforward way to avoid contact with infective acquaintances is to break the links between susceptible and infective agents and construct new connections. along these lines, gross et al. first proposed an adaptive scenario in which a susceptible node can prune a link to an infected neighbor and rewire to a healthy agent with a certain probability [124]. the rewiring probability can be regarded as a measure of the strength of the protection strategy. it is shown that different values of this probability give rise to various degree mixing patterns and degree distributions.
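a sketch of the rewiring move just described (susceptible end of an s-i link detaches and reattaches to another susceptible), written with networkx; the function name and the probability w are illustrative labels rather than the notation of [124].

    import networkx as nx
    import random

    def adaptive_rewiring_step(G, state, w):
        """rewiring rule in the spirit of [124]: each s-i link is, with
        probability w, broken by the susceptible end and rewired to a randomly
        chosen other susceptible node (no self-loops or duplicate edges)."""
        susceptibles = [n for n, s in state.items() if s == 'S']
        for u, v in list(G.edges()):
            if {state[u], state[v]} == {'S', 'I'} and random.random() < w:
                s_node = u if state[u] == 'S' else v
                candidates = [n for n in susceptibles
                              if n != s_node and not G.has_edge(s_node, n)]
                if candidates:
                    G.remove_edge(u, v)
                    G.add_edge(s_node, random.choice(candidates))

alternating this step with an ordinary sis update is enough to reproduce, qualitatively, how stronger rewiring reshapes the degree distribution over time.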
based on low-dimensional approximations, the authors also showed that their adaptive framework can predict novel dynamical features, such as bistability, hysteresis, and first-order transitions, which are robust with respect to the details of the disease dynamics [125,126]. in spite of these advances, the existing analytical methods generally cannot make accurate predictions about the simultaneous time evolution of disease and network topology. to overcome this limitation, marceau et al. introduced an improved compartmental formalism, which shows that the initial conditions play a crucial role in disease spreading [127]. in the above examples, contact switching as a strategy has proven effective in controlling epidemic outbreaks. however, in some realistic cases the population information may be asymmetric, especially during the rewiring of links. to relax this constraint, a new adaptive algorithm was recently suggested: an infected link can be pruned by either endpoint, who then reconnects to a randomly selected member of the population rather than to a known susceptible agent (that is, the individual has no prior information on the state of every other agent) [128,129]. for example, ref. [129] showed that such reconnection behavior can completely suppress the spread of disease via continuous and discontinuous transitions, and that this is universally effective in more complex situations. besides the phenomena of oscillation and bistability, another dynamical feature, epidemic reemergence, has attracted great interest in a recent study [122], in which susceptible individuals adaptively break connections with infected neighbors while avoiding becoming isolated, in a growing network. under these rules, the authors observed that the number of infected agents stays at a low level for a long time, then suddenly erupts to a high level before declining to a low level again. this process repeats several times until the final eradication of the infection, as illustrated in fig. 7(a). (fig. 7 caption: panel (a) shows the time course of the number of infected nodes when network growth, the link-removal process, and isolation avoidance are simultaneously built into the adaptive framework; this mechanism creates a reemergence of the epidemic, which dies out after several such repetitions, and the phenomenon is closely related to the formation of a giant component of susceptible nodes. panel (b) shows a snapshot of the network topology at the 5000th time step (before the next abrupt outbreak), when there is a giant component of susceptible nodes (yellow); the invasion of infected individuals (red) then splits the whole network into many fragments, as shown by the snapshot of the 5400th time step (after the explosion) in panel (c). we refer to [122], from which the figure is adapted, for further details; for interpretation of the color references, the reader is referred to the web version of this article.) regarding the underlying mechanism, the reemergence is related to the invasion of infected individuals into the susceptible giant component. the link-removal process suppresses disease spreading, which lets susceptible agents form a giant component while infected agents form small, non-isolated clusters, as shown in fig. 7(b). however, the entrance of new nodes may bring new infection risk to this giant component, which accelerates the next outbreak of infection and crashes the network again (see fig. 7(c)).
interestingly, this finding may help to explain the phenomenon of repeated epidemic explosions in real populations. looking back over the above-cited literature, we find a common feature: besides the disease process itself, the adaptive adjustment of individual connections ultimately changes the degree distribution of the network. an interesting question naturally poses itself: is there an adaptive scenario that preserves the degree distribution, i.e., in which each individual keeps the total number of its neighbors constant? to fill this gap, the neighbor exchange model is a very useful tool [130]: each individual's number of current neighbors remains fixed while the composition, or identity, of those contacts changes in time. similar to the well-known watts-strogatz small-world (sw) network algorithm [25], this model allows an exchange mechanism in which the destination nodes of two edges are swapped at a given rate (a sketch of this move appears after this paragraph). incorporating the diffusion of the epidemic, the model builds a bridge between static network models and mass-action models. based on empirical data, the authors further showed that this model is very effective for forecasting and controlling outbreaks of sexually transmitted diseases. along these lines, the potential influence of other topological properties (such as growing networks [131] and rewiring sf networks [132]) has recently been identified from the adaptive viewpoint, and it dramatically changes the outbreak threshold of the disease. vaccination, immunization and quarantine as avoidance behaviors. as in static networks, vaccination can also be introduced into adaptive architectures, where connection adjustment is an individual response to the presence of infection risk in the neighborhood. motivated by realistic immunization scenarios, disease prevention has been implemented by adding poisson-distributed vaccination of susceptible individuals [133]. because of the interplay between network rewiring and vaccination, the authors showed that vaccination is far more effective in an adaptive network than in a static one, irrespective of the disease dynamics. some other control measures have further been encapsulated into adaptive community networks [134]. besides various transitions of the community structure, both immunization and quarantine strategies show the counter-intuitive result that "the earlier, the better" does not hold for disease prevention. moreover, the prevention efficiency of the two measures differs greatly, and the optimal effect is obtained when a strong community structure exists. vaccination on time-varying networks. in contrast to the mutual feedback between dynamics and structure in adaptive frameworks, time-varying networks provide a different angle for network research, in which the network connections and the dynamical process evolve according to their own respective rules [135-137]. (fig. 8 caption: vaccination coverage as a function of the relative cost of vaccination and the fraction of imitators in different networks; for a small vaccination cost, imitation behavior increases vaccination coverage but impedes vaccination at high cost, irrespective of the underlying interaction topology. the figure is reproduced from [93].) for example, lee et al. recently explored how to lower the number of vaccinated people needed to protect the whole system on time-varying networks [138].
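the edge-swap move underlying the neighbor exchange model can be sketched as follows; the function name is illustrative, and the implementation is a generic degree-preserving double edge swap rather than the exact algorithm of [130].

    import networkx as nx
    import random

    def neighbor_exchange(G, n_swaps):
        """swap the endpoints of randomly chosen edge pairs: (a-b, c-d) becomes
        (a-d, c-b). node degrees are preserved; only the identities of the
        contacts change, as in the neighbour exchange idea of [130] (sketch)."""
        edges = list(G.edges())
        done = 0
        while done < n_swaps:            # assumes a reasonably large, sparse graph
            (a, b), (c, d) = random.sample(edges, 2)
            if len({a, b, c, d}) < 4 or G.has_edge(a, d) or G.has_edge(c, b):
                continue
            G.remove_edges_from([(a, b), (c, d)])
            G.add_edges_from([(a, d), (c, b)])
            edges = list(G.edges())
            done += 1
        return G

networkx also ships a similar utility, double_edge_swap, which performs the same degree-preserving move.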
based on past information, they could accurately target vaccination and estimate future disease outbreaks, which proves that time-varying structure can make protection protocols more efficient. in [139], the authors showed that limited information on the contact patterns is sufficient to design efficient immunization strategies. in these two works, however, the vaccination strategy is somewhat independent of human behavior and the decision-making process, which leaves an open issue: if realistic disease-behavior dynamics is introduced into a time-varying topology (especially in combination with the diffusion of opinion clusters [140]), how does it affect the eradication of disease? we continue to discuss some of these and similar issues in section 4 on empirically-derived networks. some research uses networks derived from empirical data in order to examine disease-behavior dynamics. we discuss these models in this subsection. dynamics on different topologies. heterogeneous contact topology is ubiquitous in reality. to test its potential impact on disease spreading, ndeffo mbah et al. recently integrated a behavioral epidemiology model with a decision-making process on three archetypical realistic networks: a poisson network, an urban network and a power-law network [93]. on these contact networks, an agent can make decisions either purely on payoff maximization or by imitating the vaccination behavior of a neighbor (as suggested by eq. (6)), the balance being controlled by the fraction of imitators. by means of extensive simulations, they demonstrated the dual effect of imitation behavior: it enhances vaccination coverage when the vaccination cost is low, but impedes the vaccination campaign when the cost is relatively high, as depicted in fig. 8. surprisingly, in spite of high vaccination coverage, imitation can generate clusters of non-vaccinating, susceptible agents, which in turn accelerate large-scale outbreaks of infectious disease (that is, imitation behavior to some extent impedes the eradication of infectious diseases). this point helps to explain why outbreaks of measles have recently occurred in many countries with high overall vaccination coverage [140,143,144]. using the same social networks, ref. [141] explored the impact of heterogeneous contact patterns on disease outbreak in a compartmental model of sars. interestingly, compared with the prediction for a well-mixed population, the same basic reproductive number may lead to completely different epidemiological outcomes in any two of these settings, which sheds light on the heterogeneity of sars outbreaks around the world. impact of network mixing patterns. as ref. [93] discovered, high vaccination coverage can guarantee herd immunity, which, however, is dramatically affected and even destroyed by clusters of unvaccinated individuals. to evaluate how much influence such clusters possess, a recent work explored the distribution of vaccinated agents during seasonal influenza vaccination on a united states high school contact network [142]. the authors found that contact networks are positively assortative with respect to vaccination behavior.
that is to say, large-degree unvaccinated (vaccinated) agents are more likely to be in contact with other large-degree unvaccinated (vaccinated) agents, which results in a larger outbreak than in comparable networks without this correlation, since the positively assortative unvaccinated agents breed larger susceptible clusters. this finding highlights, once again, the importance of heterogeneity in vaccine uptake for the prevention of infectious disease. in fact, the growing availability of human-generated data and computing power has driven the rapid mapping of various social, technological and biological networks [145-148]. on these empirical networks, a wealth of disease-behavior models can be considered in order to analyze the efficiency of existing or newly proposed prevention measures and to provide constructive viewpoints for public health policy makers [149-154]. based on the above achievements, it is now clear that incorporating behavioral epidemiology into networked populations has opened a new window for the study of epidemic transmission and prevention. to capture the overall picture, table 2 provides a summary of the reviewed characteristics of disease-behavior dynamics in networked populations. (table 2: classification of disease-behavior research outcomes according to dynamical characteristics in the networked populations reviewed in section 3; the same type of network is frequently applied to different problems.) it is worth mentioning that some works (e.g., [93,117]) may appear in two categories because they simultaneously consider the influence of individual behavior and of special network structure. (fig. 9 caption: age-specific contact matrices for each of the eight examined european countries; high contact rates are represented by white, intermediate contact rates by green and low contact rates by blue. we refer to [155], from which the figure is adapted, for further details; for interpretation of the color references, the reader is referred to the web version of this article.) many of these achievements are closely related to physics phenomena (see table 3), through which we can estimate the effect of the proposed strategies and measures. on the other hand, these achievements are also inseparable from classical physics methods (table 3). (table 3: observed physical phenomena and frequently used methods in the study of disease-behavior dynamics on networks; the phenomena include the epidemic threshold, phase transitions, self-organization, pattern formation, and vaccination/immunization thresholds, while the methods include mean-field prediction, generating functions, percolation theory, stochastic processes, bifurcation and stability analysis, monte carlo simulation, and the markov-chain approximation.) in particular, monte carlo simulation and mean-field prediction have attracted the greatest attention due to their simplicity and high efficiency. for a comprehensive understanding, we provide a general example of mean-field theory for behavioral epidemiology in appendix a. the first mathematical models studied the adaptive dynamics of disease-behavior responses in homogeneously mixed populations, assuming that individuals interact with each other at the same contact rate, without restrictions on selecting potential partners. networked dynamics models shift the focus to the effects of interpersonal connectivity patterns, since everyone has their own set of contacts through which interpersonal transmission can occur. the contacts between members of a population constitute a network, which is usually described by some well-known proxy model of synthetic networks, as shown in section 3. this physics-style treatment, using evidence-based parsimonious models, is valuable for illustrating fascinating ideas and revealing unexpected phenomena.
however, it is not always a magic bullet for understanding, explaining, or predicting realistic cases. in recent years, studies based on social experiments have become more and more popular; they contribute new insight for parameterizing and designing more appropriate dynamical models. this section briefly introduces the progress in this field. a large-scale population-based survey of human contact patterns in eight european countries was conducted in ref. [155], collecting empirical data on self-reported face-to-face conversations and skin-to-skin physical contacts. the data analysis splits the population into subgroups on the basis of properties such as age and location, and scrutinizes the contact rates between subgroups (see fig. 9). it reveals that, across these countries, people are more likely to contact others of the same age group. combining self-reported contact data with serological testing data, recent case studies are able to reveal location- or age-specific patterns of exposure to pandemic pathogens [156,157], which provide rich information for complex network modeling. sf networks have been widely used to model the connectivity heterogeneity of human contact networks. in sf networks, each hub member can have numerous connections, including all of its potential contacts relevant to transmission. however, such an assumption might not fully agree with human physiological limitations on the capacity to maintain a large number of interactions: generally, the size of an individual's social network is restricted to around 150 people [158,159]. to better characterize the features of human contact behavior, social experiments studying active contacts in realistic circumstances are valuable. thanks to the development of information and communication technologies, digital equipment has become increasingly popular for collecting empirical data on human contacts in realistic social circumstances. it is instructive to first review a few brief examples. refs. [160,161] used the bluetooth technique embedded in mobile phones to collect proxy data on person-to-person interactions at the mit media laboratory in the reality mining program; with the help of wireless sensors, a social experiment was conducted to trace close-proximity contacts among the members of an american high school [162]; refs. [163-166] used active radio frequency identification devices (rfid) to establish a flexible platform recording face-to-face proximity contacts among volunteers, which has been deployed in various social contexts such as a conference, a museum, a hospital, and a primary school; and wifi access data for all students and staff of a chinese university have also been analyzed as indirect proxy records of their concurrent communications [167,168]. compared with the above-mentioned questionnaire data, the electronic data generated by digital experiments are more accurate and objective. moreover, some new and interesting findings are listed as follows. the data analysis reveals the unexpected feature that the distribution of the number of distinct persons each individual encounters every day has only a small squared coefficient of variation [162,164,166-169], irrespective of the specific social context (see fig. 10, which is adapted from [164]).
this homogeneity in the node-degree distribution indicates the absence of connectivity hubs and is, to some extent, a consequence of our physiological limitations. the dynamics of human interactions does not evolve at an equilibrium state, but is highly fluctuating and time-varying in realistic situations. this can be characterized by measuring the statistical distributions of the duration per contact and of the time intervals between successive contacts [163]. as shown in fig. 11, these two statistics both have broad distributions spanning several orders of magnitude. most contact durations and inter-contact intervals are very short, but long durations and intervals also emerge, which corresponds to a bursty process without characteristic time scales [170]. (fig. 11(a) caption: similar to fig. 10, each symbol denotes one venue; sfhh denotes sfhh, nice, fr; eswc09 (eswc10) is eswc 2009 (2010), crete, gr; and ps corresponds to a primary school, lyon, fr. we refer to [174], from which the figure is adapted, for further details.) the coexistence of homogeneity in node degree and heterogeneity in contact durations leads to unexpected phenomena; for example, low-degree nodes, which are insignificant in conventional network models, can act as hubs in time-varying networks [171]. the usage of electronic devices provides an easy and nonintrusive approach to contact tracing, which can help in understanding health-related behaviors in realistic settings. to measure close-proximity interactions between health-care workers (hcws) and their hand hygiene adherence, polgreen et al. performed experiments by deploying wireless sensor networks in a medical intensive care unit of the university of iowa hospital. they confirmed the effect of peer pressure on improving hand hygiene participation [172]: the proximity of other hcws promotes hand hygiene adherence. they also analyzed the role of "superspreaders", who encounter others with high frequency [173]; disease severity increases with the hand hygiene noncompliance of such people. beyond empirical data on contact networks, social behavior experiments (or surveys) also play an important role in understanding vaccination campaigns and disease spreading, especially in combination with the decision-making process. here we review recent progress in this realm. role of altruistic behavior. game theory has been extensively used in the study of behavioral epidemiology, where individuals are usually assumed to decide whether to vaccinate based on the principle of maximizing self-interest [9,175]. in reality, however, when people make vaccination decisions, do they only consider their own benefit? to test this fundamental assumption, ref. [176] recently conducted a survey of individual vaccination decisions during the influenza season. the questionnaires, from a direct campus survey and an internet-based survey, are mainly composed of two kinds of items: self-interest items (the concern about becoming infected) and altruistic items (the concern about infecting others), as schematically illustrated in fig. 12. if agents are driven by self-interest, they attempt to minimize their own cost associated with vaccination and infection, which gives rise to the selfish equilibrium (the nash equilibrium). by contrast, if individual decisions are guided by altruistic motivation, the vaccination probability reaches the community optimum (the utilitarian equilibrium), at which the overall cost to the community is minimal.
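a sketch of how the contact statistics just described (durations, inter-contact gaps, and the squared coefficient of variation of daily degree) can be computed from timestamped proximity records; the record format, a list of (t, i, j) tuples at a 20-second resolution, is an illustrative stand-in for rfid-style data, not the format of any specific deployment.

    from collections import defaultdict

    def contact_statistics(records, resolution=20):
        """records: iterable of (t, i, j) meaning 'i and j were in face-to-face
        proximity during [t, t + resolution)'. returns lists of contact
        durations and inter-contact gaps, in seconds."""
        times = defaultdict(list)
        for t, i, j in records:
            times[frozenset((i, j))].append(t)
        durations, gaps = [], []
        for ts in times.values():
            ts.sort()
            start = prev = ts[0]
            for t in ts[1:]:
                if t - prev > resolution:             # contact interrupted
                    durations.append(prev - start + resolution)
                    gaps.append(t - prev - resolution)
                    start = t
                prev = t
            durations.append(prev - start + resolution)
        return durations, gaps

    def squared_cv(xs):
        """squared coefficient of variation, e.g. of daily numbers of distinct contacts."""
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs) / (m * m)

    # toy usage: nodes 1 and 2 meet for two 20 s slots, then once more later
    print(contact_statistics([(0, 1, 2), (20, 1, 2), (100, 1, 2)]))   # ([40, 20], [60])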
the authors found that altruism plays an important role in vaccination decisions and can be quantitatively measured by a "degree of altruism". to further evaluate its impact, they incorporated the empirical data and altruistic motivation into an svir compartmental model. interestingly, they found that altruism can shift vaccination decisions from individual self-interest towards the community optimum, greatly enhancing total vaccination coverage and reducing the total cost, morbidity and mortality of the whole community, irrespective of the parameter setup. along this line, the role of altruistic behavior in age-structured populations was further explored [177]. according to general experience, elderly people, who are most likely to be infected in the case of influenza, should be most protected by young vaccinators, who are responsible for most disease transmission. to examine under which conditions young agents vaccinate to better protect old ones, the authors organized a corresponding social behavior experiment: participants were randomly assigned to "young" and "elderly" roles (with young players contributing more to herd immunity yet elderly players facing higher costs of infection). if players were paid based on individual point totals, more elderly than young players got vaccinated, which is consistent with the theoretical prediction of self-interested behavior (namely, the nash equilibrium). (fig. 12 caption: schematic illustration of the questionnaire used in the voluntary vaccination survey; the survey items are divided into self-interest items (outcomes-for-self) and altruism items (outcomes-for-others), each with corresponding scores, from which the degree of altruism, which plays a significant role in vaccination uptake and epidemic elimination, can be indirectly estimated. we refer to [176], from which the figure is adapted, for further details.) by contrast, players paid according to group point totals made decisions in a manner consistent with the utilitarian equilibrium, which predicts community-optimal behavior: more young than elderly players got vaccinated, at a lower overall cost. in this sense, the payoff structure plays a vital role in the emergence of altruistic behavior, which in turn affects disease spreading. from both empirical studies, we can observe that altruism significantly impacts vaccination coverage as well as the consequent disease burden. it can drive the system to the community optimum, where the smallest overall cost guarantees herd immunity. it is thus suggested that realistic policies should regard altruism as one potential lever for improving public health outcomes. existence of free-riding behavior. alongside altruistic behavior, another type of behavior addressed within decision-making frameworks is free-riding, which means that people can benefit from the actions of others while avoiding any cost [79,178,179]. in a voluntary vaccination campaign, free-riders are unvaccinated individuals who avoid infection because of herd immunity, as illustrated by the gray nodes in fig. 4. to explore the impact of free-riding behavior, hershey et al. organized a questionnaire containing six different hypothetical scenarios twenty years ago [180]. in that survey, altruism and free-riding were simultaneously considered as potential decision motivations for vaccination.
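to make the distinction between the selfish (nash) and the utilitarian equilibrium concrete, the sketch below computes both numerically under textbook assumptions: homogeneous mixing, sir dynamics with basic reproductive number R0, a perfect vaccine, and a relative vaccination cost c. this is a standard construction for illustration, not the analysis carried out in [176] or [177].

    import math

    def attack_rate(R0, p):
        """final-size relation for sir with perfect vaccination coverage p:
        phi = 1 - exp(-R0 * (1 - p) * phi), solved by fixed-point iteration.
        phi is the infection probability of an unvaccinated individual."""
        phi = 0.5
        for _ in range(2000):
            phi = 1.0 - math.exp(-R0 * (1.0 - p) * phi)
        return phi

    def nash_coverage(R0, c, grid=500):
        """selfish coverage: people vaccinate while infection risk exceeds cost c."""
        for k in range(grid + 1):
            p = k / grid
            if attack_rate(R0, p) <= c:
                return p
        return 1.0

    def utilitarian_coverage(R0, c, grid=500):
        """community-optimal coverage: minimize total cost c*p + (1 - p)*phi(p)."""
        costs = [(c * k / grid + (1 - k / grid) * attack_rate(R0, k / grid), k / grid)
                 for k in range(grid + 1)]
        return min(costs)[1]

    R0, c = 2.5, 0.1
    print(nash_coverage(R0, c), utilitarian_coverage(R0, c))

for small c, the utilitarian coverage sits near the herd-immunity level 1 - 1/R0 and exceeds the nash coverage, which is the gap that altruistic motivation can help to close.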
they found that, for a vaccine conferring herd immunity, the free-riding frame produces a weaker tendency to increase vaccination than the altruism frame does; that is, free-riding lowers the preference for vaccination when the proportion of others vaccinating increases. in addition to homogeneous groups of individuals, ibuka et al. recently conducted a computerized influenza experiment in which groups of agents may face completely different conditions, such as infection risk, vaccine cost, severity of influenza and age structure [181]. they found that a high vaccination rate in previous rounds decreases the likelihood of individuals accepting vaccination in the following round, indicating the existence of free-riding behavior. both empirical surveys thus showed that individuals' decision-making may be driven by the free-riding motive, which depresses vaccination coverage. beyond these examples, there exist further factors, such as individual cognition [182] and confidence [183], that affect vaccination decisions in reality. where possible, these factors should be taken into consideration by public policy makers in order to reach the necessary level of vaccination coverage. the growth of online social networks such as twitter in recent years provides a new opportunity to obtain data on health behaviors in near real-time. using short text messages (tweets) collected from twitter between august 2009 and january 2010, during which pandemic influenza a (h1n1) spread globally, salathé et al. analyzed the spatiotemporal sentiments of individuals towards the novel influenza a (h1n1) vaccine [97]. they found that vaccination rates projected on the basis of the sentiments of twitter users are in good accord with those estimated by the united states centers for disease control and prevention. they also revealed a critical problem: both negative and positive opinions can cluster to form network communities. if this generates clusters of unvaccinated individuals, the risk of disease outbreaks is largely increased. we have reviewed some of the recent, rapidly expanding research literature concerning the nonlinear coupling between disease dynamics and human behavioral dynamics in spatially distributed settings, especially complex networks. generally speaking, these models show that emergent self-protective behavior can dampen an epidemic. this is also what most mean-field models predict. however, in many cases, that is where the commonality in model predictions ends. for populations distributed on a network, the structure of the disease contact network and/or the social influence network can fundamentally alter the outcomes, such that different models make very different predictions depending on the assumptions about the human population and the diseases being studied, including findings that disease-behavior interactions can actually worsen health outcomes by increasing long-term prevalence. also, because network models are individual-based, they can represent processes that are difficult to capture with mean-field (homogeneous mixing) models. for example, the concept of the neighbor of an individual has a natural meaning in a network model, but its meaning is less clear in mean-field models (or partial differential equation models), where populations are described in terms of densities at a point in space.
we speculate that the surge of research interest in this area has been fuelled by a combination of (1) the individual-based description that characterizes network models, (2) the explosion of available data at the individual level from digital sources, and (3) the realization, from recent experiences with phenomena such as vaccine scares and quarantine failures, that human behavior is becoming an increasingly important determinant of disease control efforts. we also discussed how many of the salient dynamics exhibited by disease-behavior systems are directly analogous to processes in statistical physics, such as phase transitions and self-organization. the growth in research has created both opportunities and pitfalls. a first potential pitfall is that coupled disease-behavior models are significantly more complicated than disease-only or behavior-only models. for a coupled disease-behavior model, it is necessary not only to have a set of parameters describing the human behavioral dynamics and the disease dynamics separately; one may also need a set of parameters describing the impact of human behavior on disease dynamics, and another set describing the effect of disease dynamics on human behavior. thus, roughly speaking, these models have four times as many parameters as a disease dynamic model or a human behavioral model on its own: they are subject to the "curse of dimensionality". a second pitfall is that relevant research from other fields may not be understood or incorporated in the best possible way. for example, the concept of 'social contagion' appears repeatedly in the literature on coupled disease-behavior models. this is a seductive concept, and it appears natural for discussing systems where a disease contagion is also present. however, the metaphor may be too facile. for example, how can the social contagion metaphor capture the subtle but important distinction between descriptive social norms (where individuals follow a morally neutral perception of what others are doing) and injunctive social norms (where individuals follow a morally laden perception of what others are doing) [184]? social contagion may be a useful concept, but we should remember that it is ultimately only a metaphor. a third pitfall is the lack of integration between theoretical models and empirical data; this pitfall is common to all mathematical modeling exercises. the second and third pitfalls are an unsurprising consequence of combining natural and human system dynamics in the same framework, and there are other potential pitfalls as well. these pitfalls also suggest ways in which the field can move forward. for example, the complexity of the models calls for new methods of analysis. in some cases, methods of rigorous analysis (including physics-based methods such as percolation theory and pair approximations (appendix b)), applied to sufficiently simple systems that permit such analysis, may provide clearer and more rigorous insights than the output of simulation models, which are often harder to fully understand. for systems that are too complicated for pen-and-paper methods, methods for the visualization of large and multidimensional datasets may prove useful. the second and third pitfalls, where physicists and other modelers, behavioral scientists and epidemiologists do not properly understand one another's fields, can be mitigated through more opportunities for interaction between the fields through workshops, seminars and colloquia.
interactions between scholars in these fields are often stymied by institutional barriers that emphasize a 'silo' approach to academia; thus, a change in institutional modes of operation could be instrumental in improving collaborations between modelers, behavioral scientists and epidemiologists. scientists have already shown that these pitfalls can be overcome, and this is evident in much of the research described in this review. the field of coupled disease-behavior modeling has the elements to suggest that it will continue expanding for the foreseeable future: growing availability of the data needed to test empirical models, a rich set of potential dynamics creating opportunities to apply various analysis methods from physics, and relevance to pressing problems facing humanity. physicists can play an important role in developing this field due to their long experience in applying modeling methods to physical systems. as an illustration of the pair approximations mentioned above (appendix b), the infection terms in the equation of motion for the number of susceptible-infected pairs take the form d[si]/dt = τ q(i|ss)[ss] − τ q(i|si)[si] + (recovery terms), where τ is the transmission rate per contact, [ss] is the number of susceptible-susceptible pairs in the population, q(i|ss) is the expected number of infected neighbors of a susceptible in a susceptible-susceptible pair, and similarly q(i|si) is the expected number of infected neighbors of the susceptible person in a susceptible-infected pair. the first term corresponds to the creation of new si pairs from ss pairs through infection, while the second term corresponds to the destruction of existing si pairs through infection, thereby creating ii pairs. an assumption must be made in order to close the equations at the pair level, thereby avoiding having to write down equations of motion for triples. for instance, on a random graph, the approximation q(i|ss) ≈ q(i|si) ≈ q(i|s) might be applied, where q(i|s) is the expected number of infected persons neighboring a susceptible person in the population. equations and pair approximations for the other pair variables [ss] and [ii] must also be written down, after which one has a closed set of differential equations that captures spatial effects implicitly by tracking the time evolution of pair quantities.
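the following is a numerical sketch of a closed pairwise sis model on a random n-regular network, using the standard triple closure [abc] ≈ ((n−1)/n)[ab][bc]/[b]; it is a textbook construction used here to illustrate the closure idea, not the exact system of the appendix, and all parameter values are illustrative.

    # pairwise sis with moment closure; pairs are counted as ordered pairs, so [si] = [is].
    N, n = 10000.0, 8.0          # population size, mean degree
    tau, g = 0.06, 0.2           # per-contact transmission rate, recovery rate
    dt, steps = 0.01, 60000      # simple forward-euler integration

    I = 10.0
    S = N - I
    SI = n * S * I / N           # initially infection is placed at random,
    SS = n * S * S / N           # so the pair counts start out well mixed
    II = n * I * I / N

    def triple(AB, BC, B):
        """closure [ABC] ~ ((n - 1) / n) [AB][BC] / [B]."""
        return (n - 1.0) / n * AB * BC / B if B > 0 else 0.0

    for _ in range(steps):
        SSI, ISI = triple(SS, SI, S), triple(SI, SI, S)
        dI = tau * SI - g * I
        dSI = g * (II - SI) + tau * (SSI - ISI - SI)
        dSS = 2.0 * g * SI - 2.0 * tau * SSI
        dII = 2.0 * tau * (ISI + SI) - 2.0 * g * II
        I, SI, SS, II = I + dt * dI, SI + dt * dSI, SS + dt * dSS, II + dt * dII
        S = N - I

    print("endemic prevalence (pair approximation):", I / N)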
the mathematics of infectious diseases
perspectives on the basic reproductive ratio
theory of games and economic behaviour
equilibrium points in n-person games
game theory with applications to economics
game theory and evolutionary biology
vaccination and the theory of games
universal scaling for the dilemma strength in evolutionary games
dangerous drivers foster social dilemma structures hidden behind a traffic flow with lane changes
computer science and game theory
group interest versus self-interest in smallpox vaccination policy
evolutionary game theory
evolutionary games on graphs
coevolutionary games-a mini review
emergent hierarchical structures in multiadaptive games
evolutionary game theory: temporal and spatial effects beyond replicator dynamics
evolutionary stable strategies and game dynamics
on the evolution of random graphs
statistical mechanics of complex networks
scaling and percolation in the small-world network model
emergence of scaling in random networks
collective dynamics of 'small-world' networks
the structure of scientific collaboration networks
complex networks: structure and dynamics
the structure and dynamics of multilayer networks
multilayer networks
evolutionary games on multilayer networks: a colloquium
temporal networks
activity driven modeling of time varying networks
spread of epidemic disease on networks
nonequilibrium phase transitions in lattice models
influence of infection rate and migration on extinction of disease in spatial epidemics
the evolutionary vaccination dilemma in complex networks
effects of delayed recovery and nonuniform transmission on the spreading of diseases in complex networks
influence of time delay and nonlinear diffusion on herbivore outbreak
epidemic spreading in scale-free networks
percolation critical exponents in scale-free networks
absence of epidemic threshold in scale-free networks with degree correlations
epidemic incidence in correlated complex networks
epidemic spreading in community networks
competing activation mechanisms in epidemics on networks
epidemic thresholds in real networks
epidemics and immunization in scale-free networks
the structure and function of complex networks
the web of human sexual contacts
risk perception in epidemic modeling
nonpharmaceutical interventions implemented by us cities during the 1918-1919 influenza pandemic
public health interventions and epidemic intensity during the 1918 influenza pandemic
alcohol-based instant hand sanitizer use in military settings: a prospective cohort study of army basic trainees
rational epidemics and their public control
integrating behavioural choice into epidemiological models of the aids epidemic
choices, beliefs and infectious disease dynamics
public avoidance and epidemics: insights from an economic model
adaptive human behavior in epidemiological models
game theory of social distancing in response to an epidemic
a mathematical analysis of public avoidance behavior during epidemics using game theory
equilibria of an epidemic game with piecewise linear social distancing cost
modeling and analyzing hiv transmission: the effect of contact patterns
structured mixing: heterogeneous mixing by the definition of activity groups
factors that make an infectious disease outbreak controllable
transmission dynamics and control of severe acute respiratory syndrome
curtailing transmission of severe acute respiratory syndrome within a community and its hospital
the effect of risk perception on the 2009 h1n1 pandemic influenza dynamics
behavior changes in sis std models with selective mixing
infection-age structured epidemic models with behavior change or treatment
coupled contagion dynamics of fear and disease: mathematical and computational explorations
on the existence of a threshold for preventive behavioral responses to suppress epidemic spreading
towards a characterization of behavior-disease models
the impact of information transmission on epidemic outbreaks
modeling and analysis of effects of awareness programs by media on the spread of infectious diseases
coevolution of pathogens and cultural practices: a new look at behavioral heterogeneity in epidemics
spontaneous behavioural changes in response to epidemics
risk perception and effectiveness of uncoordinated behavioral responses in an emerging epidemic
a generalization of the kermack-mckendrick deterministic model
the spread of awareness and its impact on epidemic outbreaks
social contact networks and disease eradicability under voluntary vaccination
the impact of awareness on epidemic spreading in networks
suppression of epidemic spreading in complex networks by local information based behavioral responses
epidemic spreading with information-driven vaccination
intermittent social distancing strategy for epidemic control
peer pressure is a double-edged sword in vaccination dynamics
imitation dynamics of vaccination behaviour on social networks
insight into the so-called spatial reciprocity
impact of committed individuals on vaccination behavior
wisdom of groups promotes cooperation in evolutionary social dilemmas
braess's paradox in epidemic game: better condition results in less payoff
price of anarchy in transportation networks: efficiency and optimality control
effects of behavioral response and vaccination policy on epidemic spreading-an approach based on evolutionary-game dynamics
modeling the interplay between human behavior and the spread of infectious diseases
the impact of imitation on vaccination behavior in social contact networks
a computational approach to characterizing the impact of social influence on individuals' vaccination decision making
risk assessment for infectious disease and its impact on voluntary vaccination behavior in social networks
vaccination and public trust: a model for the dissemination of vaccination behavior with external intervention
assessing vaccination sentiments with online social media: implications for infectious disease dynamics and control
erratic flu vaccination emerges from short-sighted behavior in contact networks
the dynamics of risk perceptions and precautionary behavior in response to 2009 (h1n1) pandemic influenza
optimal interdependence between networks for the evolution of cooperation
social factors in epidemiology
catastrophic cascade of failures in interdependent networks
globally networked risks and how to respond
eigenvector centrality of nodes in multiplex networks
synchronization of interconnected networks: the role of connector nodes
epidemic spreading on interconnected networks
effects of interconnections on epidemics in network of networks
the robustness and restoration of a network of ecological networks
diffusion dynamics on multiplex networks
dynamical interplay between awareness and epidemic spreading in multiplex networks
competing spreading processes on multiplex networks: awareness and epidemics
two-stage effects of awareness cascade on epidemic spreading in multiplex networks
effects of awareness diffusion and self-initiated awareness behavior on epidemic spreading-an approach based on multiplex networks
spontaneous behavioural changes in response to epidemics
coupling infectious diseases, human preventive behavior, and networks-a conceptual framework for epidemic modeling
modeling triple-diffusions of infectious diseases, information, and preventive behaviors through a metropolitan social network: an agent-based simulation
influence of breaking the symmetry between disease transmission and information propagation networks on stepwise decisions concerning vaccination
asymmetrically interacting spreading dynamics on complex layered networks
modelling the influence of human behaviour on the spread of infectious diseases: a review
adaptive coevolutionary networks: a review
exact solution for the time evolution of network rewiring models
epidemic reemergence in adaptive complex networks
temporal networks: slowing down diffusion by long lasting interactions
epidemic dynamics on an adaptive network
robust oscillations in sis epidemics on adaptive networks: coarse graining by automated moment closure
fluctuating epidemics on adaptive networks
adaptive networks: coevolution of disease and topology
infection spreading in a population with evolving contacts
contact switching as a control strategy for epidemic outbreaks
susceptible-infected-recovered epidemics in dynamic contact networks
absence of epidemic thresholds in a growing adaptive network
epidemic spreading in evolving networks
enhanced vaccine control of epidemics in adaptive networks
efficient community-based control strategies in adaptive networks
evolutionary dynamics of time-resolved social interactions
random walks on temporal networks
outcome inelasticity and outcome variability in behavior-incidence models: an example from an sir infection on a dynamic network
exploiting temporal network structures of human interaction to effectively immunize populations
pastor-satorras r. immunization strategies for epidemic processes in time-varying contact networks
the effect of opinion clustering on disease outbreaks
network theory and sars: predicting outbreak diversity
positive network assortativity of influenza vaccination at a high school: implications for outbreak risk and herd immunity
an ongoing multi-state outbreak of measles linked to non-immune anthroposophic communities in austria
measles outbreak in switzerland-an update relevant for the european football championship
dynamics and control of diseases in networks with community structure
spreading of sexually transmitted diseases in heterosexual populations
particle swarm optimization with scale-free interactions
modelling dynamical processes in complex socio-technical systems
modelling disease outbreaks in realistic urban social networks
complex social contagion makes networks more vulnerable to disease outbreaks
traffic-driven epidemic spreading in finite-size scale-free networks
impact of rotavirus vaccination on epidemiological dynamics in england and wales
epidemiological effects of seasonal oscillations in birth rates
dynamic modeling of vaccinating behavior as a function of individual beliefs
social contacts and mixing patterns relevant to the spread of infectious diseases
location-specific patterns of exposure to recent pre-pandemic strains of influenza a in southern china
social contacts and the locations in which they occur as risk factors for influenza infection
the social brain hypothesis
modeling users' activity on twitter networks: validation of dunbar's number
reality mining: sensing complex social systems
inferring friendship network structure by using mobile phone data
a high-resolution human contact network for infectious disease transmission
dynamics of person-to-person interactions from distributed rfid sensor networks
what's in a crowd? analysis of face-to-face behavioral networks
high-resolution measurements of face-to-face contact patterns in a primary school
predictability of conversation partners
towards a temporal network analysis of interactive wifi users
characterizing large-scale population's indoor spatio-temporal interactive behaviors
spatial epidemiology of networked metapopulation: an overview
bursts: the hidden patterns behind everything we do, from your e-mail to bloody crusades. penguin
temporal dynamics and impact of event interactions in cyber-social populations
do peer effects improve hand hygiene adherence among healthcare workers?
analyzing the impact of superspreading using hospital contact networks
temporal networks of face-to-face human interactions
long-standing influenza vaccination policy is in accord with individual self-interest but not with the utilitarian optimum
the influence of altruism on influenza vaccination decisions
using game theory to examine incentives in influenza vaccination behavior
multiple effects of self-protection on the spreading of epidemics
imperfect vaccine aggravates the long-standing dilemma of voluntary vaccination
the roles of altruism, free riding, and bandwagoning in vaccination decisions
free-riding behavior in vaccination decisions: an experimental study
cognitive processes and the decisions of some parents to forego pertussis vaccination for their children
improving public health emergency preparedness through enhanced decision-making environments: a simulation and survey based evaluation
social influence: social norms, conformity and compliance
correlation equations and pair approximations for spatial ecologies
a moment closure model for sexually transmitted disease spread through a concurrent partnership network
we would like to gratefully acknowledge yao yao, yang liu, ke-ke huang, yan zhang, eriko fukuda, dr. wen-bo du, dr. ming tang, dr. hai-feng zhang and prof. zhen jin for their constructive help and discussions, and we also appreciate all the other friends with whom we maintained (and are currently maintaining) interactions and discussions on the topic covered in our report. this work was partly supported by the national natural science foundation of china (grant no. 61374169, 11135001, 11475074) and the natural sciences and engineering research council of canada (nserc; ma and ctb). whenever the classical epidemic spreading processes (sis, sir) take place in homogeneous populations, e.g., when individuals are located on the vertices of a regular random graph, an er random graph, or a complete graph, the qualitative properties of the dynamics can be captured well by a mean-field analysis that assumes unbiased random matching among the individuals. for simplicity, yet without loss of generality, we here derive the mean-field solution for the peer-pressure effect in vaccination dynamics on the er random graph as an example [84]. let x be the fraction of vaccinated individuals and w(x) be the probability that a susceptible individual finally gets infected in a population with vaccine coverage x. after each sir epidemic season, individuals get payoffs: vaccinated (i, p_i = −c), unvaccinated and healthy (j, p_j = 0), and unvaccinated and infected (ς, p_ς = −1). individuals are allowed to modify their vaccination strategies according to eq. (7). whenever an individual from compartment i switches to compartment j or ς, the variable x drops; this loss can be formulated as a term x_-, where, in the spirit of the mean-field treatment, we approximate the fraction of neighbors holding the opposite strategy of a vaccinated individual as 1 − x. the quantity governing the i → j transition is the probability that individuals from compartment i change to compartment j, whose value is determined by eq. (7). accordingly, the gain of x can be written as a corresponding term x_+. taking the two cases into consideration, the derivative of x with respect to time is written as dx/dt = x_+ + x_-. solving for x, we get the equilibrium vaccination level f_v. note that the equilibrium epidemic size is expected to be f_i = w(x), which satisfies a self-consistent final-size equation involving r_0, the basic reproductive number of the epidemic; a sketch of the full set of mean-field equations is given below.
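as a hedged sketch of the mean-field imitation dynamics just described (not a transcription of the review's own eq. (7)), the loss and gain terms can be written with a fermi-type imitation function; the sensitivity parameter κ, the exact pairing of payoff comparisons, and the use of the standard final-size relation for w(x) are assumptions on our part:

```latex
% hedged sketch of the mean-field vaccination imitation dynamics described above.
% \phi is an assumed fermi-type imitation function with sensitivity \kappa;
% payoffs follow the text: p_i = -c (vaccinated), p_j = 0 (unvaccinated, healthy),
% p_\varsigma = -1 (unvaccinated, infected). the final-size relation for w(x) is the
% standard one and may differ from the review's exact self-consistent equation.
\begin{align*}
  \phi(\Delta p) &= \frac{1}{1 + \exp(-\kappa\,\Delta p)}, \\
  x^{-} &= -\,x\,(1-x)\bigl[(1-w(x))\,\phi(p_j - p_i) + w(x)\,\phi(p_\varsigma - p_i)\bigr], \\
  x^{+} &= \;\;\,x\,(1-x)\bigl[(1-w(x))\,\phi(p_i - p_j) + w(x)\,\phi(p_i - p_\varsigma)\bigr], \\
  \frac{dx}{dt} &= x^{+} + x^{-}, \qquad
  w(x) = 1 - \exp\!\bigl[-r_0\,(1-x)\,w(x)\bigr].
\end{align*}
```

setting dx/dt = 0 in this sketch and solving the coupled pair numerically yields an equilibrium coverage and epidemic size of the kind denoted f_v and f_i above.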
pair approximation is a method by which space can be implicitly captured in an ordinary differential equation framework [185, 186]. to illustrate the method, consider the variable [s], defined as the number of susceptible individuals in a population distributed across a network or a lattice. for an sir natural history, the equation of motion for [s] is d[s]/dt = −τ[si], where τ is the per-edge transmission rate and [si] is the number of susceptible-infected pairs. the corresponding equation of motion for [si] contains an infection term proportional to q(i | ss)[ss] and an infection term proportional to q(i | si)[si], where [ss] is the number of susceptible-susceptible pairs in the population, q(i | ss) is the expected number of infected neighbors of a susceptible in a susceptible-susceptible pair, and similarly q(i | si) is the expected number of infected neighbors of the susceptible person in a susceptible-infected pair. the first term corresponds to the creation of new si pairs from ss pairs through infection, while the second term corresponds to the destruction of existing si pairs through infection, thereby creating ii pairs. an assumption must be made in order to close the equations at the pair level, thereby avoiding the need to write down equations of motion for triples. for instance, on a random graph, the approximation q(i | ss) ≈ q(i | si) ≈ q(i | s) might be applied, where q(i | s) is the expected number of infected persons neighboring a susceptible person in the population. equations and pair approximations for the other pair variables [ss] and [ii] must also be written down, after which one has a closed set of differential equations that capture spatial effects implicitly by tracking the time evolution of pair quantities.
key: cord-346309-hveuq2x9 authors: reis, ben y; kohane, isaac s; mandl, kenneth d title: an epidemiological network model for disease outbreak detection date: 2007-06-26 journal: plos med doi: 10.1371/journal.pmed.0040210 sha: doc_id: 346309 cord_uid: hveuq2x9
background: advanced disease-surveillance systems have been deployed worldwide to provide early detection of infectious disease outbreaks and bioterrorist attacks. new methods that improve the overall detection capabilities of these systems can have a broad practical impact. furthermore, most current generation surveillance systems are vulnerable to dramatic and unpredictable shifts in the health-care data that they monitor. these shifts can occur during major public events, such as the olympics, as a result of population surges and public closures. shifts can also occur during epidemics and pandemics as a result of quarantines, the worried-well flooding emergency departments or, conversely, the public staying away from hospitals for fear of nosocomial infection. most surveillance systems are not robust to such shifts in health-care utilization, either because they do not adjust baselines and alert-thresholds to new utilization levels, or because the utilization shifts themselves may trigger an alarm. as a result, public-health crises and major public events threaten to undermine health-surveillance systems at the very times they are needed most. methods and findings: to address this challenge, we introduce a class of epidemiological network models that monitor the relationships among different health-care data streams instead of monitoring the data streams themselves. by extracting the extra information present in the relationships between the data streams, these models have the potential to improve the detection capabilities of a system. furthermore, the models' relational nature has the potential to increase a system's robustness to unpredictable baseline shifts. we implemented these models and evaluated their effectiveness using historical emergency department data from five hospitals in a single metropolitan area, recorded over a period of 4.5 y by the automated epidemiological geotemporal integrated surveillance real-time public health-surveillance system, developed by the children's hospital informatics program at the harvard-mit division of health sciences and technology on behalf of the massachusetts department of public health. we performed experiments with semi-synthetic outbreaks of different magnitudes and simulated baseline shifts of different types and magnitudes. the results show that the network models provide better detection of localized outbreaks, and greater robustness to unpredictable shifts, than a reference time-series modeling approach. conclusions: the integrated network models of epidemiological data streams and their interrelationships have the potential to improve current surveillance efforts, providing better localized outbreak detection under normal circumstances, as well as more robust performance in the face of shifts in health-care utilization during epidemics and major public events.
abbreviations: aegis, automated epidemiological geotemporal integrated surveillance; cusum, cumulative sum; ewma, exponential weighted moving average; sars, severe acute respiratory syndrome. * to whom correspondence should be addressed. e-mail: ben_reis@harvard.edu
understanding and monitoring large-scale disease patterns is critical for planning and directing public-health responses during pandemics [1-5]. in order to address the growing threats of global infectious disease pandemics such as influenza [6], severe acute respiratory syndrome (sars) [7], and bioterrorism [8], advanced disease-surveillance systems have been deployed worldwide to monitor epidemiological data such as hospital visits [9,10], pharmaceutical orders [11], and laboratory tests [12]. improving the overall detection capabilities of these systems can have a wide practical impact.
furthermore, it would be beneficial to reduce the vulnerability of many of these systems to shifts in health-care utilization that can occur during public-health emergencies such as epidemics and pandemics [13] [14] [15] or during major public events [16] . we need to be prepared for the shifts in health-care utilization that often accompany major public events, such as the olympics, caused by population surges or closures of certain areas to the public [16] . first, we need to be prepared for drops in health-care utilization under emergency conditions, including epidemics and pandemics where the public may stay away from hospitals for fear of being infected, as 66.7% reported doing so during the sars epidemic in hong kong [13] . similarly, a detailed study of the greater toronto area found major drops in numerous types of health-care utilization during the sars epidemic, including emergency department visits, physician visits, inpatient and outpatient procedures, and outpatient diagnostic tests [14] . second, the ''worried-well''-those wrongly suspecting that they have been infected-may proceed to flood hospitals, not only stressing the clinical resources, but also dramatically shifting the baseline from its historical pattern, potentially obscuring a real signal [15] . third, public-health interventions such as closures, quarantines, and travel restrictions can cause major changes in health-care utilization patterns. such shifts threaten to undermine disease-surveillance systems at the very times they are needed most. during major public events, the risks and potential costs of bioterrorist attacks and other public-health emergencies increase. during epidemics, as health resources are already stretched, it is important to maintain disease outbreaksurveillance capabilities and situational awareness [4, 5] . at present, many disease-surveillance systems rely either on comparing current counts with historical time-series models, or on identifying sudden increases in utilization (e.g., cumulative sum [cusum] or exponential weighted moving average [ewma] [9, 10] ). these approaches are not robust to major shifts in health-care utilization: systems based on historical time-series models of health-care counts do not adjust their baselines and alert-thresholds to the new unknown utilization levels, while systems based on identifying sudden increases in utilization may be falsely triggered by the utilization shifts themselves. in order to both improve overall detection performance and reduce vulnerability to baseline shifts, we introduce a general class of epidemiological network models that explicitly capture the relationships among epidemiological data streams. in this approach, the surveillance task is transformed from one of monitoring health-care data streams, to one of monitoring the relationships among these data streams: an epidemiological network begins with historical time-series models of the ratios between each possible pair of data streams being monitored. (as described in discussion, it may be desirable to model only a selected subset of these ratios.) these ratios do not remain at a constant value; rather, we assume that these ratios vary in a predictable way according to seasonal and other patterns that can be modeled. the ratios predicted by these historical models are compared with the ratios observed in the actual data in order to determine whether an aberration has occurred. the complete approach is described in detail below. these network models have two primary benefits. 
first, they take advantage of the extra information present in the relationships between the monitored data streams in order to increase overall detection performance. second, their relational nature makes them more robust to the unpredictable shifts described above, as illustrated by the following scenario. the olympics bring a large influx of people into a metropolitan area for 2 wk and cause a broad surge in overall health-care utilization. in the midst of this surge, a localized infectious disease outbreak takes place. the surge in overall utilization falsely triggers the alarms of standard biosurveillance models and thus masks the actual outbreak. on the other hand, since the surge affects multiple data streams similarly, the relationships between the various data streams are not affected as much by the surge. since the network model monitors these relationships, it is able to ignore the surge and thus detect the outbreak. our assumption is that broad utilization shifts would affect multiple data streams in a similar way, and would thus not significantly affect the ratios among these data streams. in order to validate this assumption, we need to study the stability of the ratios around real-world surges. this assessment is difficult to do, since for most planned events, such as the olympics, additional temporary health-care facilities are set up at the site of the event in order to deal with the expected surge. this preparation reduces or eliminates the surge that is recorded by the permanent health-care system, and therefore makes it hard to find data that describe surges. however, some modest shifts do appear in the health-care utilization data, and they are informative. we obtained data on the 2000 sydney summer olympics directly from the centre for epidemiology and research, new south wales department of health, new south wales emergency department data collection. the data show a 5% surge in visits during the olympics. while the magnitude of this shift is far less dramatic than those expected in a disaster, the sydney olympics nonetheless provide an opportunity to measure the stability of the ratios under surge conditions. despite the surge, the relative rates of major syndromic groups remained very stable between the same periods in 1999 and 2000. injury visits accounted for 21.50% of overall visits in 1999, compared with an almost identical 21.53% in 2000. gastroenteritis visits accounted for 5.84% in 1999, compared with 5.75% in 2000. as shown in table 1 , the resulting ratios among the different syndromic groups remained stable. although we would have liked to examine the stability of ratios in the face of a larger surge, we were not able to find a larger surge for which multi-year health-care utilization data were available. it is important to note that while the above data about a planned event are informative, surveillance systems need to be prepared for the much larger surges that would likely accompany unplanned events, such as pandemics, natural disasters, or other unexpected events that cause large shifts in utilization. initial motivation for this work originated as a result of the authors' experience advising the hellenic center for infectious diseases control in advance of the 2004 summer olympics in athens [17] , where there was concern that a population surge caused by the influx of a large number of tourists would significantly alter health-care utilization patterns relative to the baseline levels recorded during the previous summer. 
the epidemiological network model was then formalized in the context of the us centers for disease control and prevention's nationwide biosense health-surveillance system [18], for which the authors are researching improved surveillance methods for integration of inputs from multiple health-care data streams. biosense collects and analyzes health-care utilization data, which have been made anonymous, from a number of national data sources, including the department of defense and the veteran's administration, and is now procuring local emergency department data sources from around the united states. in order to evaluate the practical utility of this approach for surveillance, we constructed epidemiological network models based on real-world historical health-care data and compared their outbreak-detection performance to that of standard historical models. the models were evaluated using semi-synthetic data streams (real background data with injected outbreaks), both under normal conditions and in the presence of different types of baseline shifts. the proposed epidemiological network model is compared with a previously described reference time-series model [19]. both models are used to detect simulated outbreaks introduced into actual historical daily counts for respiratory-related visits, gastrointestinal-related visits, and total visits at five emergency departments in the same metropolitan area. the data cover a period of 1,619 d, or roughly 4.5 y. the first 1,214 d are used to train the models, while the final 405 d are used to test their performance. the data are collected by the automated epidemiological geotemporal integrated surveillance (aegis) real-time public health-surveillance system, developed by the children's hospital informatics program at the harvard-mit division of health sciences and technology on behalf of the massachusetts department of public health. aegis fully automates the monitoring of emergency departments across massachusetts. the system receives automatic updates from the various health-care facilities and performs outbreak detection, alerting, and visualization functions for public-health personnel and clinicians. the aegis system incorporates both temporal and geospatial approaches for outbreak detection. the goal of an epidemiological network model is to model the historical relationships among health-care data streams and to interpret newly observed data in the context of these modeled relationships. in the training phase, we construct time-series models of the ratios between all possible pairs of health-care utilization data streams. these models capture the weekly, seasonal, and long-term variations in these ratios. in the testing phase, the actual observed ratios are compared with the ratios predicted by the historical models. we begin with n health-care data streams, s_i, each describing daily counts of a particular syndrome category at a particular hospital. for this study, we use three syndromic categories (respiratory, gastrointestinal, and total visits) at five hospitals, for a total of n = 15 data streams. all possible pair-wise ratios are calculated among these n data streams, for a total of n^2 − n = 210 ratios, r_ij: for each day, t, we calculate the ratio of the daily counts for stream s_i to the daily counts for stream s_j, r_ij(t) = s_i(t) / s_j(t).
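to make the bookkeeping concrete, the following is a minimal python sketch of forming the n^2 − n pairwise ratio series from n daily-count streams; the function name and the toy data are illustrative, not part of the aegis system:

```python
import numpy as np

def build_ratios(streams):
    """streams: dict mapping stream name -> 1-d array of daily counts.
    returns dict mapping (target, context) -> array of daily ratios r_ij(t)."""
    ratios = {}
    for i, si in streams.items():
        for j, sj in streams.items():
            if i == j:
                continue
            # a small floor guards against division by zero on very quiet days
            ratios[(i, j)] = np.asarray(si, float) / np.maximum(np.asarray(sj, float), 1e-9)
    return ratios

# toy example: 3 streams give 3^2 - 3 = 6 ratio series
toy = {"resp_h1": np.array([12, 15, 11]),
       "gi_h1": np.array([5, 4, 6]),
       "total_h1": np.array([100, 110, 95])}
print(len(build_ratios(toy)))  # -> 6
```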
for each ratio, the numerator s_i is called the target data stream, and the denominator s_j is called the context data stream, since the target data stream is said to be interpreted in the context of the context data stream, as described below. a sample epidemiological network consisting of 30 nodes and 210 edges is shown in figure 1. the nodes in the network represent the data streams: each of the n data streams appears twice, once as a context data stream and another time as a target data stream. edges represent ratios between data streams: namely, the target data stream divided by the context data stream. to train the network, a time-series model, r̂_ij, is fitted for each ratio, r_ij, over the training period using established time-series methods [19]. the data are first smoothed with a 7-d exponential filter (ewma with coefficient 0.5) to reduce the effects of noise [20]. the linear trend is calculated and subtracted out, then the overall mean is calculated and subtracted out, then the day-of-week means (seven values) are calculated and subtracted out, and finally the day-of-year means (365 values) are calculated. in order to generate predictions from this model, these four components are summed, using the appropriate values for day of the week, day of the year, and trend. the difference between each actual ratio, r_ij, and its corresponding modeled prediction, r̂_ij, is the error, e_ij. during network operation, the goal of the network is to determine the extent to which the observed ratios among the data streams differ from the ratios predicted by the historical models. observed ratios, r_ij, are calculated from the observed data, and are compared with the expected ratios to yield the observed errors, e_ij:
e_ij(t) = r_ij(t) − r̂_ij(t) (3)
in order to interpret the magnitudes of these deviations from the expected values, the observed errors are compared with the historical errors from the training phase. a nonparametric approach is used to rank the current error against the historical errors. this rank is divided by the maximum rank (1 + the number of training days), resulting in a value of between 0 and 1, which is the individual aberration score, w_ij. conceptually, each of the individual aberration scores, w_ij, represents the interpretation of the activity of the target data stream, s_i, from the perspective of the activity at the context data stream, s_j: if the observed ratio between these two data streams is exactly as predicted by the historical model, e_ij is equal to 0 and w_ij is equal to a moderate value. if the target data stream is higher than expected, e_ij is positive and w_ij is a higher value closer to 1. if it is lower than expected, e_ij is negative and w_ij is a lower value closer to 0. high aberration scores, w_ij, are represented by thicker edges in the network visualization, as shown in figure 1. some ratios are more unpredictable than others, i.e., they have a greater amount of variability that is not accounted for by the historical model, and thus a greater modeling error. the nonparametric approach to evaluating aberrations adjusts for this variability by interpreting a given aberration in the context of all previous aberrations for that particular ratio during the training period. it is important to note that each individual aberration score, w_ij, can be affected by the activities of both its target and context data streams.
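the training and scoring steps described above can be sketched in python; this is a minimal illustration under our own naming (fit_ratio_model, aberration_score), not the aegis implementation, and it assumes day-of-week and day-of-year indices are supplied alongside each ratio series:

```python
import numpy as np

def fit_ratio_model(r, dow, doy):
    """fit the additive ratio model described above on the training period.
    r: training ratio series; dow: day-of-week index (0-6); doy: day-of-year index (0-364)."""
    r = np.asarray(r, float)
    dow = np.asarray(dow)
    doy = np.asarray(doy)
    # 7-d exponential smoothing (ewma with coefficient 0.5)
    s = r.copy()
    for t in range(1, len(s)):
        s[t] = 0.5 * r[t] + 0.5 * s[t - 1]
    days = np.arange(len(s))
    slope = np.polyfit(days, s, 1)[0]
    resid = s - slope * days                     # subtract linear trend
    overall = resid.mean()
    resid = resid - overall                      # subtract overall mean
    dow_means = np.array([resid[dow == d].mean() for d in range(7)])
    resid = resid - dow_means[dow]               # subtract day-of-week means
    doy_means = np.array([resid[doy == d].mean() if np.any(doy == d) else 0.0
                          for d in range(365)])  # day-of-year means
    predicted = slope * days + overall + dow_means[dow] + doy_means[doy]
    return {"slope": slope, "overall": overall, "dow": dow_means,
            "doy": doy_means, "train_errors": s - predicted}

def aberration_score(e_obs, train_errors):
    """nonparametric score: rank of the observed error among the training errors,
    divided by the maximum rank (1 + number of training days); values near 1 mean
    the target stream is higher than expected relative to its context stream."""
    rank = int(np.sum(np.asarray(train_errors) < e_obs)) + 1
    return rank / (1 + len(train_errors))
```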
consider, for example, that it would be unclear from a single high w_ij score whether the target data stream is unexpectedly high or the context data stream is unexpectedly low. in order to obtain an integrated consensus view of a particular target data stream, s_i, an integrated consensus score, c_i, is created by averaging together all the aberration scores that have s_i as the target data stream (i.e., in the numerator of the ratio). this integrated score represents the collective interpretation of the activity at the target node, from the perspective of all the other nodes: c_i(t) = (1 / (n − 1)) Σ_{j≠i} w_ij(t).
figure 1. each data stream appears twice in the network. the context nodes on the left are used for interpreting the activity of the target nodes on the right. each edge represents the ratio of the target node divided by the context node, with a thicker edge indicating that the ratio is higher than expected. doi:10.1371/journal.pmed.0040210.g001
an alarm is generated whenever c_i is greater than a threshold value c_thresh. as described below, this threshold value is chosen to achieve a desired specificity. the nonparametric nature of the individual aberration scores addresses the potential issue of outliers that would normally arise when taking an average. it is also important to note that while the integrated consensus score helps to reduce the effects of fluctuations in individual context data streams, it is still possible for an extreme drop in one context data stream to trigger a false alarm in a target data stream. this is particularly true in networks having few context data streams. in the case of only one context data stream, a substantial decrease in the count in the context data stream will trigger a false alarm in the target data stream. for comparison, we also implement a reference time-series surveillance approach that models each health-care data stream directly, instead of modeling the relationships between data streams as above. this model uses the same time-series modeling methods described above and previously [19]. first, the daily counts data are smoothed with a 7-d exponential filter. the linear trend is calculated and subtracted out, then the overall mean is calculated and subtracted out, and then the mean for each day of the week (seven values) is calculated and subtracted out. finally, the mean for each day of the year (365 values) is calculated and subtracted out. to generate a prediction, these four components are added together, taking the appropriate values depending on the particular day of the week and day of the year. the difference between the observed daily counts and the counts predicted by the model is the aberration score for that data stream. an alarm is generated whenever this aberration score is greater than a threshold value, chosen to achieve a desired level of specificity, as described below. by employing identical time-series methods for modeling the relationships between the streams in the network approach and modeling the actual data streams themselves in the reference approach, we are able to perform a controlled comparison between the two approaches. following established methods [19-21], we use semi-synthetic localized outbreaks to evaluate the disease-monitoring capabilities of the network. the injected outbreaks used here follow a 7-d lognormal temporal distribution (figure 2), representing the epidemiological distribution of incubation times resulting from a single-source common vehicle infection, as described by sartwell [22].
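a minimal sketch of the integrated consensus score and alarm rule described above; the dictionary bookkeeping and the fixed example threshold are ours (in the study the threshold is calibrated to a desired specificity rather than fixed):

```python
import numpy as np

def consensus_scores(w):
    """w: dict mapping (target, context) -> aberration score in (0, 1] for one day.
    returns dict mapping each target stream to the mean of all scores in which it
    appears as the numerator of the ratio."""
    grouped = {}
    for (target, _context), score in w.items():
        grouped.setdefault(target, []).append(score)
    return {target: float(np.mean(scores)) for target, scores in grouped.items()}

def alarms(c, c_thresh=0.95):
    """flag each target stream whose integrated consensus score exceeds the threshold."""
    return {target: score > c_thresh for target, score in c.items()}

# toy day: the respiratory stream at hospital 1 looks high from every context node's view
w_today = {("resp_h1", "total_h1"): 0.98, ("resp_h1", "gi_h1"): 0.97, ("gi_h1", "total_h1"): 0.41}
print(alarms(consensus_scores(w_today)))
```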
when injecting outbreaks into either respiratory- or gastrointestinal-related data streams, the same number of visits is also added to the appropriate total-visits data stream for that hospital in order to maintain consistency. multiple simulation experiments are performed, varying the number of data streams used in the network, the target data stream, s_i, into which the outbreaks are introduced, and the magnitude of the outbreaks. while many additional outbreak types are possible, the simulated outbreaks used here serve as a paradigmatic set of benchmark stimuli for gauging the relative outbreak-detection performance of the different surveillance approaches. we constructed epidemiological networks from respiratory, gastrointestinal, and total daily visit data from five hospitals in a single metropolitan area, for a total of 15 data streams, s_i (n = 15). in training the network, we modeled all possible pair-wise ratios between the 15 data streams, for a total of 210 ratios. for comparison, we implemented the reference time-series surveillance model described above, which uses the same time-series methods but, instead of modeling the epidemiological relationships, models the 15 data streams directly. semi-synthetic simulated outbreaks were used to evaluate the aberration-detection capabilities of the network, as described above. we simulated outbreaks across a range of magnitudes occurring at any one of the 15 data streams. for the first set of experiments, 486,000 tests were performed: 15 target data streams × 405 d of the testing period × 40 outbreak sizes (with a peak magnitude increase ranging from 2.5% to 100.0%) × two models (network versus reference). for the purposes of systematic comparison between the reference and network models, we allowed for the addition of fractional cases in the simulations. we compared the detection sensitivities of the reference and network models by fixing specificity at a benchmark 95% and measuring the sensitivity of the model. in order to measure sensitivity at a desired specificity, we gradually increased the alarm threshold incrementally from 0 to the maximum value until the desired specificity was reached. we then measured the sensitivity at the same threshold. sensitivity is defined in terms of outbreak-days: the proportion of all days during which outbreaks were occurring such that an alarm was generated. at 95% specificity, the network approach significantly outperformed the reference approach in detecting respiratory and gastrointestinal outbreaks, yielding 4.9% ± 1.9% and 6.0% ± 2.0% absolute increases in sensitivity, respectively (representing 19.1% and 34.1% relative improvements in sensitivity, respectively), for outbreaks characterized by a 37.5% increase on the peak day of the outbreak (table 2). we found this ordering of sensitivities to be consistent over the range of outbreak sizes. for outbreaks introduced into the total-outbreak signals, the reference model achieved 2.1% ± 2% better absolute sensitivity than the network model (2.9% difference in relative sensitivity). this result is likely because the total-visit signals are much larger in absolute terms, and therefore the signal-to-noise ratio is higher (table 3), making it easier for the reference model to detect the outbreaks.
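the calibration of the alarm threshold to a benchmark specificity, as described above, can be sketched as follows; the function name and the grid of candidate thresholds are our own choices:

```python
import numpy as np

def sensitivity_at_specificity(scores, outbreak_days, target_specificity=0.95):
    """scores: daily consensus scores for one target stream; outbreak_days: boolean mask.
    sweep the alarm threshold upward until the desired specificity is reached on
    non-outbreak days, then report sensitivity as the fraction of outbreak-days alarmed."""
    scores = np.asarray(scores, float)
    outbreak_days = np.asarray(outbreak_days, bool)
    non_outbreak = ~outbreak_days
    for thresh in np.linspace(0.0, 1.0, 1001):
        alarm = scores > thresh
        specificity = np.mean(~alarm[non_outbreak])
        if specificity >= target_specificity:
            sensitivity = np.mean(alarm[outbreak_days])
            return thresh, sensitivity
    return 1.0, 0.0  # no threshold reached the target specificity
```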
the ''total outbreak'' experiments were run for reasons of comprehensiveness, but it should be noted that there is no clear epidemiological correlate to an outbreak that affects all syndrome groups, other than a population surge, which the network models are designed to ignore as described in the discussion section. also, an increase in total visits without an increase in respiratory or gastrointestinal visits may correspond to an outbreak in yet another syndrome category. table 2 also shows results for the same experiments at three other practical specificity levels, and an average for all four specificity levels. in all cases, the network approach performs better for respiratory and gastrointestinal outbreaks and the reference model performs better for total-visit outbreaks. by visually inspecting the response of the network model to the outbreaks, it can be seen that while the individual aberration scores exhibited fairly noisy behavior throughout the testing period (figure 3), the integrated consensus scores consolidated the information from the individual aberration scores, reconstructing the simulated outbreaks presented to the system (figure 4). next, we studied the effects of different network compositions on detection performance, constructing networks of different sizes and constituent data streams (figure 5). for each target data stream, we created 77 different homogeneous context networks, i.e., networks containing the target data stream plus between one and five additional data streams of a single syndromic category. in total, 1,155 networks were created and analyzed (15 target data streams × 77 networks). we then introduced into the target data stream of each network simulated outbreaks characterized by a 37.5% increase in daily visit counts over the background counts on the peak day of the outbreak, and calculated the sensitivity obtained from all the networks having particular size and membership characteristics, for a fixed benchmark specificity of 95%. in total, 467,775 tests were performed (1,155 networks × 405 d). we found that detection performance generally increased with network size (figure 6). furthermore, regardless of which data stream contained the outbreaks, total-visit data streams provided the best context for detection. this is consistent with the greater statistical stability of the total-visits data streams, which on average had a far smaller variability (table 3). total data streams were also the easiest target data streams in which to detect outbreaks, followed by respiratory data streams, and then by gastrointestinal data streams. this result is likely because the number of injected cases is a constant proportion of stream size. for a constant number of injected cases, total data streams would likely be the hardest target data streams for detection. next, we systematically compared the performance advantage gained from five key context groups. for a respiratory target signal, the five groups were as follows: (1) total visits at the same hospital; (2) total visits at all other hospitals; (3) gastrointestinal visits at the same hospital; (4) gastrointestinal visits at all other hospitals; and (5) respiratory visits at all other hospitals. if the target signal comprised gastrointestinal or total visits, the five context groups above would be changed accordingly, as detailed in figures 7-9.
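the combinatorics of this context-group comparison, quantified in the next paragraph (31 possible networks per target signal), can be sketched as follows; the group labels are illustrative names, not identifiers used in the study:

```python
from itertools import combinations

# illustrative labels for the five context groups described above (respiratory target)
context_groups = [
    "total_same_hospital",
    "total_other_hospitals",
    "gi_same_hospital",
    "gi_other_hospitals",
    "resp_other_hospitals",
]

# every non-empty subset of the five groups defines a candidate network: 2**5 - 1 = 31
candidate_networks = [set(c) for k in range(1, 6) for c in combinations(context_groups, k)]
print(len(candidate_networks))  # -> 31
```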
given the possibility of either including or excluding each of these five groups, there were 31 (2^5 − 1) possible networks for each target signal. the results of the above analysis are shown for respiratory (figure 7), gastrointestinal (figure 8), and total-visit target signals (figure 9). each row represents a different network construction. rows are ranked by the average sensitivity achieved over the five possible target signals for that table. the following general trends are apparent. total visits at all the other hospitals were the most helpful context group overall. given a context of all the streams from the same hospital, it is beneficial to add total visits from other hospitals, as well as the same syndrome group from the other hospitals. beginning with a context of total visits from the same hospital, there is a slight additional advantage in including a different syndrome group from the same hospital. in order to gauge the performance of the network and reference models in the face of baseline shifts in health-care utilization, we performed a further set of simulation experiments in which, in addition to the simulated outbreaks of peak magnitude 37.5%, we introduced various types and magnitudes of baseline shifts for a period of 200 d in the middle of the 405-d testing period. we compared the performance of the reference time-series model, the complete network model, and a network model containing only total-visit nodes. for respiratory and gastrointestinal outbreaks, we also compared the performance of a two-node network containing only the target data stream and the total-visit data stream from the same hospital. we began by simulating the effects of a large population surge, such as might be seen during a large public event. we did this by introducing a uniform increase across all data streams for 200 d in the middle of the testing period. we found that the detection performance of the reference model degraded rapidly with increasing baseline shifts, while the performance of the various network models remained stable (figure 10). we next simulated the effects of a frightened public staying away from hospitals during an epidemic. we did this by introducing uniform drops across all data streams for 200 d. here too, we found that the detection performance of the reference model degraded rapidly with increasing baseline shifts, while the performance of the various network models remained robust (figure 11). we then simulated the effects of the ''worried-well'' on a surveillance system by introducing targeted increases in only one syndromic category, respiratory or gastrointestinal (figure 12). we compared the performance of the reference model, a full-network model, the two-node networks described above, and a homogeneous network model containing only data streams of the same syndromic category as the target data stream. the performance of the full and homogeneous networks was superior to that of the reference model. the homogeneous networks, consisting solely of respiratory or gastrointestinal data streams, proved robust to the targeted shifts and achieved consistent detection performance even in the face of large shifts. this result is consistent with all the ratios in these networks being affected equally by the targeted baseline shifts. the performance of the full network degraded slightly in the face of larger shifts, while the performance of the two-node network degraded more severely.
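a minimal sketch of how the baseline shifts described above could be injected into semi-synthetic daily-count streams; the function and the multiplicative form of the shift are our own simplification of the experimental setup:

```python
import numpy as np

def inject_baseline_shift(streams, start, length=200, scale=1.25, only=None):
    """return a copy of the daily-count streams with a multiplicative baseline shift.
    scale > 1 mimics a population surge, scale < 1 a frightened public staying away;
    restricting `only` to one syndrome's streams mimics a targeted worried-well surge."""
    shifted = {}
    for name, counts in streams.items():
        counts = np.asarray(counts, float).copy()
        if only is None or name in only:
            counts[start:start + length] *= scale
        shifted[name] = counts
    return shifted

# e.g., a 25% surge across all streams for 200 days starting at day 100:
# surged = inject_baseline_shift(streams, start=100, scale=1.25)
# a targeted respiratory-only surge:
# worried = inject_baseline_shift(streams, start=100, scale=1.25, only={"resp_h1", "resp_h2"})
```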
the two-node network degraded more severely because it did not include relationships that were unaffected by the shifts and that could have helped stabilize performance. it should be noted that this same phenomenon (an increase in one syndromic category across multiple locations) may also be indicative of a widespread outbreak, as discussed further below. in this paper, we describe an epidemiological network model that monitors the relationships between health-care utilization data streams for the purpose of detecting disease outbreaks. results from simulation experiments show that these models deliver improved outbreak-detection performance under normal conditions compared with a standard reference time-series model. furthermore, the network models are far more robust than the reference model to the unpredictable baseline shifts that may occur around epidemics or large public events. the results also show that epidemiological relationships are inherently valuable for surveillance: the activity at one hospital can be better understood by examining it in relation to the activity at other hospitals. in a previous paper [20], we showed the benefits of interpreting epidemiological data in its temporal context, namely, the epidemiological activity on surrounding days [23]. in the present study, we show that it is also beneficial to examine epidemiological data in its network context, i.e., the activity of related epidemiological data streams. based on the results obtained, it is clear that different types of networks are useful for detecting different types of signals. we present eight different classes of signals, their possible interpretations, and the approaches that would be able to detect them. the first four classes of signals involve increases in one or more data streams. (1) a rise in one syndrome group at a single location may correspond to a localized outbreak or simply a data irregularity. such a signal could be detected by all network models as well as the reference model. (2) a rise in all syndrome groups at a single location probably corresponds to a geographical shift in utilization (e.g., a quarantine elsewhere), as an outbreak would not be expected to cause an increase in all syndrome groups. such a signal would be detected by network models that include multiple locations, and by the reference model. (3) a rise in one syndrome group across all locations may correspond to a widespread outbreak or may similarly result from visits by the ''worried-well''. such a signal would be detected by network models that include multiple syndrome groups, and by the reference model. (4) a rise in all syndrome groups in all locations probably corresponds to a population surge, as an outbreak would not be expected to cause an increase in all syndrome groups. this signal would be ignored by all network models, but would be detected by the reference model.
figure 10. simulation of a population surge during a large public event. to simulate a population surge during a large public event, all data streams are increased by a uniform amount (x-axis) for 200 d in the middle of the testing period. full networks, total-visit networks, two-node networks (target data stream and total visits at the same hospital), and reference models are compared. average results are shown for each target data stream type. error bars are standard errors. doi:10.1371/journal.pmed.0040210.g010
figure 11. simulation of a frightened public staying away from hospitals during a pandemic. to simulate a frightened public staying away from hospitals during a pandemic, all data streams are dropped by a uniform amount (x-axis) for 200 d in the middle of the testing period. full networks, total-visit networks, two-node networks (target data stream and total visits at the same hospital), and reference models are compared. average results are shown for each target data stream type. error bars are standard errors. doi:10.1371/journal.pmed.0040210.g011
the next four classes of signals involve decreases in one or more data streams. all of these signals are unlikely to be indicative of an outbreak, but are important for maintaining situational awareness in certain critical situations. as mentioned above, a significant decrease in a context data stream has the potential to trigger a false alarm in the target data stream, especially in networks with few context nodes. this is particularly true in two-node networks, where there is only one context data stream. (5) a fall in one syndrome group at a single location does not have an obvious interpretation. all models will ignore such a signal, since they are set to alarm on increases only. (6) a fall in all syndrome groups at a single location could represent a geographical shift in utilization (e.g., a local quarantine). all models will ignore such a signal. the baselines of all models will be affected, except for network models that include only nodes from single locations. (7) a fall in one syndrome group at all locations may represent a frightened public. all models will ignore such a signal. the baselines of all models will be affected, except for network models that include only nodes from single syndromic groups. (8) a fall in all data types at all locations may represent a regional population decrease or a frightened public staying away from hospitals out of concern for nosocomial infection (e.g., during an influenza pandemic). all models will ignore such a signal. the baseline of only the reference model will be affected. from this overview, it is clear that the network models are more robust than the reference model, with fewer false alarms (in scenarios 2 and 4) and less vulnerability to irregularities in baselines (in scenarios 6-8). based on the results obtained, when constructing epidemiological networks for monitoring a particular epidemiological data stream, we recommend prioritizing the inclusion of total visits from all other hospitals, followed by total visits from the same hospital, followed by data streams of the same syndrome group from other hospitals and streams of different syndrome groups from the same hospital, followed by data streams of different syndrome groups from different hospitals. we further recommend that, in addition to full-network models, homogeneous network models (e.g., only respiratory nodes from multiple hospitals) be maintained for greater stability in the face of major targeted shifts in health-care utilization. the two-node networks described above are similar in certain ways to the ''rate''-based approach used by a small number of surveillance systems today [24-27]. instead of monitoring daily counts directly, these systems monitor daily counts as a proportion of the total counts. for example, the respiratory-related visits at a certain hospital could be tracked as a percentage of the total number of visits to that hospital, or alternatively, as a percentage of the total number of respiratory visits in the region.
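for concreteness, the ''rate''-based quantity described above can be written as a single ratio, which is exactly the edge a two-node network monitors; the function name and the toy stream names are illustrative:

```python
import numpy as np

def rate_based_series(target_counts, total_counts):
    """fraction of a hospital's total visits accounted for by one syndrome group."""
    target = np.asarray(target_counts, float)
    total = np.asarray(total_counts, float)
    return target / np.maximum(total, 1e-9)

# resp_h1 / total_h1 is the single ratio a two-node network would track;
# the full network additionally monitors resp_h1 against every other data stream.
```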
these ''rate''-based approaches have been proposed where absolute daily counts are too unstable for modeling [24], or where population-at-risk numbers are not available for use in spatiotemporal scan statistics [25]. the approach presented here is fundamentally different in that it explicitly models and tracks all possible inter-data stream relationships, not just those between a particular data stream and its corresponding total-visits data stream. furthermore, the present approach is motivated by the desire to increase robustness in the face of large shifts in health-care utilization that may occur during epidemics or major public events. as such, this study includes a systematic examination of the models' responses to different magnitudes of both broad and targeted baseline shifts. the two-node networks described above are an example of this general class of ''rate''-based models. while the two-node approach works well under normal conditions, it is not as robust to targeted shifts in health-care utilization as larger network models. the results therefore show that there is value in modeling all, or a selected combination of, the relationships among health-care data streams, not just the relationship between a data stream and its corresponding total-visits data stream. modeling all these relationships involves an order-n expansion of the number of models maintained internally by the system: n^2 − n models are used to monitor n data streams. the additional information inherent in this larger space is extracted to improve detection performance, after which the individual model outputs are collapsed back to form the n integrated outputs of the system. since the number of models grows quadratically with the number of data streams, n, the method can become computationally intensive for large numbers of streams. in such a case, the number of models could be minimized by, for example, constructing only networks that include nodes from different syndrome groups but from the same hospital, or alternatively, including all context nodes from the same hospital and only total-visit nodes from other hospitals.
figure 12. simulation of the effects of the worried-well flooding hospitals during a pandemic. to simulate the effects of the worried-well flooding hospitals during a pandemic, a targeted rise is introduced in only one type of data stream. full networks, respiratory- or gastrointestinal-only networks, two-node networks, and reference models are compared. error bars are standard errors. doi:10.1371/journal.pmed.0040210.g012
this work is different from other recent epidemiological research that has described simulated contact networks of individual people moving about in a regional environment and transmitting infectious diseases from one person to another. these simulations model the rate of spread of an infection under various conditions and interventions and help prepare for emergency scenarios by evaluating different health policies. on the other hand, we studied relational networks of hospitals monitoring health-care utilization in a regional environment, for the purpose of detecting localized outbreaks in a timely fashion and maintaining situational awareness under various conditions. our work is also focused on generating an integrated network view of an entire health-care environment. limitations of this study include the use of simulated infectious disease outbreaks and baseline shifts.
we use a realistic outbreak shape and baseline shift pattern, and perform simulation experiments varying the magnitudes of both of these. while other outbreak shapes and baseline shift patterns are possible, this approach allows us to create a paradigmatic set of conditions for evaluating the relative outbreak-detection performance of the various approaches [21] . another possible limitation is that even though our findings are based on data across multiple disease categories (syndromes), multiple hospitals, and multiple years, relationships between epidemiological data streams may be different in other data environments. also, our methods are focused on temporal modeling, and therefore do not have an explicit geospatial representation of patient location, even though grouping the data by hospital does preserve a certain degree of geospatial information. the specific temporal modeling approach used requires a solid base of historical data for the training set. however, this modeling approach is not integral to the network strategy, and one could build an operational network by using other temporal modeling approaches. furthermore, as advanced disease-surveillance systems grow to monitor an increasing number of data streams, the risk of information overload increases. to address this problem, attempts to integrate information from multiple data streams have largely focused on detecting the multiple effects of a single outbreak across many data streams [28] [29] [30] [31] . the approach described here is fundamentally different in that it focuses on detecting outbreaks in one data stream by monitoring fluctuations in its relationships to the other data streams, although it can also be used for detecting outbreaks that affect multiple data streams. we recommend using the network approaches described here alongside current approaches to realize the complementary benefits of both. these findings suggest areas for future investigation. there are inherent time lags among epidemiological data streams: for example, pediatric data have been found to lead adult data in respiratory visits [32] . while the approach described here may implicitly model these relative time lags, future approaches can include explicit modeling of relative temporal relationships among data streams. it is also possible to develop this method further to track outbreaks in multiple hospitals and syndrome groups. it is further possible to study the effects on timeliness of detection of different network approaches. also, while we show the utility of the network approach for monitoring disease patterns on a regional basis, networks constructed from national or global data may help reveal important trends at wider scales. editors' summary background. the main task of public-health officials is to promote health in communities around the world. to do this, they need to monitor human health continually, so that any outbreaks (epidemics) of infectious diseases (particularly global epidemics or pandemics) or any bioterrorist attacks can be detected and dealt with quickly. in recent years, advanced disease-surveillance systems have been introduced that analyze data on hospital visits, purchases of drugs, and the use of laboratory tests to look for tell-tale signs of disease outbreaks. these surveillance systems work by comparing current data on the use of health-care resources with historical data or by identifying sudden increases in the use of these resources. 
so, for example, more doctors asking for tests for salmonella than in the past might presage an outbreak of food poisoning, and a sudden rise in people buying overthe-counter flu remedies might indicate the start of an influenza pandemic. why was this study done? existing disease-surveillance systems don't always detect disease outbreaks, particularly in situations where there are shifts in the baseline patterns of health-care use. for example, during an epidemic, people might stay away from hospitals because of the fear of becoming infected, whereas after a suspected bioterrorist attack with an infectious agent, hospitals might be flooded with ''worried well'' (healthy people who think they have been exposed to the agent). baseline shifts like these might prevent the detection of increased illness caused by the epidemic or the bioterrorist attack. localized population surges associated with major public events (for example, the olympics) are also likely to reduce the ability of existing surveillance systems to detect infectious disease outbreaks. in this study, the researchers developed a new class of surveillance systems called ''epidemiological network models.'' these systems aim to improve the detection of disease outbreaks by monitoring fluctuations in the relationships between information detailing the use of various health-care resources over time (data streams). what did the researchers do and find? the researchers used data collected over a 3-y period from five boston hospitals on visits for respiratory (breathing) problems and for gastrointestinal (stomach and gut) problems, and on total visits (15 data streams in total), to construct a network model that included all the possible pair-wise comparisons between the data streams. they tested this model by comparing its ability to detect simulated disease outbreaks implanted into data collected over an additional year with that of a reference model based on individual data streams. the network approach, they report, was better at detecting localized outbreaks of respiratory and gastrointestinal disease than the reference approach. to investigate how well the network model dealt with baseline shifts in the use of health-care resources, the researchers then added in a large population surge. the detection performance of the reference model decreased in this test, but the performance of the complete network model and of models that included relationships between only some of the data streams remained stable. finally, the researchers tested what would happen in a situation where there were large numbers of ''worried well.'' again, the network models detected disease outbreaks consistently better than the reference model. what do these findings mean? these findings suggest that epidemiological network systems that monitor the relationships between health-care resource-utilization data streams might detect disease outbreaks better than current systems under normal conditions and might be less affected by unpredictable shifts in the baseline data. however, because the tests of the new class of surveillance system reported here used simulated infectious disease outbreaks and baseline shifts, the network models may behave differently in real-life situations or if built using data from other hospitals. 
nevertheless, these findings strongly suggest that public-health officials, provided they have sufficient computer power at their disposal, might improve their ability to detect disease outbreaks by using epidemiological network systems alongside their current disease-surveillance systems. additional information. please access these web sites via the online version of this summary at http://dx.doi.org/10.1371/journal.pmed. 0040210. wikipedia pages on public health (note that wikipedia is a free online encyclopedia that anyone can edit, and is available in several languages) a brief description from the world health organization of public-health surveillance (in english, french, spanish, russian, arabic, and chinese) a detailed report from the us centers for disease control and prevention called ''framework for evaluating public health surveillance systems for the early detection of outbreaks'' the international society for disease surveillance web site containing pandemic influenza at the source strategies for containing an emerging influenza pandemic in southeast asia public health vaccination policies for containing an anthrax outbreak world health organization writing group (2006) nonpharmaceutical interventions for pandemic influenza, national and community measures transmissibility of 1918 pandemic influenza syndromic surveillance for influenzalike illness in ambulatory care network sars surveillance during emergency public health response planning for smallpox outbreaks systematic review: surveillance systems for early detection of bioterrorismrelated diseases implementing syndromic surveillance: a practical guide informed by the early experience national retail data monitor for public health surveillance using laboratory-based surveillance data for prevention: an algorithm for detecting salmonella outbreaks sars-related perceptions in hong kong utilization of ontario's health system during the 2003 sars outbreak. toronto: institute for clinical and evaluative sciences pandemic influenza preparedness and mitigation in refugee and displaced populations. who guidelines for humanitarian agencies medical care delivery at the 1996 olympic games algorithm for statistical detection of peaks-syndromic surveillance system for the athens biosense: implementation of a national early event detection and situational awareness system time series modeling for syndromic surveillance using temporal context to improve biosurveillance measuring outbreak-detection performance by using controlled feature set simulations the distribution of incubation periods of infectious disease harvard team suggests route to better bioterror alerts can syndromic surveillance data detect local outbreaks of communicable disease? 
a model using a historical cryptosporidiosis outbreak a space-time permutation scan statistic for disease outbreak detection syndromic surveillance in public health practice: the new york city emergency department system monitoring over-the-counter pharmacy sales for early outbreak detection in new york city algorithms for rapid outbreak detection: a research synthesis integrating syndromic surveillance data across multiple locations: effects on outbreak detection performance public health monitoring tools for multiple data streams bivariate method for spatio-temporal syndromic surveillance identifying pediatric age groups for influenza vaccination using a real-time regional surveillance system the authors thank john brownstein of harvard medical school for helpful comments on the manuscript.author contributions. byr, kdm, and isk wrote the paper and analyzed and interpreted the data. byr and kdm designed the study, byr performed experiments, kdm and byr collected data, and isk suggested particular methods to be used in the data analysis. key: cord-018054-w863h0d3 authors: mirchev, miroslav; kocarev, ljupco title: non-poisson processes of email virus propagation date: 2010 journal: ict innovations 2009 doi: 10.1007/978-3-642-10781-8_20 sha: doc_id: 18054 cord_uid: w863h0d3 email viruses are one of the main security problems in the internet. in order to stop a computer virus outbreak, we need to understand email interactions between individuals. most of the spreading models assume that users interact uniformly in time following a poisson process, but recent measurements have shown that the intercontact time follows heavy-tailed distribution. the non-poisson nature of contact dynamics results in prevalence decay times significantly larger than predicted by standard poisson process based models. email viruses spread over a logical network defined by email address books. the topology of this network plays important role in the spreading dynamics. recent observations suggest that node degrees in email networks are heavy-tailed distributed and can be modeled as power law network. we propose an email virus propagation model that considers both heavy-tailed intercontact time distribution, and heavy-tailed topology of email networks. the concept of a computer virus is relatively old in the young and expanding field of information security. it was first developed by cohen in [1, 2] , and it still is an active research area. computer viruses still accounts for a significant share of the financial losses that large organizations suffer for computer security problems, and it is expected that future viruses will be even more hostile. according to the wildlist organization international [3] there were 70 widespread computer viruses in july 1993, and that number have increased up to 953 in july 2009 (fig. 1) . with the proliferation of broadband ''always on'' connections, file downloads, instant messaging, bluetooth-enabled mobile devices, and other communications technologies, the mechanisms used by viruses to spread have evolved as well [4, 5] . still, many viruses continue to spread through email. indeed, according to the virus bulletin [6] , the email viruses (email worms) still accounts for large share of the virus prevalence today. email viruses spread via infected email messages. the virus may be in an email attachment or the email may contain a link to an infected website. 
in the first case the virus will be activated when the user clicks on the attachment and in the second case when the user clicks on the link leading to the infected site. this is a cooperative listing of viruses reported as being in the wild by virus information professionals. the list includes viruses reported by multiple participants, which appear to be nonregional in nature. the wildlist is currently being used as the basis for in-the-wild virus testing and certification of anti-virus products by the icsa, virus bulletin and secure computing. when an email virus infects a machine, it sends an infected email to all addresses in the computer's email address book. this self-broadcast mechanism allows for the virus's rapid reproduction and spread, explaining why email viruses continue to be one of the main security threats. while some email viruses used only email to propagate (e.g. melissa), most email viruses can also use other mechanisms to propagate in order to increase their spreading speed (e.g. w32/sircam, love letter). although virus spreading through email is an old technique, it is still effective and is widely used by current viruses. it is attractive to virus writers, because it doesn't require any security holes in computer operating systems or software, almost everyone uses email, many users have little knowledge of email viruses and trust most email they receive (especially email from friends) and email is private property so correspondent laws or policies are required to permit checking email content. email viruses usually spread by connecting to smtp servers using a library coded into the virus or by using local email client services. viruses collect email addresses from victim computers, in order to spread further, by: scanning the local address book, scanning files with appropriate extensions for email address and sending copies of itself to all mail in the user's mailbox. some viruses even construct new email addresses with common domain names. in order to eradicate viruses, as well as to control and limit the impact of an outbreak, we need to have a detailed and quantitative understanding of the spreading dynamics and environment. in most email virus models have been assumed that the contact process between individuals follows poisson statistics, and, the time between two consecutive contacts is predicted to follow an exponential distribution [7] [8] [9] [10] [11] [12] [13] . therefore, reports of new infections should decay exponentially with a decay time of about a day, or at most a few days [7] [8] [9] [10] [11] . in contrast, prevalence records indicate that new infections are still reported years after the release of antiviruses [4, 7, 14] , and their decay time is in the vicinity of years, 2-3 orders of magnitude larger than the poisson process predicted decay times. this discrepancy is rooted in the failure of the poisson approximation for the interevent time distribution. indeed, recent studies of email exchange records between have shown that the probability density function of the time interval between two consecutive emails sent by the same user is well approximated by a fat tailed distribution [15] [16] [17] [18] [19] . in [20] the authors prove that this deviation from the poisson process has a strong impact on the email virus's spread, offering a coherent explanation of the anomalously long prevalence times observed for email viruses. 
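the contrast between the poisson assumption and the measured fat-tailed behaviour can be illustrated numerically: the sketch below draws interevent times from an exponential distribution and from a pareto distribution used as a stand-in for the reported heavy tail, then compares how much probability mass lies far beyond the mean. the tail exponent and scale are illustrative and are not fitted to any of the datasets cited above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
mean_gap = 10.0                      # arbitrary time units

# poisson process: exponential interevent times
exp_gaps = rng.exponential(scale=mean_gap, size=n)

# heavy-tailed alternative: classical pareto with pdf tail exponent ~ 2.4
alpha = 2.4
pareto_gaps = rng.pareto(alpha - 1.0, size=n) + 1.0   # support [1, inf)

# compare how much probability mass sits far in the tail
for name, gaps in [("exponential", exp_gaps), ("pareto", pareto_gaps)]:
    gaps = gaps / gaps.mean() * mean_gap               # rescale to equal means
    print(name, "P(gap > 10x mean) =", np.mean(gaps > 10 * mean_gap))
```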
the email network is determined by users' email address books, and its topology plays important role in the spreading dynamics. in [21] the authors use yahoo email groups to study the email network topology. although the topology of email groups is not the complete email network topology, they use it to figure out what the topology might be like. their findings suggest that the email groups are heavy-tailed distributed, so it is reasonable to believe that email network is also heavy-tailed distributed. the problem of virus spreading in networks with heavy-tailed distribution has been studied in [7, 10, 21] . an epidemic threshold is a critical state beyond which infections become endemic. in [22, 23] , the authors have presented a model that predicts the epidemic threshold of a network with a single parameter, namely, the largest eigenvalue of the adjacency matrix of the network. in this paper, we propose an email virus propagation model with nonlinear dynamical system, which considers both heavy-tailed intercontact time distribution and heavy-tailed topology of email networks. we use this model to reveal new form of the epidemic threshold condition. the rest of the paper is organized as follows. in section 2, we define the network model, and analyze the email network topology and communication patterns. after that in section 3, we propose a discrete stochastic model for non-poisson virus propagation in email networks with power law topology and have-tail distributed interevent times. simulation results and analyses are given in section 4 and section 5 concludes the paper. let g = (v, e) be a connected, undirected graph with n nodes, which represent the email users, and m edges, which represent the contacts between the users. every user has an address book in which he has all the users he contacts with. these address books are represented with the adjacency matrix a of the graph g, i.e., a ij = 1 if (i, j) ∈ e (user i have user j in his address book) and a ij = 0 otherwise. at time k, each node i can be in one of two possible states: s (susceptible) or i (infected). the state of the node is indicated by a status vector which contains a single 1 in the position corresponding to the present status, and 0 in the other position: and let be the probability mass function of node i at time k. for every node i it states the probability of being in each of the possible states at time k. the network topology is determined by the adjacency matrix a, i.e. by the users' email address books. the size of a user's email address book is the degree of the corresponding node in the network graph. since email address books are private property, it is hard to find data to tell us what the exact email topology is like. we use the enron email dataset, described in [24] and available at [25] , to study the email network topology. this set of email messages was made public during the legal investigation concerning the enron corporation. it is the only publicly available email dataset and consists of 158 users (mostly senior management) and 200,399 messages (from which 9728 are between employees). the dataset contains messages from a period of almost three years. on fig. 2 the degree distribution of the users' address books from this dataset is shown. we see that the power law p(k) ~ k -3.5 approximates well a substantial part of the users' degree distribution, but fails to approximate well for small degree values. 
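a minimal data representation for the model just defined (an undirected contact graph stored as an adjacency matrix, a two-state susceptible/infected status per node, and an empirical degree distribution of the kind examined in fig. 2) might look like the following sketch; the encoding choices are ours and not code from the paper.

```python
import numpy as np
from collections import Counter

SUSCEPTIBLE, INFECTED = 0, 1

def build_adjacency(n_nodes, edges):
    """adjacency matrix a, with a[i, j] = 1 if user j is in user i's address book."""
    a = np.zeros((n_nodes, n_nodes), dtype=int)
    for i, j in edges:
        a[i, j] = a[j, i] = 1          # undirected contact graph
    return a

def degree_distribution(a):
    """empirical p(k): fraction of nodes with each address-book size k."""
    degrees = a.sum(axis=1)
    counts = Counter(int(k) for k in degrees)
    n = a.shape[0]
    return {k: c / n for k, c in sorted(counts.items())}

# toy usage
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]
a = build_adjacency(5, edges)
state = np.full(5, SUSCEPTIBLE)
state[0] = INFECTED                     # one initially infected user
print(degree_distribution(a))
```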
this is mostly due to the fact that the number of users in the dataset is small, but nevertheless it gives us an insight into the real email network topology. because of this degree distribution, and the findings from [21] , it is best if we model the email network as a power law network. the contact dynamics responsible for the spread of email viruses is driven by the email communication and the usage patterns of individuals. to characterize these patterns we also use the enron email dataset. we use only the messages between the employees (9728 messages), and it is sufficient for accurate analysis. let τ (interevent time) denote the time between two consecutive emails sent by a single user. the distribution of the aggregate interevent of all the users approximately follows a power law with exponent α ≈ 2.4 and a cut-off at large τ values (fig. 3) . the spreading dynamics is jointly determined by the email activity patterns and the topology of the corresponding email communication network. we propose a discrete stochastic model for virus propagation in email network with power law topology and communication pattern with heavy-tailed interevent time distribution. the barabasi-albert model [26] is used for generating email networks with power law topology, which is one of several proposed models that generate power law networks. the model is using a preferential attachment mechanism and generates network which has degree distribution with the power law form p(k) ~ k -3 . in order to compare power law networks against random networks, we use the erdos-renyi model [27] for generating random networks. in this model, a graph g(n, p) is constructed by connecting n nodes randomly. each edge is included in the graph with probability p, with the presence or absence of any two distinct edges in the graph being independent. when an email user have received message with a virus attachment by some of his contacts, he may discard the message (if he suspects the email or detects the email virus by using anti-virus software) or open the virus attachment if unaware of it. when the virus attachment is opened, the virus immediately infects the user and sends out virus email to all email addresses on this user's email address book. different users open virus attachments with different probabilities, depending on their computer security knowledge. we assume that the probability that an email user opens the infected attachment, after he has received some infected message is constant and denote it with β. the infected user will not send out virus email again unless the user receives another copy of the email virus and opens the attachment again. it takes time before a recipient receives a virus email sent out by an infected user, but the email transmission time is usually much smaller comparing to a user's email checking time. thus in our model we ignore the email transmission time. in most cases received emails are responded to in the next email activity burst [15, 17] , and viruses are acting when emails are read, approximately the same time when the next bunch of emails are written. according to this email users' activity can be represented as follows. let b j (k) represent users' j activity at time k. if user j is active at time k b j (k) = 1, otherwise b j (k) = 0. we assume that a user reads all his emails at the moment he is active. we model email users activity by using chaotic-maps. this method is used in [28, 29] for modeling packet traffic. 
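before the activity map itself is specified below, the two topologies used in the simulations can be generated with standard graph generators; the sketch uses networkx's barabasi-albert and erdos-renyi models with sizes matching the 1000 nodes and roughly 3000 links used later, and also computes the largest adjacency eigenvalue that appears in the threshold discussion. the exact chaotic map driving user activity is not reproduced in the text, so it is not re-implemented here; β and δ are set to the simulation values quoted below.

```python
import networkx as nx
import numpy as np

N_USERS = 1000
TARGET_EDGES = 3000

# power-law topology: barabasi-albert with m = 3 gives roughly 3 * n edges
ba_graph = nx.barabasi_albert_graph(N_USERS, 3, seed=42)

# random topology: erdos-renyi with matching expected edge count
p = 2 * TARGET_EDGES / (N_USERS * (N_USERS - 1))
er_graph = nx.erdos_renyi_graph(N_USERS, p, seed=42)

# adjacency matrices and model parameters (simulation values from the text)
A_ba = nx.to_numpy_array(ba_graph)
A_er = nx.to_numpy_array(er_graph)
beta, delta = 0.5, 0.0      # attachment-opening and curing probabilities

# largest eigenvalue of each adjacency matrix, used in the threshold analysis
lam_ba = np.max(np.linalg.eigvalsh(A_ba))
lam_er = np.max(np.linalg.eigvalsh(A_er))
print(ba_graph.number_of_edges(), er_graph.number_of_edges(), lam_ba, lam_er)
```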
the following map is convenient for our purposes: and d∈ [0, 1]. at each time k, the value of x j (k) is evaluated for each user j, and then: we choose this chaotic map, because for values of m 1 and/or m 2 in the range (3/2, 2) the map generates interevent times that have heavy tailed distribution. more precisely for d=0.7, m 1 = 1.53 and m 2 = 1.96 the distribution approximately follows a power law with exponent α ≈ 2.4 and a cut-off at large τ values, very similar to the true interevent time distribution (this can be achieved with other values also). at the beginning (k = 0) there is a small number of initially infected users. let v(k) denote the infected inbox matrix, where v ij (k) = 1, if user j have unread infected email message from user i at time k, and otherwise v ij (k) = 0. at time k = 0, v ij (0) = 1, if user j have initially infected user i in his address book, and otherwise v ij (0) = 0. at each time k: previously we assumed that a user reads all his emails at the moment he is active. so if user j is active at time k (b j (k) = 1), all the messages from the infected inbox matrix v should be removed, v ij (k+1) = 0 for all i. we introduce another parameter δ, which represents the curing probability. after some user gets infected, he may use some means (such as virus removal tool) to remove the virus from his computer. as with β we assume constant curing probability among users. having defined all this, the equations describing the evolution of our email virus propagation model are: (9) where multirealize[.] performs a random realization for the probability distribution given with ) 1 ( + k t i p , and: for our simulations, we use email networks with 1000 nodes representing the email users and 3000 links representing the users' address books. first, we compare the spreading of email viruses in power law and random (erdos-renyi) network, by using both poisson process approximation and true interevent distribution. for this simulation we use δ = 0, because we are interested in the spreading dynamics, i.e. the number of new infections, instead of the total number of infected users. the other parameter values are d=0.7, m 1 = 1.53, m 2 = 1.96 and β = 0.5. from fig. 4 we see that the spreading process in the power law email network evolves more rapidly than in random network, i.e. the number of new infections at the beginning is higher. if we compare the different interevent distributions, we see that the poisson process approximation evolves much faster and the spreading process ends in one order of magnitude faster than in the true interevent time distribution. the number of new infections in power law networks, after the initial period, slightly deviates from exponential decay, while in random networks the decay is clearly exponential. predicting the epidemic threshold condition is an important part of a virus propagation model. in [22, 23] the authors predict the epidemic threshold with a single parameter λ 1,a , the largest eigenvalue of the adjacency matrix a of the network. they prove that if an epidemic dies out, then it is necessarily true that: after the initial period, the lines correspond to an exponential decay predicted by the poisson process approximation (dash lines) and the true interevent distribution (solid lines). the epidemic threshold in power law networks is zero [22] , so we make the epidemic threshold analysis on random networks. we analyze the dependencies between the parameters, β, δ, λ 1,a and d at their threshold values, i.e. 
the values for which the system moves from a state where the virus prevails, to a state where the virus diminishes). the parameter d captures the characteristics of the communication pattern. we see (fig. 5 ) that as in [22, 23] β and δ have linear dependency with λ 1,a , while the threshold value of d exponentially increases, as λ 1,a increases. according to this, the epidemic threshold condition would have the form given in (12) , which captures the essence of both network topology and communication patterns. in this paper we analyzed the email network topology and the email communication patterns. we proposed a model for virus propagation in email network with power law topology and communication pattern with heavy-tailed interevent time distribution. the analysis showed that the prevalence time for true interevent time distribution is much longer than predicted by standard poisson based models, which is coincident with real data. although the number of new infections exponentially decays in random networks, for email networks it slightly deviates from straight exponential decay. the epidemic threshold analysis has revealed a new form of the condition under which an epidemic diminishes, which captures the essence of both network topology and communication patterns. this form will be further analyzed. computer viruses computer viruses -theory and experiments understanding the spreading patterns of mobile phone viruses wifi networks and malware epidemiology epidemic spreading in scale-free networks network theory and sars: predicting outbreak diversity epidemic outbreaks in complex heterogeneous networks velocity and hierarchical spread of epidemic outbreaks in scale-free networks complex networks: structure and dynamics dynamics of rumor spreading in complex networks theory of rumor spreading in complex social networks evolution and structure of the internet: a statistical physics approach entropy of dialogues creates coherent structures in email traffic probing human response times modeling bursts and heavy tails in human dynamics impact of memory on human dynamics exact results for the barabási model of human dynamics impact of non-poissonian activity patterns on spreading processes email virus propagation modeling and analysis epidemic spreading in real networks: an eigenvalue viewpoint epidemic thresholds in real networks introducing the enron corpus emergence of scaling in random networks the evolution of random graphs self-similar traffic and network dynamics an application of deterministic chaotic maps to model packet traffic key: cord-262100-z6uv32a0 authors: wang, yuanyuan; hu, zhishan; feng, yi; wilson, amanda; chen, runsen title: changes in network centrality of psychopathology symptoms between the covid-19 outbreak and after peak date: 2020-09-14 journal: mol psychiatry doi: 10.1038/s41380-020-00881-6 sha: doc_id: 262100 cord_uid: z6uv32a0 the current study investigated the mechanism and changes in psychopathology symptoms throughout the covid-19 outbreak and after peak. two studies were conducted separately in china during outbreak and the after peak stages, with 2540 participants were recruited from february 6 to 16, 2020, and 2543 participants were recruited from april 25 to may 5, 2020. the network models were created to explore the relationship between psychopathology symptoms both within and across anxiety and depression, with anxiety measured by the generalized anxiety disorder-7 and depression measured by the patient health questionnaire-9. 
symptom network analysis was conducted to evaluate network and bridge centrality, and the network properties were compared between the outbreak and after peak. noticeably, psychomotor symptoms such as impaired motor skills, restlessness, and inability to relax exhibited high centrality during the outbreak, which still relatively high but showed substantial remission during after peak stage (in terms of strength, betweenness, or bridge centrality). meanwhile, symptoms of irritability (strength, betweenness, or bridge centrality) and loss of energy (bridge centrality) played an important role in the network after the peak of the pandemic. this study provides novel insights into the changes in central features during the different covid-19 stages and highlights motor-related symptoms as bridge symptoms, which could activate the connection between anxiety and depression. the results revealed that restrictions on movement were associated with worsen in psychomotor symptoms, indicating that future psychological interventions should target motor-related symptoms as priority. the covid-19 pandemic has caused substantial threats to people's physical health and lives, as well as triggered psychological distresses such as anxiety and depression [1] . unlike previous infections, worldwide mass media reports have highlighted the unique threat of covid-19, increasing people's psychological distress and panic [2] . covid-19 is considered highly contagious and currently there is no targeted medical treatment available, instead reducing exposure to the virus is considered to be the best prevention strategy [2] . however, the negative effects of covid-19 on mental health could be exacerbated by prevention-related measures, such as social distancing and isolation, resulting in a continued fear and panic toward the virus [3] . therefore, timely mental health care has been required during this pandemic [4] . in order to provide the general public with appropriate mental health care, researchers have made an urgent call for guidance and practical evidence to inform the creation of both health and psychological interventions [5] . a number of recent studies have focused on mental health problems during covid-19, with the most frequently reported symptoms being depression and anxiety aspects [1, 6, 7] . a meta-analysis on the mental health within the general population during the covid-19 pandemic reported the prevalence of anxiety to be 31.9% (95% ci: 27.5-36.7) and the prevalence of depression as 33.7% (95% ci: 27.5-40.6) [8] . when understanding mental health problems, co-occurrence becomes a complex and principal issue in regards to treatment adherence and engagement in prevention measures [9] . considerations to better understand co-occurrence during the pandemic are required. depression and anxiety are commonly co-occur at high rates, with a co-occurrence of depression and anxiety resulting in more severe and chronic psychopathology [10, 11] . several theoretical models have been proposed to explain the co-occurrence of anxiety and depression; the diathesis-stress model proposes a simultaneously development of symptoms and left untreated anxiety could increase the risk of depressive disorders and vice versa [12] [13] [14] [15] . however, there is no universal agreement to explain the cooccurrence of anxiety and depression. in order to further investigate the relationship between anxiety and depression, the current study applied network analysis. 
to interpret the mechanisms of any underlying psychopathology and develop effective interventions, it is essential to characterize the interactions between the two different mental disorders. network models describe mental disorders using an interacting web of symptoms, which can offer new insight into co-occurrence [9, 16] . according to network theory, the symptoms of a mental disorder can lead to development of another disorder; the co-occurrence belongs to a dynamic network of symptoms that cause, sustain, and underlie the symptomology [17, 18] . bridge symptoms can be regarded as the symptoms that connect two mental health disorders, and the activation of the bridge symptoms increase the risk of symptoms transferring from one disorder to another [9] . thus, the identification of bridge symptoms between depression and anxiety could provide meaningful clinical implications to prevent cooccurrence. this could be done through applying targeted and prioritized treatment for bridge symptoms to control and prevent activation that can lead to the co-occurring symptoms between depression and anxiety. during the pandemic, there has been a dramatic decreases in individuals' social activities [2] . considering the preventative measures of quarantine, social distancing, and lockdown, people's mobile-related activities have been largely reduced. it is likely that motor-related symptoms could then be considered bridge symptoms between anxiety and depression. to understand how symptoms change over time, several studies have focused on psychologically related distresses during different covid-19 stages, with a lack of consensus within the studies' results. in a recent longitudinal study on mental health during covid-19, no significant changes in anxiety and depression were found in the general chinese population between the initial outbreak and the after peak period [6] . on the other hand, qiu et al. [1] conducted a national survey among chinese individuals and found that the distress caused by covid-19 decreased significantly over time among the general public. however, the existing studies did not investigate the mechanism and changes in anxiety and depressive symptoms throughout the covid-19 outbreak and the after peak using network analysis. a recently developed symptom network perspective has highlighted the importance of not only measuring whether symptoms have changed but measuring the interactions between individual symptoms [19] [20] [21] . using network analysis may then provide a more in-depth understanding on the dynamic changes between symptoms of depression and anxiety at different points throughout the pandemic. the researchers aimed to assess the interactions between anxiety and depressive symptom over the outbreak and peak of covid-19, and to identify the bridge symptoms (i.e., depressive symptoms with strong associations with anxiety symptoms) using network analysis. considering the covid-19-related prevention measures of social distancing and isolation, we hypothesized that motor-related symptoms would be the bridge symptoms between depression and anxiety. the current survey included a total of 5274 chinese participants who completed a surveyed via "wenjuanxing," a chinese online platform providing functions equivalent to qualtrics. the location was verified by participants' cellphone gps trackers. to avoid duplication of data, each ip address was only granted access once to complete the questionnaire. 
detailed data collection information, inclusion and exclusion criteria, and demographic information are described in supplementary information. a total of 5083 participants were included in the analysis. specifically, 2540 participants (mean age = 25.28 ± 8.07, education years = 15.93 ± 1.82) were surveyed during the outbreak stage from february 6 to 16, 2020 (fig. 1) . and, 2543 participants (mean age = 22.03 ± 6.30, education years = 15.97 ± 1.26) were surveyed during the after peak stage. the study was approved by the ethics committee of central university of finance and economics and the second xiangya hospital of central south university. the patient health questionnaire-9 (phq-9) depression symptoms were assessed via the nine-item phq-9 [22] . the items of phq-9 and their reference names are listed in table s1 . the scales for the questionnaire are in a four-point likert format where participants evaluate their symptoms on a scale from 0 (not at all) to 3 (nearly every day), with higher scores indicating severe symptoms. the validated chinese version uses a cutoff score of 5 to determine whether a participant had mild depression symptoms, and the same cutoff score was used for this study [22] [23] [24] . the cronbach's alpha was 0.915. anxiety symptoms were assessed using the seven-item gad-7 scale [25] . the items of gad-7 and their reference names are listed in table s1 . the scales consist of a fourpoint likert format, in which participants evaluate their symptoms on a scale from 0 (not at all) to 3 (nearly every day), with higher scores indicating severe symptoms. the validated chinese version uses a cutoff score of 5 to determine whether a participant has at least mild anxiety symptoms, and was also used to determine the cutoff score for this study [23, 26, 27] . the cronbach's alpha was 0.941. the changes of sum scores for depression and anxiety were compared, respectively, between the outbreak and after peak stages using two-tailed independent t-tests, with the significance level set as 0.05. the network analysis was then performed in the aspects of network estimation, network stability, and network differences [28] . in accordance with network parlance, the scores of the items were considered as nodes and the pair-wise correlations between these scores were considered as edges [18, [29] [30] [31] [32] . to estimate the symptom network illustrating the relationship between depression and anxiety symptoms, pair-wise pearson correlations were run and a sparse gaussian graphical model with the graphical lasso was performed to estimate the network [33] . the tuning parameter was decided upon using the extended bayesian information criterium [34] . within this procedure, symptom networks at outbreak and after peak stages were estimated. the r package "bootnet" was utilized to complete this analysis [35] . the network structure was characterized by network centrality indices, this is where each node is placed within a weighted network, i.e., strength, closeness, and betweenness [36, 37] . specifically, strength is the sum of edge weights directly connected to a node, which measures the importance of a symptom in the network. closeness is the inverse of the average shortest path length between a node and other nodes, it measures how close the symptom is linked to other symptoms. betweenness is the number of times that the shortest path between any two nodes passes through another node and measures the importance of the symptom in linking to other symptoms. 
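a rough python analogue of the estimation and centrality steps described above is sketched below, using scikit-learn's cross-validated graphical lasso as a substitute for the ebic-tuned glasso (an approximation, not the same tuning rule) and networkx for the three centrality indices, with distances taken as inverse absolute edge weights. the data are simulated, and details such as the handling of negative edges and normalisation may differ from the r packages used in the study.

```python
import numpy as np
import networkx as nx
from sklearn.covariance import GraphicalLassoCV

# simulated item scores: rows = participants, columns = 16 symptom items
rng = np.random.default_rng(3)
latent = rng.normal(size=(500, 1))
X = 0.6 * latent + rng.normal(size=(500, 16))

# sparse gaussian graphical model; partial correlations from the precision matrix
model = GraphicalLassoCV().fit(X)
prec = model.precision_
d = np.sqrt(np.diag(prec))
partial_corr = -prec / np.outer(d, d)
np.fill_diagonal(partial_corr, 0.0)

# weighted symptom network: edges are nonzero partial correlations
g = nx.Graph()
items = [f"item{i}" for i in range(16)]
for i in range(16):
    for j in range(i + 1, 16):
        w = partial_corr[i, j]
        if abs(w) > 1e-6:
            g.add_edge(items[i], items[j], weight=w, dist=1.0 / abs(w))

# strength, closeness, and betweenness on the weighted network
strength = {n: sum(abs(e["weight"]) for _, _, e in g.edges(n, data=True)) for n in g}
closeness = nx.closeness_centrality(g, distance="dist")
betweenness = nx.betweenness_centrality(g, weight="dist")
print(sorted(strength, key=strength.get, reverse=True)[:3])
```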
the "centrality plot" function from "qgraph" package in r was used to complete this analysis [38] . the role of a symptom as a bridge between anxiety and depressive symptoms was also assessed. similar to the network centrality, the bridge centrality, which includes bridge strength, bridge closeness, and bridge betweenness, of each symptom was analyzed. the only difference between network and bridge centrality is that the associated two symptoms, as mentioned above, are from different disorders. the bridge centrality of the nodes measures the importance of a symptom in linking two mental health disorders. the complete this analysis the r package "networktools" [39] was used. after checking the stability of the network structure (see supplementary information), the symptom connections and the network properties, as mentioned above, were compared. the comparison was between the outbreak and the after peak stages to allow for any symptom network changes caused by the pandemic to be quantified. the differences were quantified using permutation tests with 1000 iterations [40, 41] using the r package "network comparison test" [42] . specifically, participants were randomly assigned into two group (within the outbreak group and the same within the after peak group). then the symptom networks were constructed, estimated, and compared using a bootstrap method of resampling by repeating 1000 times to get the null distribution of the network differences under the null hypothesis. the significance level was set as 0.05. in addition, the network differences in both edge and network properties, in global and local level, were compared. the global differences in edge weights were measured by the largest difference in paired edges between two networks. meanwhile, the local edge weight differences were also separately measured. in addition, the global difference in strength was measured by the difference between average strength. finally, the differences in local network properties were also measured separately. the severity of each disorder, between outbreak and after peak stages, was compared. it was found that participants at the after peak stage were more depressed than that at the outbreak stage (phq-9, m after peak = 4.72, m outbreak = 4.17, t 5075.5 = 4.0313, p < 0.001). however, the anxiety disorder scale scores (gad-7) showed no difference between these two stages (m after peak = 3.60, m outbreak = 3.57). using the cutoff score of 5 (at least experiencing mild depression and anxiety symptoms), after the peak stage, 42.94% of the participants showed depression symptoms, which is significantly higher (χ 2 = 24.29, p < 0.001) than that in outbreak stage (36.14%). meanwhile, we found more participants showed anxiety symptoms (χ 2 = 10.57, p = 0.001) after peak (36.41%), compared to the outbreak stage (32.05%). the estimated networks are displayed in fig. 1 . detailed edges weights are listed in tables s2 and s3. the symptom network at outbreak stage showed different patterns regarding the number and thickness of the edges. before characterizing the network properties and quantifying the property differences, the stability of the symptom networks during outbreak and after peak stages was evaluated by using the bootstrap method, results are displayed in figs. s1 and 2. these figures showed that most of the edges and centrality were stable. detailed results are provided in supplementary information. 
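a stripped-down version of the permutation comparison described above is sketched below: pool the two samples, repeatedly reshuffle the group labels, re-estimate both networks, and compare the observed maximum edge difference and global-strength difference against the permutation distribution. for brevity the networks here are plain pearson correlation matrices rather than the regularized model, so the sketch illustrates the permutation logic only; the results reported next come from the full procedure.

```python
import numpy as np

def network(X):
    """toy network estimate: pearson correlation matrix with zeroed diagonal."""
    c = np.corrcoef(X, rowvar=False)
    np.fill_diagonal(c, 0.0)
    return c

def test_statistics(X1, X2):
    n1, n2 = network(X1), network(X2)
    max_edge_diff = np.max(np.abs(n1 - n2))
    global_strength_diff = abs(np.abs(n1).sum() - np.abs(n2).sum()) / 2
    return max_edge_diff, global_strength_diff

def permutation_test(X1, X2, n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    obs = test_statistics(X1, X2)
    pooled = np.vstack([X1, X2])
    n1 = len(X1)
    null = np.empty((n_iter, 2))
    for it in range(n_iter):
        idx = rng.permutation(len(pooled))
        null[it] = test_statistics(pooled[idx[:n1]], pooled[idx[n1:]])
    p_values = (null >= np.asarray(obs)).mean(axis=0)
    return obs, p_values

# toy usage with two simulated samples of 16 items each
rng = np.random.default_rng(4)
X_outbreak = rng.normal(size=(300, 16))
X_afterpeak = rng.normal(size=(300, 16))
print(permutation_test(X_outbreak, X_afterpeak, n_iter=200))
```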
therefore, the network differences between the outbreak and after peak stages reflect solid changes of the psychological interaction patterns that were caused by the pandemic. the network differences in both edge and network properties were compared. no global differences were found between networks from outbreak and after peak stages. globally, according to the permutation test, the maximum difference (diff, contrast: after peak − outbreak, same below) between stages in any of the edge weights from both networks was not significant (the maximum difference in edge was between "afraid" and "inability to relax" symptoms from current networks, diff = −0.16, p = 0.20). meanwhile, the global strength difference between outbreak (global strength = 8.38) and after peak (global strength = 8.24) stages was also found as not significant (p = 0.70). however, local differences were found in multiple edges and nodes. locally, the networks at outbreak and after peak stages differed not only in symptom connections (edge weights), but also in network properties (network and bridge centrality). specifically, for the edge weights, the significant positive and negative correlations were visualized separately in fig. 3 (p < 0.05). at the after peak stage, insomnia symptom from the phq-9 showed stronger connections with impaired motor skills and changes in appetite symptoms from the gad-7 as well as with nervous symptoms from the phq-9. no decreased connections with other symptoms were shown. by contrast, the symptom of inability to relax from the gad-7 showed a decreased connection with symptoms of being afraid, restless, and irritable from the gad-7 and also with suicidal thoughts and guilty symptoms from the phq-9. there were no increased connections with other symptoms shown. it should also be noted that during the after peak stage, compared to the outbreak stage, suicidal thoughts showed a decreased connection with "inability to relax" and "guilty" symptoms, whereas suicidal thoughts showed an increased connection with the "too much worry" symptom. the decreased connection between feeling guilty and suicidal thoughts from the outbreak stage to the after peak stage is also illustrated in fig. s1 , in which the edge weights, no matter if from the current sample or bootstrapped sample, ranked at the top in the outbreak stage and dropped to number nine in the after peak stage. for the network properties, bar plots indicate the network and bridge centrality of each symptom in each stage as displayed in fig. 4 . during the outbreak, psychomotor symptoms such as impaired "motor skills, restless, and inability to relax" exhibited high network betweenness and bridge betweenness. while during the after peak stage, although these symptoms decreased, they were still relatively high when compared with other symptoms. these symptoms might not necessarily exhibit intensive connections with other symptoms. however, they stand between the associated symptoms, which may have played a key role as a mediator that regulated the connections between the symptoms in the network [43] . moreover, besides these symptoms, several other symptoms also showed increased network and bridge centrality during the after peak stage. in specific, using permutation tests, it was found that the "inability to relax" symptom showed a decreased strength at the after peak stage (diff = −0.16, p = 0.03) when compared to the outbreak stage. 
meanwhile, the "restlessness" symptom exhibited decreased betweenness (diff = −25, p = 0.04) and the "impaired motor skills" symptom showed decreased betweenness (diff = −26, p = 0.01), bridge closeness (diff = −0.021, p = 0.048), and bridge betweenness (diff = −27, p = 0.01). by contrast, the "irritable" symptom showed increased strength (diff = 0.22, p = 0.02), betweenness (diff = 14, p = 0.03), and bridge betweenness (diff = 14, p = 0.02) during the after peak stage, compared to the outbreak stage. meanwhile, the "loss of energy" symptom showed increased bridge closeness (diff = 0.016, p = 0.03) and bridge betweenness (diff = 11, p = 0.02). (fig. 3 caption: edges exhibiting significant differences between outbreak and after peak stages; the green nodes denote the gad-7 items and the orange nodes denote the phq-9 items; the blue edges denote the increased correlations between items at the after peak stage when compared with those in the outbreak stage, and the red edges denote the decreased ones.) the novelty of the current study was to evaluate the psychopathological symptom changes between the outbreak and after peak in china, which have significant implications for other countries that have not yet reached their after peak. the current study identified the bridge symptoms and aimed to identify the risks of co-occurrence between anxiety and depressive symptoms during different phases of covid-19 to prevent increasing psychological distress. the network differences and changes between outbreak and after peak stages showed the impact of the covid-19 pandemic on psychological interaction patterns. the prevalence of anxiety and depression in this study during the outbreak was 32.05% and 36.14%, and during the after peak phase was 36.41% and 42.94%. similar to the meta-analysis on depression and anxiety during the covid-19 pandemic, over one-third of the population suffered from anxiety and depressive symptoms [8]. researchers have suggested that the mental health consequences of covid-19 could last over time and that mental health problems could peak later than the actual pandemic [3]. our results were consistent with this prediction and showed that after the covid-19 peak the prevalence of depression and anxiety increased. this could be due to the far-reaching influences of covid-19, such as the induced economic uncertainty, the fear of economic crisis and recession, and increased unemployment [44, 45]. these aftereffects could all work toward increasing anxiety and depression after the actual pandemic peak. research has noted that different mental health problems have emerged during the covid-19 outbreak, mainly anxiety and depression [46]. previous research has examined the symptoms of anxiety and depression using network analysis in psychiatric patients and found that sad mood and worry were the most central symptoms in the network [28]. in the current study, during the outbreak stage psychomotor symptoms such as impaired motor skills, restlessness, and inability to relax were the most central symptoms in the network. during the after peak stage these symptoms showed decreased centrality but were still relatively high when compared with other symptoms. in addition, the irritable symptom showed increased centrality during the after peak stage. that is, after the peak, the centrality of psychomotor symptoms decreased, while overall mental health problems became more severe owing to contributions from other, non-psychomotor-related aspects.
after the pandemic peak time, normal social activities started to resume. this could explain why people's physical-and motor-related activities began to show normality as the psychomotor-related symptoms would be eased. however, the mental health problems caused by the pandemic could have prolonged effects [3] and people might be anxious and depressed from other nonpsychomotor aspects. during the covid-19 period, there was a perceived decrease in physical-related activities [2] , which correspond with the central symptoms identified from the data. compared with the non-symptomatic group, depressed patients presented disturbances in psychomotor symptoms in terms of motor activities, body movement, and motor reaction time [47] [48] [49] . researchers have proposed that psychomotor symptoms may have unique significance in depression, which could explain the psychomotor manifestations and pathophysiologic significance of depression [47] . restless-agitation in anxiety is also related to psychomotor functions, in which the higher level of restlessagitation indicated more severe anxiety [25] . after assessing the interactions between anxiety and depressive symptoms, it was identified that the bridge symptoms during the outbreak also focused on psychomotor symptoms such as impaired motor skills, restlessness, and inability to relax. in particular, the impaired motor skill symptoms showed a significant decrease in bridge centrality during the after peak phase, although it was still relatively high when compared with other symptoms. meanwhile, it was also observed that the inability to relax showed decreased connections with being afraid, restlessness, suicidal thoughts, and feelings of guilt. in addition, during the after peak phase, other bridge symptoms such as irritable and loss of energy emerged, which showed higher bridge centrality than the outbreak stage. in a risky network, the connections among symptoms are tight and strong, and the activation of one symptoms could lead to others, resulting in more severe consequences [28, 30] . during the outbreak and after peak, the occurrence of either impaired motor skills with depression symptoms or restlessness with anxiety symptoms could increase the risk of activation for other mental disorders. this was different from a previous study conducted during the pre-pandemic period. previous network analysis has shown that the association of anxiety and depression can be attributed to the strong connection from anxious worrying to sleep problems and difficulty concentrating [50] . our results also indicated that during the after peak insomnia showed enhanced connections with appetite changes, impaired motor skills, and nervous symptoms. compared to the non-pandemic period, there have been a wide-scale lockdown and restrictions on transportation during the covid-19 pandemic. the beneficial effects of physical health on mental health have been welldocumented in research [51, 52] . covid-19 is having a negative impact on people's physical activity on a global level [53, 54] . recent covid-19 research in psychiatric patients also reported that poor physical health was related with higher levels of anxiety and depression [55] . this could explain why the impaired motor skills aspect and restlessness become the bridge symptoms between anxiety and depression. depression and anxiety are frequently co-occurring mental disorders, and previous research has indicated the likelihood of a causal relationship between these two mood disorders [50] . 
a cognitive neuroscience study using default model network (dmn) indicated that cortical areas of the dmn showed functional connectivity associated with anxiety and depression [56] . similar to previous studies, the current study cannot confirm the causal relation between anxiety and depression. however, the current network analysis can be utilized in clinical practice during the covid-19 period. a previous study suggested that interventions should focus on depression and anxiety symptoms which are most closely related to other symptoms, since those symptoms should theoretically decrease the associated risk [57] . moreover, symptoms with a high centrality may also have crucial roles in the network [58] . those core symptoms could have important roles in maintaining the psychopathology network and treating those symptoms could help to cure the psychopathology. that is, for treating covid-19-related mood problems, the study results suggest clinical practitioners to focus on the symptoms highlighted by our network analysis. researchers have expressed concern about the consequence of mental disorders resulting from the covid-19 pandemic [59] and mental health professionals have speculated a globe increase of mental disorders due to the impact of covid-19 [60, 61] . who has also mentioned that covid-19 related specific measures, such as self-isolation, quarantine, and social distancing, might increase loneliness and mood-related problems such as anxiety and depression in people [62] . our results showed that during the after peak phase, the impaired motor-skill-related symptoms were still prominent. it is hard to predict the duration of the covid-19 crisis, especially as cities such as leicester, united kingdom [63] are undergoing a second lockdown. it is possible that impaired motor-skill-related symptoms could persistent in people in the second lockdown control zones. during the covid-19 lockdown, physical health professionals have recommended people to stay active with home-based physical activities in order to maintain their health, engaging in activities such as aerobic exercise training and body weight training [53] . a healthy lifestyle and regular exercise are associated with an enhanced immune system [2] , which could help protect people from covid-19-related health problems. this study suggests that health professionals could provide tailored and practical suggestions for the general population by targeting mood symptoms through exercise as a prevention or as a treatment strategy. researchers have proposed to use mindfulness-based stress reduction practices to improve mental health during the covid-19 [64] [65] [66] . in the current literature, mindfulness-based interventions have shown effectiveness in reducing anxiety and depression [67, 68] . there are several limitations to the study that should be acknowledged. first, depression and anxiety were measured by self-reported questionnaires rather than systematic diagnosis. second, this network analysis on depression and anxiety focused specifically on the covid-19 pandemic and cannot be generalized to non-pandemic times. therefore, the central symptoms and bridge symptoms identified in the current study may not applicable during other periods. third, the age of the participants was relatively young. fourth, due to the cross-sectional design, causal relationship could not be established. future longitudinal studies are needed to investigate the causal relationship between anxiety and depression. 
finally, the study did not measure the changes in physical health and the degree of reduction in physical activities during covid-19. in conclusion, this is the first network analysis focusing on psychopathological symptoms during the covid-19 pandemic, which provides valuable insights to understand the interactions between depression and anxiety. the current findings indicated the central symptoms and bridge symptoms during the covid-19 outbreak and after peak stages in order to provide clinical suggestions for psychological interventions that target reducing the co-occurrence of symptoms between different mental health problems. author contributions rc, yw, zh, and yf designed the study. yf conducted the study. yf, and zh analyzed the data. yw, rc, zh, and aw drafted the paper. all authors read and approved the final paper. conflict of interest the authors declare that they have no conflict of interest. publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. open access this article is licensed under a creative commons attribution 4.0 international license, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the creative commons license, and indicate if changes were made. the images or other third party material in this article are included in the article's creative commons license, unless indicated otherwise in a credit line to the material. if material is not included in the article's creative commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. to view a copy of this license, visit http://creativecommons. org/licenses/by/4.0/. a nationwide survey of psychological distress among chinese people in the covid-19 epidemic: implications and policy recommendations using psychoneuroimmunity against covid-19 suicide risk and prevention during the covid-19 pandemic timely mental health care for the 2019 novel coronavirus outbreak is urgently needed medical journals and the 2019-ncov outbreak a longitudinal study on the mental health of general population during the covid-19 epidemic in china immediate psychological responses and associated factors during the initial stage of the 2019 coronavirus disease (covid-19) epidemic among the general population in china prevalence of stress, anxiety, depression among the general population during the covid-19 pandemic: a systematic review and meta-analysis bridge centrality: a network approach to understanding comorbidity depression and generalized anxiety disorder: co-occurrence and longitudinal patterns in elderly patients the comorbidity of major depression and anxiety disorders: recognition and management in primary care pathways to anxiety-depression comorbidity: a longitudinal examination of childhood anxiety disorders the psychology of fear and stress. 
cambridge: cambridge university press tripartite model of anxiety and depression: psychometric evidence and taxonomic implications comorbidity of anxiety and depression in children and adolescents: 20 years after complex realities require complex theories: refining and extending the network approach to mental disorders what kinds of things are psychiatric disorders network analysis: an integrative approach to the structure of psychopathology psychometric perspectives on diagnostic systems assessment of symptom network density as a prognostic marker of treatment response in adolescent depression a network analysis of dsm-5 posttraumatic stress disorder symptoms and correlates in us military veterans the phq-9: validity of a brief depression severity measure the immediate impact of the 2019 novel coronavirus (covid-19) outbreak on subjective sleep status validity and utility of the patient health questionnaire (phq)-2 and phq-9 for screening and diagnosis of depression in rural chiapas, mexico: a cross-sectional study a brief measure for assessing generalized anxiety disorder: the gad-7 validation of the generalized anxiety disorder-7 (gad-7) and gad-2 in patients with migraine validity of the generalized anxiety disorder-7 scale in an acute psychiatric sample network analysis of depression and anxiety symptom relationships in a psychiatric sample the network structure of schizotypal personality traits association of symptom network structure with the course of depression models and methods in social network analysis the symptom network structure of depressive symptoms in late-life: results from a european population study sparse inverse covariance estimation with the graphical lasso a new method for constructing networks from binary data estimating psychological networks and their accuracy: a tutorial paper node centrality in weighted networks: generalizing degree and shortest paths cytokine levels and associations with symptom severity in male and female children with autism spectrum disorder borsboom d. qgraph: network visualizations of relationships in psychometric data assorted tools for identifying important nodes in networks acute and chronic posttraumatic stress symptoms in the emergence of posttraumatic stress disorder:a network analysis symptoms of posttraumatic stress disorder in a clinical sample of refugees: a network analysis comparing network structures on three aspects: a permutation test the architecture of complex weighted networks covid-induced economic uncertainty the socio-economic implications of the coronavirus pandemic (covid-19): a review problems shared in psychiatry help-line of a teaching hospital in eastern nepal during covid-19 pandemic lockdown psychomotor symptoms of depression psychomotor symptoms in depression: a diagnostic, pathophysiological and therapeutic tool psychomotor disturbance in depression: defining the constructs perceived causal relations between anxiety, posttraumatic stress and depression: extension to moderation, mediation, and network analysis quality of life of women who practice dance: a systematic review protocol vigorous physical activity, mental health, perceived stress, and socializing among college students physical activity and coronavirus disease 2019 (covid-19): specific recommendations for home-based physical training sanchis-gomar f. 
health risks and potential remedies during prolonged lockdowns for coronavirus disease 2019 (covid-19) do psychiatric patients experience more psychiatric symptoms during covid-19 pandemic and lockdown? a case-control study with service and research implications for immunopsychiatry default mode network dissociation in depressive and anxiety states the core symptoms of bulimia nervosa, anxiety, and depression: a network analysis network destabilization and transition in depression: new methods for studying the dynamics of therapeutic change covid 19 and its mental health consequences changes in network centrality of psychopathology symptoms between the covid-19 outbreak and after peak progression of mental health services during the covid-19 outbreak in china patients with mental health disorders in the covid-19 epidemic leicester lockdown: streets deserted in city an e-mental health intervention to support burdened people in times of the covid-19 pandemic: cope it the benefits of meditation and mindfulness practices during times of crisis such as covid-19 social distancing in covid-19: what are the mental health implications? mindfulness-based stress reduction (mbsr) reduces anxiety, depression, and suicidal ideation in veterans the efficacy of mindfulness-based meditation therapy on anxiety, depression, and spirituality in japanese patients with cancer key: cord-006292-rqo10s2g authors: kumar, sameer; markscheffel, bernd title: bonded-communities in hantavirus research: a research collaboration network (rcn) analysis date: 2016-04-07 journal: scientometrics doi: 10.1007/s11192-016-1942-1 sha: doc_id: 6292 cord_uid: rqo10s2g hantavirus, one of the deadliest viruses known to humans, hospitalizes tens of thousands of people each year in asia, europe and the americas. transmitted by infected rodents and their excreta, hantavirus are identified as etiologic agents of two main types of diseases—hemorrhagic fever with renal syndrome and hantavirus pulmonary syndrome, the latter having a fatality rate of above 40 %. although considerable research for over two decades has been going on in this area, bibliometric studies to gauge the state of research of this field have been rare. an analysis of 2631 articles, extracted from wos databases on hantavirus between 1980 and 2014, indicated a progressive increase (r (2) = 0.93) in the number of papers over the years, with the majority of papers being published in the usa and europe. about 95 % papers were co-authored and the most common arrangement was 4–6 authors per paper. co-authorship has seen a steady increase (r (2) = 0.57) over the years. we apply research collaboration network analysis to investigate the best-connected authors in the field. the author-based networks have 49 components (connected clump of nodes) with 7373 vertices (authors) and 49,747 edges (co-author associations) between them. the giant component (the largest component) is healthy, occupying 84.19 % or 6208 vertices with 47,117 edges between them. by using edge-weight threshold, we drill down into the network to reveal bonded communities. we find three communities’ hotspots—one, led by researchers at university of helsinki, finland; a second, led by the centers of disease control and prevention, usa; and a third, led by hokkaido university, japan. significant correlation was found between author’s structural position in the network and research performance, thus further supporting a well-studied phenomenon that centrality effects research productivity. 
however, it was pagerank centrality that out-performed degree and betweenness centrality in the strength of its correlation with research performance. hantavirus, transmitted to humans through persistently infected rodents and their excreta, is a global public health threat hospitalizing tens of thousands of people every year throughout the world. since the isolation of the first hantavirus, htnv (or hantaan virus), in 1976, several other hantaviruses have been identified, with at least 22 being pathogenic to humans (bi et al. 2008). one of the first major outbreaks of hantavirus was reported from 1951 to 1954, when close to 3,200 american soldiers serving in korea became infected with the virus. in recent times, several cases have been reported in asia (zhang et al. 2004), the us, and europe. the hantavirus genus belongs to the bunyaviridae family and is identified as an etiologic agent of two different types of diseases-hemorrhagic fever with renal syndrome (hfrs) and hantavirus pulmonary syndrome (hps). hfrs is also known by earlier names such as korean hemorrhagic fever (khf), epidemic hemorrhagic fever (ehf), and nephropathia epidemica (ne) (bi et al. 2008). hfrs affects close to 150,000-200,000 people throughout the world each year while hps infects just about 200. however, the fatalities caused by the latter are above 40 %, compared to 1-12 % in the case of hfrs, depending on the severity of the virus (lednicky 2003; schmaljohn and hjelle 1997). hfrs is more prevalent in the eurasian region and hps in the americas. china remains the most endemic nation, accounting for close to 70-90 % of hfrs cases in the world (zhang et al. 2004). ne, the mild form of hfrs, is most dominant in western and central europe. a quick glance at the web of science databases reveals a progressive increase in research papers on hantavirus. the research in the field is paving the way to finding more pathogens, associated diseases, and vaccines. however, bibliometric studies to gauge hantavirus research are surprisingly rare. hence, we set out mainly to identify the prominent researchers in the research collaboration network and the bonded communities they were embedded in.

research collaboration, a key mechanism that brings multiple talents together to accomplish a research task, can be effectively gauged through bibliometric records in research papers (heinze and kuhlmann 2008). co-authorship in research papers has long remained the basis of investigating research collaborations (beaver and rosen 1978). the co-authors of a research paper could reveal the exchange of knowledge among researchers in their effort to bring out a published paper. similarly, the affiliation details in the bibliometric records could be extrapolated to reveal collaboration happening at institutional and international levels. whether research collaboration can be gauged by just looking at bibliometric records is a matter of academic debate (katz and martin 1997). for example, a collaboration could take place (i.e. through research advice) even if the two researchers do not finally end up penning the research paper together. then there are issues of honorary and ghost authorships (wislar et al. 2011). while these concerns are serious, using bibliometric records is still the most concrete piece of evidence with which to establish a collaboration.
given the fact that co-authorship associations could also help us in understanding the association at institutional, organizational, and international levels, their significance cannot be overlooked. the number of co-authored papers across disciplines has been growing over the years (sonnenwald 2008) . better communications facilities, faster commuting, and industrialization brought in significant changes in the way research was conducted. now there are researchers in large teams working on research projects and naturally these lead to published papers having significantly larger numbers of co-authors. price (1963) calls these large lab-based research projects, 'big science'. however, big science research projects aren't generally in the social domain, that is, researchers do not have much choice to decide on their co-authors. like the sciences, research conducted in the social sciences has also seen significant increase in the number of co-authors (moody 2004) . it remains imperative to note that collaboration is sharing of knowledge and may not always suggest improved quality of work. for example, in the humanities there are still a significant proportion of papers that are solo-written. research collaborations could also be seen from the perspective of networks (kumar and jan 2014) . in a network, two entities form a connection if there is some kind of association between them (newman 2001) . using bibliometric data, these associations could be constructed to understand knowledge flows at multiple levels. co-authorship in published papers is considered a reliable proxy to gauge research collaborations (sonnenwald 2008; melin and persson 1996; katz and martin 1997) . social network analysis, an established research method to analyse social networks, is a set of mathematical algorithms that quantitatively analyse these relationships between nodes (wasserman and faust 1994) . in a co-authorship network, for example, it could be applied to identify various patterns-i.e. the best connected nodes or key actors (taba et al. 2015) or the communities that the researchers form through their associations. specifically, these analyses reveal the pattern of network at both global and local levels. at the global level, the network pattern is seen from a whole network perspective, revealing, for example, the density, transitivity, scale-free pattern, small-world pattern, or the communities or clusters that the nodes form. at the local level, things are seen from the node perspective. centrality is an important concept when looking from the perspective of a node and its context in the entire network. centrality determines the relative importance (through centrality measures such as betweenness, closeness, and pagerank) and connectedness (through 'degree' metric) of nodes. hence, those with higher centrality scores are those who are the most prominent players in the network. another interesting aspect of social networks is that of the ties that the node is directly connected to. the strength of connection (depicted by a thicker line on a network graph) demonstrates a more frequent and stronger relationship than those that have an association of just a single or very few times. the idea of strength of relationship (coleman 1988 ) is challenged by the notion of structural holes (burt 1997) . 
structural holes theory postulates that the absence of ties in an ego network (network of ego-central node-and alters or immediate connections and those immediate connections connecting to one another) brings in more opportunities to the ego (the central node) as the ego then acts a bridge for the flow of resources between the 'alters'. yet another idea of ties is postulated by the concept of 'weak ties'. the theory argues that in contrast to strong ties, which bring in trust, weak ties bring in new knowledge in the network. growth and preferential attachment are the prime features of self-organising networks (barabasi and bonabeau 2003) . preferential attachment (kumar and jan 2015) is defined by the preference of nodes (due to affinity or similarity) to attach to another node. in the context of co-authorship network, it may be due to the fact that one author connects to another author because he or she is a well-known researcher or has the same nationality as others. preferential attachment causes some nodes to have much higher number of connections than most other nodes in the network. these hubs are kind of 'power houses' that tie together the network. this is the very reason why a self-organising network are small worlds (has a short path between any two random nodes) (watts and strogatz 1998) . a targeted attack or absence due to some other reason could break the network down into pieces, which could severely affect the flow of resources in the network. nonetheless, these self-organising networks are quite tolerant to random attacks (albert et al. 2000) . why do researchers collaborate? there are several benefits to collaboration (beaver 2001) . sharing of expertise and division of work are among the most prominent. collaboration also allows sharing of resources. for example, it is possible that certain equipment may not be available to certain researcher and collaborating with someone who has access to this equipment enables the conduct of research. collaboration, due to division of labour, technically reduces the duration for the completion of research project, enabling researchers to publish more papers. due to requirement for promotion and tenureships, which require papers to be published in high impact journals, collaboration does really help. our goals here are two pronged. first, we are interested in knowing the prominent and most connected authors in the field. a number of studies in recent times have found that the relative position significantly correlates with the research performance of researchers (abbasi et al. 2011; kumar and jan 2013a) . we want to check if this stands true for our (hantavirus) dataset. however, another significant goal of this study is to detect the bonded communities of hantavirus research. with bonded communities we simply mean the cluster of researchers who interact more often with each other. a network of thousands of nodes otherwise only results in a hairball-like network that hardly provides much understanding or meaning. thus, in addition to common bibliometric analyses (i.e. annual paper production, average citations, top papers, number of papers per country, author research productivity, etc.), the present study has the following main objectives: a. investigate the prominent authors and the bonded-research communities clusters in hantavirus research. b. investigate if there is relationship between players or actors structural position in the network and research productivity. 
the study has significance as this would be perhaps one of the first studies to investigate research performance and bonded communities in hantavirus research from the perspective of research collaborations and networks. the idea of reaching out to bonded communities may be helpful to scientometricians wanting to get to the core of researchers who thickly interact with one another. they are the 'nucleus' or the real seat of knowledge of the network. gauging and mapping of research performance of a crucial area such as hantavirus is of immense relevance and importance to health and research policy makers. in addition, it attempts to understand if indeed the structural position in the network (i.e. the connectedness of actors in the network) has any significant correlation with research productivity. such results would add to the existing body of knowledge about whether or not structural connectedness in a network does affect academic performance. the rest of the paper is structured as follows: in the material and method section, we next discuss the data harvesting method and the keywords used to select the records. subsequently, we discuss the findings and finally we draw our conclusions. records were harvested from the web of sciences databases from 1980 to 2014. important hantavirus related keywords such as ''hantavirus'', ''hantaan virus'', ''hemorrhagic fever with renal syndrome'', ''hantavirus pulmonary syndrome'', ''korean hemorrhagic fever'', ''epidemic hemorrhagic fever'' and ''nephropathia epidemica'' were used to refine the records selection. following search command was used: topic: (''hantavirus'' or ''hantaan virus'' or ''hemorrhagic fever with renal syndrome'' or ''hantavirus pulmonary syndrome'' or ''korean hemorrhagic fever'' or ''epidemic hemorrhagic fever'' or ''nephropathia epidemica''). refined by: document types: (article). timespan: 1980-2014. indexes: sci-expanded, ssci, a&hci, cpci-s, cpci-ssh. the above keywords search and data cleaning resulted in the final availability of 2631 records for analysis. data cleaning is an arduous task in bibliometric studies. author name variations are among the most complicated as two or more authors may have same name and some even have the same institutional affiliation and hence their publications could be combined and shown as coming from a single author. on the other hand, an author may have different name variations and his or her publication may get split across these different name variations. at the institution and country levels, there is a need to make the names uniform. for example, in the present set of records, at the institution level, usa actually is an abbreviation of ''us army''. in older data some of the country names are not mentioned, hence, by manual checking, they were appended. by manual checking much of these issues were resolved and errors minimised. social network analysis (wasserman and faust 1994 ) is a main research method applied in this study. as mentioned earlier, a network could be constructed when two entities are related in some way. on the graph, nodes are represented by a 'dot' and the connection between nodes, as a line passing between them. hence, if two or more authors associate to co-write a research paper, the authors would be represented as nodes and the co-authored paper (the basis of relationship) is represented with a line passing between them. nodes are also referred to as 'vertices' and relationship between them as 'edges'. 
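the construction step described above can be made concrete with a short sketch: each bibliographic record contributes one edge (or an increment of an existing edge weight) for every pair of its co-authors. the record format, field names and example author names are assumptions for illustration; the original study used cleaned web of science records and nodexl rather than python.

```python
# Build a weighted co-authorship network from bibliographic records.
# Each record is assumed to be a dict with an 'authors' list that has
# already been cleaned and disambiguated, as described in the text.
from itertools import combinations
import networkx as nx

records = [
    {"authors": ["vaheri a", "vapalahti o", "plyusnin a"]},
    {"authors": ["vaheri a", "lundkvist a"]},
    {"authors": ["hjelle b"]},                        # solo paper: contributes no edge
]

G = nx.Graph()
for rec in records:
    authors = sorted(set(rec["authors"]))             # guard against duplicate names
    if len(authors) < 2:
        continue                                      # only records with >= 2 authors form ties
    for a, b in combinations(authors, 2):             # every co-author pair (dyad)
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1                    # repeat collaboration strengthens the tie
        else:
            G.add_edge(a, b, weight=1)

print(G.number_of_nodes(), "vertices,", G.number_of_edges(), "edges")
```

edge weights accumulated this way are what the later 'edge-core' reduction thresholds on.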
it is obvious that just one representation of a co-authored paper with dots and lines on a graph does not reveal much, but when hundreds (and at times thousands) of papers are represented in a graph, an interplay of associations is revealed and seemingly invisible associations become visible. data elements from the records are extracted and the co-authorship network constructed (see fig. 1).

[fig. 1: the extraction of data elements (paper code, authors 1 to n, affiliation addresses, citations) from bibliographic records and the construction of the co-authorship network; only records with two or more authors (a dyad being the smallest building block of a network) are used.]

three centrality measures are calculated-degree, betweenness centrality, and pagerank centrality. we also calculate the local clustering coefficient and the average geodesic distance of the network. we have not calculated closeness centrality, as this centrality gives accurate results for one component (typically a giant component) at a time. it tends to give misleading results if the calculation is made for all the components in the network (for example, nodes in a dyadic component will have higher closeness centrality than nodes with high degree in the main component). since we are interested in all the authors in the network (and not just those in the giant component), we have chosen to leave out closeness centrality in our graph metrics calculations. degree, a popularity measure, is simply the number of direct connections a node has. betweenness centrality is path-based and checks how much 'in-between' a node is in the network. those with high betweenness centrality have a positional advantage and work as bridges between communities. removal of these nodes could severely affect the flow of resources in the network. pagerank is a prestige metric that checks not only the number of connections a node has but also the number of connections of its alters. the mathematical formulae used are standard and are provided in the 'appendix'. the centrality values are then correlated, using ms-excel's correlation statistical function, with the number of papers produced and citations accumulated, to check if there is any significant association between the two. nodexl (smith et al. 2009) was used to calculate graph metrics and visualize the network diagrams.

the yearly paper production shows an upward trend. the worldwide alarm raised by the deadly virus has had researchers looking for the pathogens, its geographical reach, and its potential cure. from just four related papers published in 1980, the number grew to 180 in 2014. a linear trendline (r^2 = 0.93) shows a good fit, meaning that the growth in paper production on hantavirus has been steady over the years (fig. 2). however, a large proportion of paper production has been concentrated in certain regions of the world. the majority of the research is going on in europe and the usa (see fig. 3). when contrasted with the actual occurrence of hantavirus infection cases (see fig. 4), we find that china (although a distant second in terms of the number of papers produced) is probably doing comparatively much less research when compared to the number of hantavirus cases reported from the region. as mentioned earlier, china accounts for close to 70-90 % of all hantavirus cases in the world.
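a minimal sketch of the graph metrics described above, computed on the weighted graph `G` built earlier; standard networkx calls are used as stand-ins for the nodexl computations, and the pagerank damping factor of 0.85 is an assumed default rather than a reported setting.

```python
# Degree, betweenness, PageRank and clustering coefficient for a
# co-authorship graph G; closeness is deliberately omitted, mirroring
# the reasoning given in the text.
import networkx as nx

def graph_metrics(G):
    metrics = {
        "degree": dict(G.degree()),                              # number of co-authors
        "betweenness": nx.betweenness_centrality(G),             # bridging position
        "pagerank": nx.pagerank(G, alpha=0.85, weight="weight"), # prestige (assumed alpha)
        "clustering": nx.clustering(G),                          # local transitivity
    }
    # The average geodesic distance is only defined within a connected
    # component, so it is computed on the largest ("giant") component.
    giant = G.subgraph(max(nx.connected_components(G), key=len))
    metrics["avg_geodesic_giant"] = nx.average_shortest_path_length(giant)
    return metrics

m = graph_metrics(G)
top5 = sorted(m["pagerank"], key=m["pagerank"].get, reverse=True)[:5]
print(top5, round(m["avg_geodesic_giant"], 2))
```

the same dictionaries can then be fed into any correlation routine, as the study did with ms-excel.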
the top ten countries in terms of research productivity are led by the usa (1003 papers; see fig. 3 for the country-level breakdown). the entire publication base of hantavirus's 2631 papers received a total of 58,078 citations, or an average of 21.09 citations per paper. these are good averages and indicate the sound 'health' of research in the field. the papers show a downward trajectory in terms of average citations received-those papers that were written earlier are cited significantly more than those published in later years (see fig. 2). this is of course to be expected, as papers that were written earlier have stayed in the knowledge base for a much longer time than the recent ones and thus have had more opportunity to be cited. some also get a chance (depending on their influence on the field) to enter the very 'seminal knowledge'. once these papers are in this select group, they are cited considerably more than the rest of the papers. the 1333 papers published during the 2005-2014 timeframe were cited 11.39 times on average, compared to the 1298 papers published in the older 1980-2004 period, which were cited 33.10 times on average. 92 % of papers in the older time period had received at least one citation, compared to 85.67 % in the newer timeframe (2005-2014).

of the 7426 authors, a large proportion, or 5152 authors (69.37 %), have produced just one paper. 1034 authors (13.92 %) have produced two papers each, 1230 authors (16.56 %) have produced 3 papers each, 213 authors (2.86 %) 4 papers each, and 617 authors (8.30 %) 5 papers and above. 19 authors are highly productive and have produced 50 papers or more. vaheri a (160 papers), lundkvist a (134), plyusnin a (105), arikawa j (103) and hjelle b (94) are the most productive authors in the dataset. in our dataset, 701 authors have received no citations, 2608 authors had between 1 and 10 citations each, 3298 authors had between 11 and 100 citations each, and the rest (819 authors) have 100 citations or more. 44 authors had 1000 or more citations each, with peters cj (6671 citations), ksiazek tg (6048), vaheri a (5969), lundkvist a (4612) and rollin pe (4590) garnering the top five slots as the most cited authors. in both number of papers per author and citations per author, we notice that a few authors have been significantly more productive than the rest, a common feature of research productivity in most disciplines. of 428,546 cumulative citations by authors (if a paper has four co-authors and has received ten citations, the cumulative citations for the authors would be ten for each author), 350,588 citations (or 81.80 %) are garnered by the top 20 % of authors, thus almost fitting the 80/20 rule or power law.

as noted earlier, a whole host of research has shown that co-authorship in papers across disciplines has gone up, especially in the last two decades. an analysis of co-authorship (the average number of co-authors on each paper) in our dataset shows that the same is true for publications in the field. however, there has not been a striking increase in the number of co-authors between the two time periods: the 1980-2004 timeframe had an average of 5.29 authors per paper, compared to 6.82 in the period between 2005 and 2014. about 95 % of papers were co-authored (had at least two authors). the most common arrangement was 4-6 authors per paper. there were 333 4-author papers, 326 5-author papers and 332 6-author papers. more authors per paper are symbolic of experimental research.
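the '80/20' observation above is easy to verify once cumulative citations per author are available. in the sketch below, the three real top-cited authors from the dataset are combined with hypothetical smaller counts purely to make the example runnable; the input mapping is an assumption.

```python
# Check how concentrated cumulative citations are among the most-cited authors.
def top_share(citations_per_author, top_fraction=0.20):
    counts = sorted(citations_per_author.values(), reverse=True)
    k = max(1, int(round(top_fraction * len(counts))))   # size of the top group
    return sum(counts[:k]) / sum(counts)

citations_per_author = {"peters cj": 6671, "ksiazek tg": 6048, "vaheri a": 5969,
                        "author d": 120, "author e": 35, "author f": 4,    # hypothetical
                        "author g": 0, "author h": 2, "author i": 1, "author j": 9}
print(f"top 20 % of authors hold {top_share(citations_per_author):.1%} of cumulative citations")
```

applied to the full author table, a share close to 80 % would reproduce the power-law-like concentration reported above.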
two papers had 67 and 86 co-authors respectively. table 1 shows the list of the top ten most cited papers in hantavirus research. the most cited paper is the 1993 paper by nichol et al. (1993), published after the outbreak of hantavirus in the four corners region of the united states. their study showed that the comparison of the human and rodent sequences established a direct genetic link between the virus in infected rodents and infected human 'hantaviral ards' cases. on the heels of this study was another highly cited paper by duchin et al. (1994). an early paper on nephropathia epidemica (ne) that remains highly cited is that of brummerkorvenkontio et al. (1980); their study concluded that the detection of ne antigen in rodents (bank voles) facilitates 'specific serologic diagnosis of ne'. the heat map drawn using vosviewer (van eck and waltman 2010) in fig. 5 shows the scientific landscape based on co-citations. co-citation analysis looks at the relatedness of items based on the number of times they are cited together. we use author (first author only) co-citations for the analysis. visualization is done automatically by the software after a threshold is provided by the user. the papers nichol st, 1993 (science), schmaljohn c, 1997 (emerging infectious diseases) and lee hw, 1978 (journal of infectious diseases) are among the most influential papers.

in this section, we investigate whether the connectedness and relative position of authors have an effect on research performance and then analyze the bonded communities embedded in co-authorship networks. the process of constructing the network is explained in the materials and methods section. the author-based networks have 49 components (connected clumps of nodes) with 7373 vertices and 49,747 edges between them (see fig. 6). the giant component (the largest component) is healthy, occupying 84.19 %, or 6208 vertices, with 47,117 edges between them. a healthy giant component may be an indication of frequent collaborative activity. the giant component is considered the seat of the main activity in the research community (fatt et al. 2010). knowledge flows in such networks are faster, as they are not subject to the disruptions that would otherwise occur had the giant component been small and the whole network fragmented into several small components. a recent study by liu and xia (2015) found that, as an inter-disciplinary field develops, its giant component continues to grow as more components connect to it. after all, it takes just one edge from a disconnected component to connect to the giant component, thus making the latter bigger in size. the average geodesic distance (the shortest distance between any two random nodes in the network) is just 4.15, meaning that, on average, two random authors in the hantavirus dataset are just about four hops away from one another. this is another indication that the authors are closely knit, and resource flow and delivery would be faster in this network when compared to networks that are sparse and fragmented. this also confirms the small-world nature of this network (newman 2001). small-world networks typically have shorter geodesic distances. the centrality values (degree, betweenness and pagerank) of authors make hjelle b (brian hjelle) the most connected author in the hantavirus research community (see table 2). dr. hjelle, a pathologist, is currently the md/ph.d program director at the university of new mexico (usa) and has several awards and recognitions to his credit.
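the first-author co-citation counts behind the fig. 5 heat map can be tallied with a few lines; the reference lists below are hypothetical and stand in for the parsed cited-reference fields of the web of science records.

```python
# Count first-author co-citations: two cited works are related when they
# appear together in the same citing paper's reference list.
from itertools import combinations
from collections import Counter

reference_lists = {                      # hypothetical citing papers
    "paper_1": ["nichol st", "schmaljohn c", "lee hw"],
    "paper_2": ["nichol st", "lee hw"],
    "paper_3": ["nichol st", "schmaljohn c"],
}

cocitations = Counter()
for cited in reference_lists.values():
    for a, b in combinations(sorted(set(cited)), 2):
        cocitations[(a, b)] += 1         # cited together one more time

for pair, count in cocitations.most_common(3):
    print(pair, count)
```

a co-citation matrix built this way is the input that vosviewer clusters and renders as the heat map.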
he has been conducting research on hantavirus since the 1990s and was also a member of the hantavirus pulmonary syndrome clinical trial committee for the national institute of allergy and infectious diseases, national institutes of health (collaborative antiviral study group) from 1993-1996 (http://pathology.unm.edu/faculty/faculty/cvs/brian-hjelle.pdf).

[fig. 6: the overall co-authorship network of the hantavirus dataset, drawn with a fruchterman-reingold force-directed layout (repulsive force between vertices 4.0, 50 iterations per layout); the darker clump at the center is symbolic of those nodes that are highly connected.]

the local clustering coefficient provides an interesting picture-those with high degree have a low clustering coefficient (correlation -0.404). why is this the case? the clustering coefficient, or transitivity, measures the likelihood that if nodes b and c have a common partner a, b would eventually connect with c. we surmise that a node or ego with many alters will likely have alters that have fewer connections among themselves; this is often the case because an ego with many connections draws them from several diverse sets of nodes. several studies in recent years have found that centrality measures indeed have a significant effect on research performance (abbasi et al. 2011; uddin et al. 2012). hence we set out to investigate whether centrality measures have effects on research performance in the dataset of hantavirus research, too. our correlation test (see table 3) confirms that, indeed, in the hantavirus dataset there is a significant correlation (p < 0.01) between centrality measures and research performance. however, what stands out is the correlation of pagerank with research performance. the strength of its correlation coefficient with research performance demonstrates an efficacy that is even higher than that of the well-known measures such as degree and betweenness centrality. the very fact that pagerank is based not just on the connections an author has but also on the quality of these connections provides it with better predictability for research performance.

here we also introduce an idea to detect 'bonded communities'. by increasing the threshold of edge weight between nodes, a research community can be drilled down to a level where those nodes that frequently interact with one another are revealed. the importance of strength or 'bondedness' needs attention, as this may provide new insights into the communities lying within. drilling down to the desired core (we call it the 'edge-core') is done by progressively increasing the edge weight until the most bonded communities become visible-this could happen at an edge weight of just three or four in sparse communities and ten or more in dense communities. when our network is reduced and visualized as an edge-weight-ten ('edge-core-ten') network (only nodes that have an edge weight of ten or more between them are shown), three distinct 'bonded' communities emerge. authors (or nodes) in these communities are involved in repeat associations with one another (see fig. 7). quite interestingly, at a threshold of edge weight ten and above, the community of hjelle b, the most connected author, becomes isolated. this probably goes to show that even the best connected author/s may not be embedded in bonded communities. community a is led by vaheri a (vaheri antti), who has co-authored with vapalahti o (vapalahti olli) (76 times), plyusnin a (plyusnin, alexander) (65 times) and lundkvist a (lundkvist ake) (48 times).
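both steps described above, reducing the network to its 'edge-core' and correlating centrality with productivity, can be sketched as follows. the threshold of ten mirrors the edge-core-ten network in the text; the pearson correlation is a stand-in for the ms-excel correlation function, and `papers_per_author` is an assumed input mapping.

```python
# (1) Reduce the network to its "edge-core": keep only ties with weight >= min_weight,
#     then read off the bonded communities as connected components.
# (2) Correlate an author's PageRank with a simple productivity measure.
import networkx as nx
from scipy.stats import pearsonr

def edge_core(G, min_weight=10):
    H = nx.Graph()
    for u, v, d in G.edges(data=True):
        if d.get("weight", 1) >= min_weight:          # keep only repeat collaborations
            H.add_edge(u, v, **d)
    # bonded communities = connected clumps of frequently interacting authors
    return [sorted(c) for c in nx.connected_components(H)]

def centrality_vs_performance(G, papers_per_author):
    pr = nx.pagerank(G, weight="weight")
    common = [a for a in pr if a in papers_per_author]
    r, p = pearsonr([pr[a] for a in common],
                    [papers_per_author[a] for a in common])
    return r, p

# Usage (with the toy graph G and an assumed productivity mapping):
# communities = edge_core(G, min_weight=10)
# r, p = centrality_vs_performance(G, {"vaheri a": 160, "lundkvist a": 134, "plyusnin a": 105})
```

lowering `min_weight` reproduces the behaviour described above for sparser fields, where bonded communities already emerge at a weight of three or four.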
antti vaheri works at dept of virology at university of helsinki, finland and has been active since the 1980s. as a matter of fact, his papers, (brummerkorvenkontio et al. 1980; schmaljohn et al. 1985) are among the most cited hantavirus related papers. olli vapalahti and alexander plyusnin, too, are associated with university of helsinki while ake lundkvist is associated with karolinska institute, sweden. while community a has japanese authors, this community has european authors and is dominated by scholars from university of helsinki. within the edge-core-ten community, ake lundkvist is the author with the highest betweenness. he is a bridge node connecting to the sub-community of germany-based authors-ulrich r, meisel h, kruger dh and klempa b. community b that has prominent authors arikawa j (arikawa jiro), yoshimatsu k (yoshimatsu kumiko), takashima i (takashima ikuo) and kariwa h (kariwa hiroaki) are all from japan's hokkaido university. being from the same institution also provides the necessary geographical proximity to carry out joint research. community c has prominent authors ksiazek tg (ksiazek, thomas g); rollin pe (rollin, pierre e), nichol st (nichol, stuart t), peters cj (peters, clarance james), zaki sr and khan as, all associated with center for disease control & prevention, atlanta, usa. another prominent author mills, jn (mills, james n) is associated with emory university, atlanta, usa. clarence james peters is an accomplished physician who has a well-cited book (peters and olshaker 1997) , while zaki sr has, to his credit, papers (duchin et al. 1994; zaki et al. 1995) that are among the top ten most cited papers on hantavirus. however, peters cj, zaki sr and khan as have not published (in the dataset) after 2007, 2002 and 2004, respectively. as we see, the community is dominated by authors from the centers for disease control and prevention and all the prominent authors are stationed in atlanta, which again shows that geographical proximity is an important factor for deep-bonded association. university of helsinki, karolinska institute, and swedish inst of infectious disease control dominate the institutional collaborations in europe. at the same time, the centers for disease control and prevention and university of new mexico have a sustained and bonded relationship within usa. based on collaboration among institutions contributing at least ten or more research papers, university of new mexico has the maximum degree (collaborating with 51 institutions), followed by university of helsinki (41), centers for disease control and prevention (36), and karolinska institute (32). all the prominent authors as discussed also belong to these institutions. in the same stride, we thus see sweden and finland involved with extensive collaboration (51 repeat associations) in europe while usa almost controls international collaboration with majority of countries including argentina (40 repeat associations), south korea (40), and peoples republic of china (32). germany has a fair share of collaboration with sweden and slovakia. here we scientometrically analysed the research landscape of hantavirus research. by network reduction or by drilling down into the network based on the strength of ties (or edge weights), we revealed the communities that thickly interact with one other. we demonstrate that these bonded communities actually capture the most prominent authors, too. 
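the institution-level degrees quoted above (e.g., university of new mexico collaborating with 51 institutions) come from collapsing the author network onto affiliations. a minimal sketch, in which `affiliation` is an assumed author-to-institution mapping and each author is simplified to a single affiliation:

```python
# Collapse an author-level co-authorship graph onto institutions.
import networkx as nx

def institution_network(G, affiliation):
    """`affiliation` maps each author to one institution (a simplification)."""
    I = nx.Graph()
    for u, v, d in G.edges(data=True):
        iu, iv = affiliation.get(u), affiliation.get(v)
        if iu is None or iv is None or iu == iv:
            continue                                   # skip unknown or within-institution ties
        w = d.get("weight", 1)
        if I.has_edge(iu, iv):
            I[iu][iv]["weight"] += w                   # accumulate repeat associations
        else:
            I.add_edge(iu, iv, weight=w)
    return I

# degree of an institution = number of distinct partner institutions:
# I = institution_network(G, affiliation); print(dict(I.degree()))
```

the same collapsing step, applied to country codes instead of institutions, yields the international collaboration counts reported above.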
in our opinion, these bonded communities are the core or ''central brain'' of the network where central activity takes place. we also theorize that strength of relationship is an equally important criterion (apart from centrality measures) for sustainable research performance. pagerank stands out in its correlation with the research performance which further substantiate the idea that it is not only the number of other authors an author is connected to but the quality of these authors (how well those co-authors are connected) that ensures research visibility. where m jk is the number of geodesic paths from vertex j to vertex k(j, k = i) and m jik is the number of geodesic paths from vertex j to vertex k, passing through vertex i (otte and rousseau 2002; linton 1977) . pagerank is an importance measure that is calculated based on the premise that 'having links to page p from prominent pages, is a good indication that page p is important one too' (page et al. 1999) . identifying the effects of co-authorship networks on the performance of scholars: a correlation and regression analysis of performance measures and social network analysis measures error and attack tolerance of complex networks scale-free networks evolution of the social network of scientific collaborations reflections on scientific collaboration, (and its study): past, present, and future studies in scientific collaboration hantavirus infection: a review and global update nephropathia epidemica-detection of antigen in bank voles and serologic diagnosis of human infection the contingent value of social capital social capital in the creation of human capital hantavirus pulmonary syndrome-a clinical description of 17 patients with a newly recognized disease the structure of collaboration in the across institutional boundaries? research collaboration in german public sector nanoscience a global perspective on hantavirus ecology, epidemiology, and disease what is research collaboration? research policy mapping research collaborations in the business and management field in malaysia on giant components in research collaboration networks: case of engineering disciplines in malaysia research collaboration networks of two oic nations: comparative study between turkey and malaysia in the field of 'energy fuels the assortativity of scholars at a research-intensive university in malaysia. the electronic library hantaviruses-a short review a set of measures of centrality based on betweenness structure and evolution of co-authorship network in an interdisciplinary research field studying research collaboration using co-authorships the structure of a social science collaboration network: disciplinary cohesion from 1963 to 1999 the structure of scientific collaboration networks coauthorship networks and patterns of scientific collaboration genetic identification of a hantavirus associated with an outbreak of acute respiratory illness social network analysis: a powerful strategy, also for the information sciences the pagerank citation ranking: bringing order to the web virus hunter: thirty years of battling hot viruses around the world big science, little science antigenic and genetic properties of viruses linked to hemorrhagic-fever with renal syndrome hantaviruses: a global disease problem analyzing (social media) networks with nodexl scientific collaboration. 
annual review of information science and technology towards understanding longitudinal collaboration networks: a case of mammography performance research trend and efficiency analysis of co-authorship network software survey: vosviewer, a computer program for bibliometric mapping social network analysis, methods and applications (1st edition, structural analysis in the social sciences) collective dynamics of 'small-world' networks honorary and ghost authorship in high impact biomedical journals: a cross sectional survey hantavirus pulmonary syndrome: pathogenesis of an emerging infectious-disease the epidemic characteristics and preventive measures of hemorrhagic fever with syndromes in china acknowledgments part of the analysis of this study was completed during s.k's research visit to tu-ilmenau, germany. the study is supported by high impact research, university of malaya, grant number um.c/625/1/hir/mohe/sc/13/3. sna measures (kumar and jan 2013a, b) .a component is a set of nodes joined in such a way that any single random node in the network could reach out to any other random node by ''…traversing a suitable path of intermediate collaborators'' (newman 2004) .clustering coefficient, c, is also known as 'transitivity' and more accurately as the 'fraction of transitive triples' (wasserman and faust 1994) . mathematically, clustering coefficient is calculated as:where the number of triangles represents trios of nodes in which each node is connected to both others, and connected triples represent trios of nodes in which at least one node is connected to both others (barabasi et al. 2002; newman 2004) . degree is the most common and probably the most effective centrality measure to determine both the influence and importance of a node. a degree is simply the number of edges incident on the vertex. mathematically, degree k i of a vertex iswhere g ij = 1 if there is a connection between vertices i and j and g ij = 0 if there is no such connection. (otte and rousseau 2002) .betweenness centrality of a vertex i is the fraction of geodesic paths that pass through i, which could be mathematically represented as key: cord-340101-n9zqc1gm authors: bzdok, danilo; dunbar, robin i.m. title: the neurobiology of social distance date: 2020-06-03 journal: trends cogn sci doi: 10.1016/j.tics.2020.05.016 sha: doc_id: 340101 cord_uid: n9zqc1gm abstract never before have we experienced social isolation on such a massive scale as we have in response to covid-19. yet we know that the social environment has a dramatic impact on our sense of life satisfaction and well-being. in times of distress, crisis, or disaster, human resilience depends on the richness and strength of social connections, as well as active engagement in groups and communities. over recent years, evidence emerging from various disciplines has made it abundantly clear: loneliness may be the most potent threat to survival and longevity. here, we highlight the benefits of social bonds, choreographies of bond creation and maintenance, as well as the neurocognitive basis of social isolation and its deep consequences for mental and physical health. j o u r n a l p r e -p r o o f 3 are conventionally most concerned about all had much less impact on survival rates. key factors included obesity, diet, alcohol consumption, how much exercise was taken, the drug treatments prescribed, and local air pollution. 
these authors conducted a follow-up analysis of 70 studies of longevity in older people, which followed ~3.5 million people over an average of ~7 years [16] : social isolation, living alone and feeling lonely increased the chances of dying by about 30%, even after accounting for age, sex and health status. many other studies have shown that social isolation (though not self-reported feelings of loneliness) was a significant predictor of the risk of death. for example, a longitudinal analysis of ~6,500 british men and women in their fifties [17] found that being socially isolated increases the risk that you will die in the next decade by about 25%. quantitative analysis of nearly ~400,000 married couples in the american medicare database revealed that, for men, the death of their spouse increased their own chances of dying in the immediate future by 18%. the death of the husband in turn increased the wife's risk of dying by 16% [18] . similar effects on morbidity rates have been found with respect to social support. a series of elegant prospective studies using data from the framingham heart study [19, 20] found that the chances of becoming happy, depressed or obese were all strongly mirrored by similar changes in the closest friend. there was a smaller significant effect due to the behaviour of the friends' friend. even a just detectable effect was present due to the friend of a friend's friend, but nothing beyond. this contagion phenomenon was especially strong if the friendship was reciprocal (i.e., both individuals listed each other as a friend). if the friendship was not mutual, the social contagion effect was negligible. the investigators also documented a strong effect of "geographical contagion". if you have a happy friend who lives within a mile radius, you are 25% more likely to become happy. and you are 34% more likely to be happy if your next-door neighbour is happy. people who belong to more groups are less likely to experience bouts of depression. such findings emerged from the uk longitudinal study of ageing (elsa) that repeatedly profiled around ~5,000 people from the age of 50 onwards. previous research showed [21] that depressed people reduced their risk of depression at a later time point by almost a quarter if they joined a social group such as a sports club, church, political party, hobby group or charity. indeed, joining three groups individuals were more immersed in their local community and trusted their neighbours more [22] [23] [24] . the causal directionality was difficult to pin down in these cases because of the cross-sectional nature of the data. nevertheless, path analysis provided some indication that intensity of social exchange was the candidate driver. the impetus to access social capital in the wider community [7] extends beyond humans. there is now a wealth of evidence from long-term field studies of wild baboons that socially wellconnected females experience less harassment by other monkeys [7, 23] , have lower levels of cortisol stress hormones [25, 26] , faster wound healing [27] , produce more offspring and live longer [28] [29] [30] [31] . such ramifications of social capital appear to hold up across a diversity of species, including chimpanzees [32] , macaques [33] [34] [35] , feral horses [36, 37] and dolphins [38] . a key underlying reason for these effects, at least in humans, is likely that loneliness directly impairs the immune system, making you less resistant to diseases and infections. 
research found [39] that freshmen students who reported feeling lonely had a reduced immune system response when they were given a flu vaccine compared to students who felt socially well engaged. moreover, those students with only 4-12 close friends had significantly poorer responses than those with 13-20 friends. these two effects seemed to interact with each other: having many friends (a large social group of nineteen or twenty friends) seems to buffer against a weakened immune response. yet, feeling lonely and having few friends results in a particularly poor immune defence. other investigators [40] used data from the framingham heart study to show that people with fewer contacts in their social network had elevated serum fibrinogen concentrations. in contrast, people enjoying many social contacts had low fibrinogen levels. fibrinogen plays an important role in blood clotting when a blood vessel has been ruptured, as well as facilitating wound healing and tissue repair more generally: high concentrations thus signal poor health. endorphins constitute a core component of the psychoendocrine mechanism underpinning friendship (see box 1). other research found [41] that social bonds stimulate the release of the body's natural killer cells, one of the white blood cells of the innate immune system whose core function is to destroy harmful bacteria and viruses. people who are more socially integrated have better adjusted biomarkers for physiological function, as indexed by lower systolic blood pressure, lower body mass index, and lower levels of c-reactive protein - the latter being another molecular response to inflammation. this insight was evident in each of four age groups (adolescents, young adults, middle age and old age) based on data from four large longitudinal american health databases [42]. the investigators found that, in adolescence, lack of social engagement had as big an effect on risk of inflammation as lack of physical activity. in old age, lack of friends had a bigger effect on risk of hypertension than the usually cited clinical causes like diabetes. even more worrying, the effects of social relationships on these physiological measures of good health during adolescence and young adulthood can persist into old age. in a longitudinal study of 267 males, for example, research found [43] that the more socially integrated a child was at six years of age, the lower their blood pressure and body mass index (a measure of fatness) two decades later in their early thirties. this result held up when they controlled for race, body mass index in childhood, parental socioeconomic status, childhood health and extraversion. social isolation may well have pervasive effects on brain connectivity. if rats are socially isolated when young (a condition that would give rise to feelings of loneliness in humans), neural function and plasticity are altered [44] [45] [46] [47]. in particular, episodes of social isolation can irretrievably alter the function of the prefrontal cortex (the part of the brain that is central to managing our social relationships [see below]), as well as its axon myelinisation (the laying down of the fatty sheaths around neurons that enable them to transmit signals faster and more efficiently) [44]. while short periods of loneliness in humans rarely have any long-term adverse outcomes, persistent loneliness escalates the risk of alzheimer's disease and depression [48, 49].
loneliness also leads to poor sleeping habits, with adverse psychological and physiological consequences [50]. the fact that friends can have such dramatic effects on our health and well-being may lead us to suppose that the more friends we have, the better. however, the number of friends and family relationships we can manage at any given time is limited by cognitive constraints to ~150 [51, 52]. there is, however, considerable individual variation, with social network sizes ranging between approximately 100-250. a number of fairly conventional factors are responsible for this variation: age (younger people typically have larger social networks than older people [53]), sex (females usually have larger social networks than males [53, 54], though this does vary with age [55]), personality (extraverts have larger social networks than introverts [56]; women who score high on the neuroticism personality dimension have fewer acquaintances than those who score lower [57]). friendships, however, require the investment of considerable time to create and maintain. the emotional quality of a friendship depends directly on the time invested in a given social link [57] [58] [59]. one prospective study estimated that it takes around 200 hours of face-to-face contact over a three-month period to turn a stranger into a good friend [60]. conversely, the emotional quality of a relationship will decline rapidly (figure 1) if contact rates drop below those appropriate to the relationship quality [61].
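the time-budget arithmetic implied here and in the following passage can be made explicit: roughly 3.5 hours of daily social interaction, with ~40 % going to the 5 closest contacts and ~20 % to the next 10. assigning the remaining budget evenly to the other ~135 network members is an assumption used only to reproduce the roughly 30-seconds-a-day figure for the outermost layer.

```python
# Rough daily time budget across the layers of a ~150-person social network.
DAILY_SOCIAL_MINUTES = 3.5 * 60          # about 20 % of the waking day
layers = [
    ("closest 5",      5,   0.40),       # inner circle
    ("next 10",        10,  0.20),       # next layer of close friends
    ("remaining ~135", 135, 0.40),       # assumed: the rest of the budget
]
for name, size, share in layers:
    per_person_minutes = DAILY_SOCIAL_MINUTES * share / size
    print(f"{name:>15}: {per_person_minutes * 60:5.0f} s per person per day")
```

the outermost layer works out to roughly half a minute per person per day, consistent with the figure quoted in the passage that follows.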
while the role of close contacts, like friends, is pivotal, other regular contacts can also contribute to one's social capital. previous authors [71] famously claimed that weak -as opposed to strong, or close -ties provide important sources of external information. analyses of information flow in social networks suggest that sources outside the 50 closest friendships offer few benefits [72] . other benefits of interaction with more loose social ties can, of course, include heightened subjective well-being and sense of belonging to the local community [73] . however, as is often the case in such studies, it is crucial to precisely define the meaning of weak versus strong ties, since all weak ties belonged to the same community (a student class). regular interaction with different people at the periphery of social networks can give rise to heightened perceived social and emotional fulfillment in ways that act as psychological buffers [24] , although this might depend on personality or social style [74] . social-affective processes in the presence of others take a different form than during the others' physical absence. already in a nursery, if a baby starts crying, other nearby babies hear the distress signal and typically also start crying by mere emotional contagion. in addition to utterances and prosody, humans tend to align their communication towards each other by imitating vocabulary, grammar, mimics and gestures. for instance, humans tend to unconsciously synchronize their facial expressions even with people who are directing gaze at somebody else [75] . such subliminal motor and emotional resonance is typically found to be intrinsically rewarding [76] . on the positive side, contagion processes can uplift an individual's happiness through people within the close neighborhood, but also miles apart [19] . on the negative side, loneliness also spreads rapidly through an individual's social interaction partners, thus affecting even friends of friends of friends [77, 78] . reading others' faces -impossible during a conventional phone call -may be an evolutionarily conserved means for exchanging pivotal information, which coevolved with the corresponding decoding machinery in brain and behavior responses (see next section). faces offer a plethora of social information about an individuals' sex, age, ethnicity, emotional expression and potentially their intentions and mental state (all of which influence the strength of the bond between two individuals [59] ). throughout development, learning and maturing critically hinge on joint attention of two individuals on the same object [79, 80] . such mentalizing and eye gaze processes have been repeatedly linked to the higher associative and the striatal reward circuitry [79, [81] [82] [83] . some authors even argue that the importance of such facets of interpersonal exchange may explain why humans developed wide and white sclera in the eyes -more easily visible than in most animals [84] . what may lead to greater vulnerability to predators for some species (by making the individual and her intentions more visible and exploitable) may have boosted learning and cooperation in human primates [85] . such evolutionary adaptations facilitate how humans automatically represent the (visual) perspective of nearby others. making statements about objects channels [88] . compared to actual interpersonal encounters, a surprising number of psychological constants exist in how humans entertain and juggle with social relationships in digital environments. 
for example, the upper bound of ~150 contacts (cf. above), as well as the structure of these networks, appears to hold across both the real world and a variety of virtual online contexts [53, 68, 89, 90], suggesting that group size in today's society is still orchestrated by the same principles as when we were hunter-gatherers. indeed, several neuroimaging studies [e.g., 51, 91] broadly confirm that our online social networks correlate with the volumes of the same core brain regions that resonate with the size of our offline networks [52, 92]. these constancies suggest that lively virtual social interaction may similarly entrain faculties like memory and concept generation. conversely, paucity of social interaction and loneliness may have deleterious effects on the cognitive and memory systems. it is conceivable that enhancement or decline of cognitive and neural reserve may be mediated by analogous pathways potentially involving dendritic arborization in the hippocampal and prefrontal regions [49]. the need for personalized interactions may already be reflected in the way that stock market traders sometimes add coded numbers to money transfers (e.g., 10,000,467 instead of 10,000,000 shares) as a potential replacement for the recognition of somebody's unique facial identity, rather than remaining anonymous [93, 94]. this attraction to the full range of face-to-face cues during social interactions may explain why emojis have become so popular: they replace the important emotional signals in the absence of the ostensive facial cues that we use for the interpretation of utterances in the face-to-face environment.

these considerations raise the important question of how the brain implements toggling between real-world social interactions and virtual or imagined social interaction in the absence of physical contact [79]. the right temporoparietal junction was proposed as a key switching relay between two antagonistic classes of neurocognitive processes: those more anchored in one's current external sensory environment and more stimulus-independent ones relying on internally generated information [95]. this idea was later substantiated by a multi-modal neuroimaging study in 10,000 humans [96]: the right and left temporoparietal junction explained most variation in functional coupling changes between all major brain networks. hence, these two association cortex regions may help mediate shifts of focus from the person in front of you to a person you are texting with on the phone, who is out of sight or touch.

taken together, the evidence on digital communication suggests that this new medium does not in fact change the general pattern of our social interactions or the numbers of people we contact [68, 89, 90, 97]. the sizes of the layers in our social networks are unchanged by using digital media or virtual communication. also, the frequencies with which we contact certain people in each social layer are strikingly similar in the online and offline worlds. some digital vehicles, however, lack the communicative richness of real face-to-face interactions: when asked to rate their satisfaction with interactions with their five closest friends each day, participants rated face-to-face and skype interactions as equally satisfying, and both as significantly more satisfying than interactions with the same individual by phone, text messaging, sms messaging, email or text-based social media such as facebook [87].
human and non-human primates live in groups mainly to minimize external ecological threats, including predators, raiding by neighbors, and environmental risk. advanced forms of cooperation are rare in non-primate species [98, 99] and probably emerged in non-human primates several million years ago. today, the average human spends up to 80% of waking hours in the presence of others [100, 101]. investing cognitive resources in keeping track of friends, family and colleagues is highly demanding, more costly than contemplating physical facts [102, 103]. not only time limits (cf. above) but also neurocognitive limits [e.g., 104] effectively constrain how close one can be to how many individuals. but how is regular social stimulation reflected in neurobiology?

in monkeys [105, 106] and in humans [51, 52, 107, 108], various indices of sociality and measures of social network size are robustly associated with specific regions of the neocortex. these same regions are responsible for processing social information such as predicting others' intentions [109, 110]. at least some of these brain-behavior associations may be cross-culturally consistent in humans, as evidenced by a structural neuroimaging study in the usa and china [111]. whole-brain analyses have repeatedly highlighted a relationship between the ventromedial prefrontal cortex and measures of social network complexity and social competence [92, 105, 110, 112-115]. the ventromedial prefrontal cortex and striatal nucleus accumbens have been found to play a key role in both social reward behaviors and the amount of social stimulation in humans [113] and other mammals [e.g., 44, 47]. functional neuroimaging has shown that these neural correlates are also implicated in tracking others' popularity status in real-world social networks [116]. similarly, positron emission tomography has shown that, in humans, the density of mu-receptors for beta-endorphin, especially in the ventromedial prefrontal cortex, correlates with social attachment style, for which endorphins are more important than other neuropeptides [117]. other evidence, such as a functional neuroimaging study on maintenance and manipulation of social working memory [104], has also related the dorsomedial prefrontal cortex to social network properties. there are similar correlations for social cognitive skills like mentalizing that are crucial to maintaining functional social relationships [118-120].

analyses of social richness and brain morphology in humans tend to identify a neural network involving the prefrontal cortex together with several parts of the so-called default mode network as being crucial for managing social networks (e.g., noonan et al., 2018). this major brain network of the higher association cortex has probably expanded recently in primate evolution [121]. its constituent regions are often thought to support several of the most sophisticated neurocognitive processes [122, 123]. in monkeys, there is evidence that experimental manipulation of social group size results in adaptations in the volume of frontal brain regions, the posterior superior temporal sulcus or temporo-parietal junction, as well as the amygdala and other parts of the limbic system [105, 106]. in humans, there is evidence for structural coupling between social network size, measured by number of online friends, and parts of the default mode network, including the hippocampus [51].
from a clinical perspective, functional connectivity alterations in the default mode network have been demonstrated as a consequence of feelings of loneliness in younger adults [124]. moreover, the default mode network is especially subject to vulnerability in normal cognitive aging [125], and is among the main brain circuits to be impacted by neuropathology in alzheimer's disease [126, 127]. complementing the higher associative parts of the human social brain [128], amygdala volume is larger in individuals with more extensive social networks [52, 107]. amygdalar functional connectivity was also reported to increase with canonical brain networks implicated in face perception and approach-avoidance behaviour [107]. indeed, previous authors reported [129] that a patient with complete bilateral amygdala lesions lacked a sense of appropriate personal space vis-à-vis other people (figure 3). this patient exhibited no discomfort when at close distances from another person, even to the point of touching the other's nose, despite the fact that their conceptual understanding of people's private physical space was intact. in contrast, healthy individuals typically show amygdala activation in response to close personal proximity. in a similar vein, the grey-matter volume of the amygdala correlated negatively with social phobia [130]. the amygdala may hence be required to trigger the strong emotional reactions normally associated with personal space violations, thus regulating interpersonal distance in humans.

such reports on the social brain often seemed to be in conflict about whether to highlight the prefrontal cortex or the amygdala of the limbic system. this apparent discrepancy was reconciled in a recent population neuroimaging study [131]: social traits such as daily exchange with family, friends, and work colleagues were associated with brain morphology in ~10,000 uk biobank participants. particularly prominent findings were reported in the limbic system, where volumes varied consistently with various indicators of social isolation. less socially stimulated participants showed volume effects in various parts of the social brain, including the ventromedial prefrontal cortex and the amygdala, in addition to the nucleus accumbens of the reward circuitry. volume effects in these regions were reported for several markers of brittle social integration, such as living in a socially "emptier" household, knowing fewer individuals with whom to regularly share experiences and concerns, feeling unsatisfied with one's friendship circles, as well as having grown up without brothers or sisters and being unhappy with one's family situation [131]. this analysis also demonstrated wide-ranging sex differentiation in how traits of social isolation are linked to brain morphology. these findings underscore evidence from animals for a sex-specific co-evolutionary relationship between the primate brain and social complexity [social brain hypothesis: 132, 133].

the perspective of brain network integration in loneliness was investigated in a seminal neuroimaging study of intrinsic functional connectivity in ~1,000 humans [124]. careful analysis showed that feelings of loneliness especially affect the neural communication strength between the limbic system and the default mode network, as well as the communication strength inside the default mode network.
as a particularly discriminatory pattern for loneliness, impoverished functional modularity was found for the default mode network and its interacting brain networks. in contrast, a positive sense of one's meaning in life was linked to strengthened functional differentiation of the canonical network ensemble. the collective evidence led the investigators [124] to argue that the default mode network and its coupling partners represents a neural signature reflecting one's own purpose in life versus social disconnection to others. according to unicef estimates, ~140 million children worldwide live deprived of parents who could provide comfort and support. ~8 million of these children grow up in institutions without the socioemotional context of a regular family. in one of the earliest randomized clinical trials of its kind, orphans raised in institutions were systematically compared to orphans who were later welcomed into a foster home [134] . abandoned children were randomly assigned either to remain under the care of the institution or to transition to the care of foster-parents. their cognitive trajectories were monitored over several years. those children who remained in the institution showed significantly lower development indices and lower iqs [of around 70: 134] than the adopted orphans. being deprived of social bonds with caregivers also led to a pernicious reduction in grey-and white-matter tissue and lower fiber tract integrity as evidenced by brain mri [134] . institutional rearing was also shown to exacerbate the decay of the telomeres in cell nuclei [135, 136] . these protection caps normally prevent chromosome deterioration, which acts like a cellular sand clock of aging. their shortening has major consequences for various biological pathways and health outcomes. the younger the children were when adopted by a foster family, the better the cognitive performance later [137] . impoverished cognitive domains include memory and executive function: for orphans who transitioned to a foster home, some cognitive facets remained below-average throughout later life (e.g., short-term visual memory and attention allocation). other cognitive dimensions (e.g., visual-spatial memory and spatial working memory) caught up with a normal trajectory at age 16 [134] . such unique evidence underlines the fact that lack of socioemotional context in early life severely impedes brain development and maturation of the cognitive repertoire, which can be partially mitigated by developing social bonds to non-genetic parents (see box 2) . early psychosocial deprivation also shows inter-generational effects, which are probably mediated through maternal and epigenetic effects [138] . social isolation in childhood leads to molecular annotations of the genetic strand (such as methylation or phosphorylation of the histones that provide the structure for dna strands) that are passed on to influence how children cope with stress and in turn how they raise their own children. for instance, in rats, socioemotional experience as a pup has an impact on how the rat's own pups later deal with stress and high anxiety levels [139] . epigenetic regulation of gene transcription is involved in how maternal care promotes the rat pup's brain development and cognitive maturation. more licking and grooming by the mother increases protein expression of the grm1 gene in the pup's hippocampus. 
this up-regulated gene transcription leads to greater availability of glutamate receptor proteins in hippocampal cells for inter-neuronal signaling [140] . in humans, a longitudinal neuroimaging study indeed showed that social support from the mother promotes volume growth trajectories in the hippocampus, and predicts socioemotional development and emotion regulation in early adolescence [141] . in young rhesus monkeys, loss of social contact to the mother leads to behavioral aberrations that last right into adulthood. such social isolation was shown to entail down-regulated dendritic growth in the prefrontal cortex and reduction in gene expression in the amygdala [142] . social adversity undergone by children with institutional upbringing led to disturbed functional connectivity between the prefrontal cortex and the amygdala [143] . such perturbed brain maturation through social deprivation may be mediated by glucocorticoids, which are known to be inhibited by maternal care in primates [144] . hence, maternal care is a critical enrichment of the social environment that promotes maturation, expression of growth hormones, and synaptogenesis in various brain circuits. in contrast, social neglect leads to disturbed social attachment, as well as increased aggression and hyperactivity, often potentially lifelong [145, 146] . how vulnerable an individual is to parental deprivation is subject to complex nature-nurture interactions that are strongly conditioned on personality and overall genetic endowment [147, 148] . rats separated early from their mothers were impaired in adult life in emotion regulation and arousal management [149] . early socioemotional isolation of rat pups had impact on whether these rats later showed healthy responses to stress by mounting adequate cortisol levels [150] . hormones of the hypothalamic-pituitary-adrenocortical (hpa) axis are an important endocrine mechanism of stress neurobiology that plays a key role in social isolation. in baboon monkeys, infant survival is jeopardized for mothers who are more socially isolated and not well integrated in the local communities including ties to sisters, adult daughters, and other mothers [151] . monkey mothers with a thinner social network are less likely to have infants which themselves have high fitness [28] . female baboon monkeys with a larger close social circle of grooming partners have healthy cortisol levels and typically deal better with stressful situations [25, 26, 152] . when one of these strong social bonds is disrupted, such as when a close member of the social group is killed by predators, cortisol titres rise in the blood. such monkeys then tend to seek out new connections to "repair" the lost link in their social network [153] . a lower-than-usual cortisol level in the morning is indicative of extended stress periods in adults [154] . the same diurnal cortisol dynamic is frequently observed in disturbed child-caregiver relationships [155] . in rhesus monkeys, a low hormone response has been observed after repeated separations from the mother. the same observation has been reported for children who were moved between several caregivers. an intact child-caregiver relationship probably provides a stress reserve to adrenoreceptor responses so that children get over stressful episodes quicker [156, 157] . 
after undergoing adversity in early childhood, such as emotional or physical neglect, maltreatment, or maternal separation, enhancement of the child-caregiver relationship can mitigate the effect of previous hits to the hpa system. early disturbance in important social relationships is linked to dysfunctional cortisol homeostasis in adult life [158]. in some neglected children, the ensuing problems and behavioral disruptions can even be exacerbated in adult life [159]. abnormal blood cortisol levels can potentially be prevented, mitigated or restored by family-based therapy and other interventions [160]. nonetheless, dysregulated diurnal cortisol levels are further linked to various mental disorders including major depression, substance abuse, and post-traumatic stress disorder [154], in addition to stress-induced impact on the immune, cardiovascular, and metabolic systems [161, 162].

further insight into the neurobiology of social isolation has also been derived from rigorous experiments with adult primates (see also box 3). in one study, 20 monkeys were separated from others to live alone for 1.5 years [163]. subsequently, the monkeys were re-integrated into social groups of four monkeys housed together. repeated positron emission tomography (pet) scanning revealed increased levels of d2 receptors in the basal ganglia, which include key nodes of the reward circuitry (see above), after being socially housed. this neurochemical adaptation in the monkeys' brain circuitry was apparent after as few as 3 months of social rehabilitation [163]. these authors also reported several differences with respect to social integration and social rank: monkeys of higher rank were groomed more by others, whereas subordinate monkeys spent more time by themselves. as a consequence at the behavioral level, the lower-rank monkeys were also significantly more willing to self-administer cocaine, which may also relate to heightened drug abuse in lonely humans [164]. such molecular imaging evidence shows that changing from social deprivation to an environment with constant social stimulation causes neural remodeling in the dopaminergic neurotransmitter pathways in non-human primates, which may be clinically relevant for substance abuse disorders in humans.

we are social creatures. social interplay and cooperation have fuelled the rapid ascent of human culture and civilization. yet, social species struggle when forced to live in isolation. the expansion of loneliness has accelerated in the past decade. as one consequence, the united kingdom has launched the 'campaign to end loneliness', a network of over 600 national, regional and local organizations to create the right conditions for reducing loneliness in later life. such efforts speak to the growing public recognition and political will to confront this evolving societal challenge. these prospects should encourage us to search for means to mitigate possible negative backlash. we offer some suggestions in box 4. additional insight into stress-responsive brain systems is imperative to tailor clinical decision making and therapeutic interventions to single individuals. there is also a dire need for additional longitudinal research on the hpa axis and the cortisol response to psychological stressors.

we are grateful to guillaume dumas and tobias kalenscher for valuable comments on a previous version of the manuscript.
db was supported by the healthy brains healthy lives initiative (canada first research excellence fund), and by the cifar artificial intelligence chairs program (canada institute for advanced research).

primates service their relationships through social grooming. grooming triggers the endorphin system in the brain through a very specific neural system: the afferent ct fibres [165]. these axon bundles have receptors at the base of most hair follicles, have the unusual property of being unmyelinated (and hence very slow, especially compared to the pain receptors in the skin), lack a return motor loop (unlike pain and other proprioceptive neurons), respond to a very specific stimulus (light, slow stroking at ~2.5 cm per second) and directly trigger the endorphin reward system [166]. although humans no longer have the full fur covering that encourages social grooming, we still have the receptors and instead use physical contact in the form of touching, stroking, caressing, and hugging as a means for strengthening social ties in our more intimate relationships [167, 168]. physical touch is intimate, and hence limited mainly to close family and friends (figure 2). to bond our wider range of relationships as well as our more intimate ones, humans exploit a number of behaviours that turn out to trigger the endorphin system. these joint activities include laughing [169, 170], singing [171, 172], dancing [173, 174], feasting [22] and emotional storytelling [175]. an important feature of all these behaviours is that behavioural synchrony seems to ramp up the level of endorphin release [174, 176].

in baby primates, close social interaction is not only beneficial, but critical for maturation and resilience. experiments in baby monkeys showed that upbringing in social isolation during the first years causes a variety of social deficits. when separated early from their mothers, baby monkeys showed strong symptoms of social withdrawal: self-hurting behaviour like biting, stereotypical and repetitive motor behaviour, excessive avoidance behavior towards others, as well as poor social and maternal skills as adults. when separated later from their mothers, baby monkeys tended to indiscriminately approach unknown monkeys without fear [cf. 177]. reports of human children in some crowded russian and romanian orphanages painted a strikingly similar picture: socially and emotionally abandoned children showed either forward-backward rocking tics and social escape or an overly strong attachment style, analogous to neglected baby monkeys [cf. 178]. these cases invigorated the then-contested claim that mother-child bonds are indispensable for normal development, and that foster-care parents can compensate for many of these needs [134, 160, 179, 180]. disruption of social interplay during critical development impacts negatively on cognitive, verbal, social and motor performance, and predisposes to mental health issues. in other words, early neglect remains measurable in brain and behaviour in later life.

the socioemotional dialogue between caregiver and baby is mediated in several important ways. mothers speak to their offspring in "baby talk", which potentially evolved only recently in humans [181]. accompanied by direct face-to-face exchanges, these communication bouts with characteristic vocabulary and prosody promote infant developmental milestones.
the interpersonal stimulation grabs the baby's attention, helps her gain weight faster, modulates her emotional state, and enhances various health outcomes. mother-infant communication is also delivered through direct skin-to-skin contact [166]. postnatal touching bolsters mother-infant bonding, alleviates anxiety, and provides intrinsic pleasure through endorphin release [182-184]. throughout life, and quite independent of geography, primate societies are orchestrated by the creation, curation, and cultivation of social bonds through purposeful social closeness.

among the many consequences of loneliness on body and mind, the scarcity of social contact encourages drug compensation behaviour, such as alcoholism, possibly via non-social rewards triggering dopaminergic neurotransmitter pathways [163]. at the genetic level, loneliness was shown to entail under-expression of anti-inflammatory genes involved in the glucocorticoid response and over-expression of genes related to pro-inflammatory immune responses [185]. fortunately for future clinical intervention, loneliness may be a modifiable determinant in healthy aging [11]. as people grow older, the social network typically becomes smaller, naturally diminishing the cognitive stimulation that comes from frequent and intense social interaction on a daily basis, and thus potentially reducing the neural reserve. over the last century, the average human lifespan in developed nations has increased by nearly three decades. on the other hand, older people were also reported to show a decline in the capacity to take other people's point of view, as demonstrated in three separate mentalizing tasks [163]. these authors showed that social cognition deficits were related to decreased neural activity responses in the medial prefrontal default mode network [163]. this capacity is likely to be particularly important when introspecting other people's minds when they are not physically present, where social cues like facial expression, mimics, and gestures are missing. both limited social stimulation and weakening social reflection capacities relate to the sense of loneliness in complicated and important ways [13]. once lonely, a bias towards negative processing of cues from others hinders social rehabilitation in a downward cycle [4, 186]. many recent studies have corroborated the corpus of empirical evidence that feelings of loneliness escalate the risk of certain neurological diseases, and especially alzheimer's disease, in later life [49].

social isolation at massive scale risks creating cohorts of individuals who are socially dysfunctional. it may therefore be important to identify ways of mitigating the worst of the effects so as to alleviate the consequences. the following possible countermeasures may be worth exploring:

• one promising intervention would involve creating opportunities where mutual social support relationships (friendships) could develop naturally. you cannot, however, force people to become friends: both parties need to be willing to devote resources to each other in a context where the available time budget for social engagement is limited [187, 188] and there are competing friendship interests [66]. however, by providing more opportunities for people to meet in congenial environments, new friendships may blossom.

• social neuroscientists [189] undertook a longitudinal intervention study on 332 matched adults who underwent regular training sessions.
several months of cognitive training improved empathy for others' affective state and perspective-taking of others' mental state, which resulted in structural remodeling in brain regions belonging to the social brain network, including the frontoinsular network and the default mode network. daily affective training resulted in thickening of the right anterior and mid-insula, with correspondingly enhanced compassion ratings. different training regimes correlated with different brain regions. further research is urgently needed to explore therapeutic interventions using training of social capacities in socially deprived humans.

• one important lesson is that joining clubs can have important benefits in reducing both a sense of loneliness and psychological or psychiatric conditions [21]. one obvious solution is to encourage vulnerable individuals to join social groups and communities that suit their interests and abilities. establishing a wide range of such clubs is likely to be much cheaper than paying for care homes and prisons.

• singing is known to have a dramatic, immediate effect on creating a sense of social engagement and elevating psychological well-being [the "ice-breaker effect": 171]. vulnerable individuals could be encouraged to join choirs and community singing groups. encouragement and funding may need to be invested in establishing a network of choirs.

• use of video-embedded digital communication is likely to gain in importance. this is especially true where family and friendship groups can meet in the same virtual space. the visual component of the interpersonal encounter appears to play a key role in creating a more satisfying experience of digital social media [87].

figure 1: emotional closeness at the start of the study is set at 0 for both groups. redrawn from [61].

figure 2: in 1,368 people from several countries, this study investigated the permissibility of social touch [167]. the authors showed that human social touch is particularly dependent on the nature of the relationship. the topography of accepted social touching depends on many factors, including a) emotional relationship, b) type of interpersonal bond including kinship, c) sex, and d) power dynamics. close acquaintances and family members are touched for a wider range of reasons. the influence of culture, measured in five countries, was small. touch by females, rather than by the opposite sex, was evaluated as more pleasant, and it was consequently allowed on larger bodily areas. reproduced from [167].

• what further refinements of online digital media might improve people's ability to create and maintain friendships, especially for the housebound? it is insufficiently known which types of modern medium best mimic which neurocognitive facets of real social interaction.

• which neurobiological mechanisms explain how the default mode network and its connections to subordinate brain systems support higher social capacities, and their decline under social deprivation? this associative brain network needs to be more completely understood, especially regarding the congruencies and idiosyncrasies between healthy aging trajectories, the experience of social isolation, and vulnerability to neurodegenerative pathologies. in terms of progress towards causal understanding, putting a premium on longitudinal studies holds out unprecedented promise.
• across the entire lifespan, to what extent does reduced social stimulation or too few social contacts lead to loss in general capacities of the cognitive repertoire? how much do people struggling with cognitive load have issues maintaining many active social relationships? or both? progress on this chicken-and-egg problem will shed light on the aetiopathology of loneliness, and usher in new intervention strategies.

the phenotype of loneliness evolutionary mechanisms for loneliness perceived social isolation and cognition the growing problem of loneliness loneliness and attention to social threat in young adults: findings from an eye tracker study the cultural context of loneliness: risk factors in active duty soldiers it is all about who you know: social capital and health in lowincome communities counting on kin: social networks, social support, and child health status impact of social connections on risk of heart disease, cancer, and all-cause mortality among elderly americans: findings from the second longitudinal study of aging (lsoa ii) associations of social networks with cancer mortality: a meta-analysis social and emotional support and its implication for health the effect of social relationships on survival in elderly residents of a southern european community: a cohort study social networks and health social isolation, social activity and loneliness as survival indicators in old age; a nationwide survey with a 7-year follow-up social relationships and mortality risk: a meta-analytic review loneliness and social isolation as risk factors for mortality: a meta-analytic review social isolation, loneliness, and all-cause mortality in older men and women the effect of widowhood on mortality by the causes of death of both spouses dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the framingham heart study connected: the surprising power of our social networks and how they shape our lives social group memberships protect against future depression, alleviate depression symptoms and prevent depression relapse breaking bread: the functions of social eating social structure as a strategy to mitigate the costs of group living: a comparison of gelada and guereza monkeys religiosity and religious attendance as factors in wellbeing and social engagement.
religion focused grooming networks and stress alleviation in wild female baboons social stressors and coping mechanisms in wild female baboons (papio hamadryas ursinus) social affiliation matters: both same-sex and opposite-sex relationships predict survival in wild female baboons social bonds of female baboons enhance infant survival strong and consistent social bonds enhance the longevity of female baboons the benefits of social capital: close social bonds among female baboons enhance offspring survival network connections, dyadic bonds and fitness in wild female baboons social support reduces stress hormone levels in wild chimpanzees across stressful events and everyday affiliations family network size and survival across the lifespan of female macaques decoupling social status and status certainty effects on health in macaques: a network approach responses to social and environmental stress are attenuated by strong male bonds in wild macaques sociality increases juvenile survival after a catastrophic event in the feral horse (equus caballus) social bonds between unrelated females increase reproductive success in feral horses social and genetic interactions drive fitness variation in a free-living dolphin population loneliness, social network size, and immune response to influenza vaccination in college freshmen social connectedness is associated with fibrinogen level in a human social network opiate antagonist prevents î¼-and î´-opiate receptor dimerization to facilitate ability of agonist to control ethanol-altered natural killer cell functions and mammary tumor growth social relationships and physiological determinants of longevity across the human life span friends with health benefits: the long-term benefits of early peer social integration for blood pressure and obesity in midlife postweaning social isolation enhances morphological changes in the neonatal ventral hippocampal lesion rat model of psychosis abnormalities of presynaptic protein cdcrel-1 in striatum of rats reared in social isolation: relevance to neural connectivity in schizophrenia region-specific impairments in parvalbumin interneurons in social isolation-reared mice a critical period for social experience-dependent oligodendrocyte maturation and myelination feelings of loneliness, but not social isolation, predict dementia onset: results from the amsterdam study of the elderly (amstel) loneliness and risk of alzheimer disease loneliness is associated with sleep fragmentation in a communal society online social network size is reflected in human brain structure social brain volume is associated with in-degree social network size among older adults emotional arousal when watching drama increases pain threshold and social bonding social networks, support cliques, and kinship sex differences in social focus across the life cycle in humans extraverts have larger social network layers: but do not feel emotionally closer to individuals at any layer individual differences and personal social network size and structure social relationships and the emergence of social networks the anatomy of friendship how many hours does it take to make a friend? 
managing relationship decay: network, gender, and contextual effects the social brain hypothesis closeness, loneliness, support: core ties and significant ties in personal communities different strokes from different folks: community ties and social support hamilton's rule predicts anticipated social support in humans persistence of social signatures in human communication social network size in humans calling dunbar's numbers did distance matter before the internet? going that extra mile: individuals travel further to maintain face-to-face contact with highly related kin than with less related kin the strength of weak ties structure and function in human and primate social networks: implications for diffusion, network stability and health social interactions and well-being: the surprising power of weak ties limited communication capacity unveils strategies for human interaction what's in a smile? neural correlates of facial embodiment during social interaction fairness and cooperation are rewarding: evidence from social cognitive neuroscience alone in the crowd: the structure and spread of loneliness in a large social network attraction and close relationships a second-person approach to other minds joint attention, shared goals, and social bonding the computation of social behavior reward value of attractiveness and gaze ale meta-analysis on facial judgments of trustworthiness and attractiveness the eyes have it: the neuroethology, function and evolution of social gaze unique morphology of the human eye seeing it their way: evidence for rapid and involuntary computation of what other people see effects of duration and laughter on subjective happiness within different modes of communication: happiness and mode of communication technology-mediated communication in familial relationships: moderated-mediation models of isolation and loneliness human social networks the structure of online social networks mirrors those in the offline world the social network-network: size is predicted by brain structure and function in the amygdala and paralimbic regions ventromedial prefrontal volume predicts understanding of others and social network size the modular neuroarchitecture of social judgments on faces the social brain: allowing humans to boldly go where no other species has been characterization of the temporo-parietal junction by combining datadriven parcellation, complementary connectivity analyses, and functional decoding subspecialization within default mode nodes use of social network sites and instant messaging does not lead to increased offline social network size, or to emotionally closer relationships with offline network members coevolution of neocortical size, group size and language in humans games people play-toward an enactive view of cooperation in social neuroscience a survey method for characterizing daily life experience: the day reconstruction method gossip, reputation, and social adaptation higher order intentionality tasks are cognitively more demanding mental time travel into the past and the future in healthy aged adults: an fmri study social working memory predicts social network size in humans social network size affects neural circuits in macaques baboons (papio anubis) living in larger social groups have bigger brains amygdala volume and social network size in humans orbital prefrontal cortex volume predicts social network size: an imaging study of individual differences in humans ventromedial prefrontal volume predicts understanding of others and social network size 
orbital prefrontal cortex volume correlates with social cognitive competence gray matter volume of the anterior insular cortex and social networking segregation of the human medial prefrontal cortex in social cognition the brain structural disposition to social interaction the structural and functional brain networks that support human social networks orbital prefrontal cortex volume correlates with social cognitive competence neural mechanisms tracking popularity in real-world social networks adult attachment style is associated with cerebral î¼-opioid receptor availability in humans: opioids and attachment building blocks of social cognition: mirror, mentalize, share? social cognition and the brain: a meta-analysis are there theory of mind regions in the brain? a review of the neuroimaging literature surface-based and probabilistic atlases of primate cerebral cortex dark control: the default mode network as a reinforcement learning agent. hum brain mapp the brain's default network: anatomy, function, and relevance to disease loneliness and meaning in life are reflected in the intrinsic network architecture of the brain social-cognitive deficits in normal aging cortical hubs revealed by intrinsic functional connectivity: mapping, assessment of stability, and relation to alzheimer's disease structural covariance of the default network in healthy and pathological aging computing the social brain connectome across systems and states personal space regulation by the human amygdala reduced amygdalar and hippocampal size in adults with generalized social phobia 000 social brains: sex differentiation in human brain anatomy. science advances neocortex evolution in primates: the 'social brain' is for females primate brain evolution: genetic and functional considerations cognitive recovery in socially deprived young children: the bucharest early intervention project telomere length and early severe social deprivation: linking early adversity and cellular aging accelerated telomere shortening: tracking the lasting impact of early institutional care at the cellular level long-term effects of institutional rearing, foster care, and brain activity on memory and executive functioning nongenomic transmission across generations of maternal behavior and stress responses in the rat behavior of adult rats is modified by the experiences their mothers had as infants variations in postnatal maternal care and the epigenetic regulation of metabotropic glutamate receptor 1 expression and hippocampal function in the rat preschool is a sensitive period for the influence of maternal support on the trajectory of hippocampal development amygdala gene expression correlates of social behavior in monkeys experiencing maternal separation early developmental emergence of human amygdala-prefrontal connectivity after maternal deprivation psychobiological mechanisms underlying the social buffering of the hypothalamic-pituitary-adrenocortical axis: a review of animal models and human studies across development maternal care, gene expression, and the transmission of individual differences in stress reactivity across generations early adversity and critical periods: neurodevelopmental consequences of violating the expectable environment beyond diathesis stress: differential susceptibility to environmental influences biological sensitivity to context: ii. 
empirical explorations of an evolutionary-developmental theory critical periods for the effects of infantile experience on adult learning maternal and environmental influences on the adrenocortical response to stress in weanling rats social relationships among adult female baboons (papio cynocephalus) i. variation in the strength of social bonds sociophysiology of relationships in squirrel monkeys. i. formation of female dyads behavioural and hormonal responses to predation in female chacma baboons ( papio hamadryas ursinus ) stress and disorders of the stress system the potential role of hypocortisolism in the pathophysiology of stressrelated bodily disorders maternal behavior predicts infant cortisol recovery from a mild everyday stressor dampening of adrenocortical responses during infancy: normative changes and individual differences disentangling psychobiological mechanisms underlying internalizing and externalizing behaviors in youth: longitudinal and concurrent associations with cortisol effects of a therapeutic intervention for foster preschoolers on diurnal cortisol activity importance of studying the contributions of early adverse experience to neurobiological findings in depression the influence of social hierarchy on primate health social dominance in monkeys: dopamine d2 receptors and cocaine selfadministration loneliness and alcohol abuse: a review of evidences of an interplay the neurophysiology of unmyelinated tactile afferents social touch modulates endogenous î¼-opioid system activity in humans topography of social touching depends on emotional bonds between humans cross-cultural similarity in relationship-specific social touching sharing the joke: the size of natural laughter groups social laughter triggers endogenous opioid release in humans the ice-breaker effect: singing mediates fast social bonding singing and social bonding: changes in connectivity and pain threshold as a function of group size naltrexone blocks endorphins released when dancing in synchrony synchrony and exertion during dance independently raise pain threshold and encourage social bonding emotional arousal when watching drama increases pain threshold and social bonding rowers' high: behavioural synchrony is correlated with elevated pain thresholds social recovery by isolation-reared monkeys research network on early experience and brain development early adverse care romania's abandoned children function of infant-directed speech stroking modulates noxious-evoked brain activity in human infants the communicative functions of touch in humans, nonhuman primates, and rats: a review and synthesis of the infants autonomic cardio-respiratory responses to nurturing stroking touch delivered by the mother or the father social regulation of gene expression in human leukocytes examining the visual processing patterns of lonely adults time as a limited resource: communication strategy in mobile phone networks cognitive resource allocation determines the organization of personal networks structural plasticity of the social brain: differential change after socioaffective and cognitive mental training key: cord-015861-lg547ha9 authors: kang, nan; zhang, xuesong; cheng, xinzhou; fang, bingyi; jiang, hong title: the realization path of network security technology under big data and cloud computing date: 2019-03-12 journal: signal and information processing, networking and computers doi: 10.1007/978-981-13-7123-3_66 sha: doc_id: 15861 cord_uid: lg547ha9 this paper studies the cloud and big data technology based on 
the characteristics of network security, including virus invasion, data storage, system vulnerabilities, network management, etc. it analyzes some key network security problems in the current cloud and big data network. on this basis, this paper puts forward technical ways of achieving network security.

cloud computing is a service based on the increased usage and delivery of internet-related services; it promotes the rapid development of big data information processing technology and improves the processing and management abilities of big data information. with the rapid development of computer technology, big data technology brings not only huge economic benefits but also an evolution of social productivity. however, a series of security problems has appeared, and how to increase network security has become the key issue. this paper analyzes and discusses the technical ways of achieving network security.

cloud computing is a kind of widely used distributed computing technology [1-3]. its basic concept is to automatically divide a huge computing program into numerous smaller subroutines through the network, and then hand the processing results back to the user after searching, calculating and analyzing by a large system of multiple servers [4-6]. with this technology, web service providers can process tens of millions, if not billions, of pieces of information in a matter of seconds, reaching a network service as powerful as a supercomputer [7, 8]. cloud computing is a resource delivery and usage model: it means obtaining resources (hardware, software) via the network. the network providing the resources is called the 'cloud'. the hardware resources in the 'cloud' appear infinitely scalable and can be used whenever needed [9-11]. cloud computing is the product of the rapid development of computer science and technology. however, the problem of computer network security in the context of cloud computing brings a lot of trouble to people's life, work and study [12-14]. therefore, scientific and effective management measures should be taken, in combination with the characteristics of cloud computing technology, to minimize computer network security risks and improve the stability and security of computer networks. this paper briefly introduces cloud computing, analyzes computer network security problems under cloud computing, and expounds the network security protection measures under cloud computing.

processing data by cloud computing can save energy expenditure and reduce the processing cost of big data, thus promoting the healthy development of cloud computing technology. analyzing big data by cloud computing technology can be described by a directed acyclic data flow graph $G = (V, E)$, where the cloud service module in the parallel selection mechanism is made up of a group of components $V = \{i \mid i = 1, 2, \ldots, v\}$ and a set of remote data transfer hidden channels $E = \{(i, j) \mid i, j \in V\}$. assume that data transmission in the data flow model under the c/s framework is described by the directed graph model $G_P = (V_P, E_P, SCAP)$, where $E_P$ represents the link set, $V_P$ the set of physical nodes bearing the cross channels, and $SCAP$ the data-unit capacity of each physical node. besides, assume that the undirected graph $G_S = (V_S, E_S, SAR_S)$ expresses the data packet markers input by the application.
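to make the graph notation above concrete, here is a minimal, illustrative sketch (not the paper's implementation) of a small data-flow dag with a computing weight for each component and a transfer time for each edge, together with a naive greedy rule for deciding which components to migrate to the cloud; the numeric values, the energy constants and the decision rule itself are assumptions made for the example.

```python
# illustrative sketch only: a toy data-flow DAG G = (V, E) with per-component
# computing weights w_i (local execution time) and per-edge data transfer
# times, plus a naive greedy rule for choosing which components to migrate
# to the cloud. all numbers and energy constants are assumptions.
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    w: float                                       # computing time w_i on the device
    out_edges: dict = field(default_factory=dict)  # successor name -> transfer time

graph = {
    "ingest":  Component("ingest",  w=1.0, out_edges={"filter": 3.0}),
    "filter":  Component("filter",  w=5.0, out_edges={"analyze": 0.8}),
    "analyze": Component("analyze", w=9.0, out_edges={"report": 0.3}),
    "report":  Component("report",  w=0.5, out_edges={}),
}

E_COMPUTE_PER_SEC = 1.0   # assumed device energy per second of local computing
E_TRANSFER_PER_SEC = 0.6  # assumed device energy per second of radio transfer

def migrate_to_cloud(comp: Component) -> bool:
    """greedy rule: offload when the device energy saved by not computing
    locally exceeds the energy spent shipping the component's output data."""
    local_energy = comp.w * E_COMPUTE_PER_SEC
    transfer_energy = sum(comp.out_edges.values()) * E_TRANSFER_PER_SEC
    return local_energy > transfer_energy

for comp in graph.values():
    place = "cloud" if migrate_to_cloud(comp) else "local"
    print(f"{comp.name:8s} -> {place}")
```

a fuller formulation, like the link-mapping and placement model discussed next, would optimize the placement variables jointly rather than component by component.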
the process of link mapping between cloud computing components and the overall architecture can be explained as follows: for different customer demands, an optimized resource-allocation model is built in order to construct the application model for big data processing. the built-in network link structure for big data information processing is as follows: in fig. 1, the $i$-th transmission package in the cloud computer is denoted $I_i$, and $T_i$ represents the transmission time of $I_i$. the interval at which a component is mapped to a thread or process is given by $J_i = T_i - T_d$; when $J_i = T_i - T_d$ lies in the range $(-\infty, \infty)$, the weight of node $i$ is $W_i$, which represents its computing time. the detailed application model of big data information processing is shown in fig. 2.

in the mobile cloud system model, the grid architecture relies on local computing resources and the wireless network to build cloud computing, and selects which components of the data flow graph to migrate to the cloud. for the formula modeling of cloud computing data processing, $\{G(V, E); S_i; D_i; J\}$ is the given data flow application. assuming that the channel capacity is infinite, the problem of using cloud computing technology to optimize big data information processing is described as a maximization over the component placement variables $x_i$ and $y_{i,j}$; among them, the energy overhead of data flow migrating between groups in mobile cloud computing is also described in this model.

4 main characteristics of network security technology

in the context of big data and cloud computing, users can save data in the cloud and then process and manage it there. compared with the original network technology, it carries certain data network risks, but its security coefficient is higher. cloud security technology can use modern network security technology to realize centralized upgrades and guarantee the overall security of big data. since the data is stored in the cloud, strengthening cloud management is the only way to ensure the security of the data. big data stored in the cloud usually affects network data. most enterprises will connect multiple servers so as to build computing terminals with strong performance. cloud computing itself is convenient: customers of its hardware facilities do not need to purchase additional services; they only need to purchase storage and computing services. due to its particularity, cloud computing can effectively reduce resource consumption and is also a new form of energy conservation and environmental protection. when local computers encounter risks, data stored in the cloud will not be affected, nor will it be lost, and at the same time these data can be shared. the sharing and transfer of raw data is generally based on physical connections before data transfer is implemented. compared with traditional data research, data sharing in big data cloud computing can be realized by using the cloud: users can collect data with the help of various terminals, and so gain a strong data sharing capability.

most computer networks face risks from system vulnerabilities. criminals use illegal means to exploit system vulnerabilities and invade other systems. system vulnerabilities not only include the vulnerabilities of the computer network system itself; the computer system can also easily be affected by the user's downloading of unknown plug-ins, thus causing further vulnerability problems. with the continuous development of the network, virus forms have also diversified, but a virus mainly refers to a destructive program created by human factors.
due to the diversity of viruses, their degree of impact also differs. customer information and enterprise files can be stolen by viruses, resulting in huge economic losses, and some viruses are highly destructive: they will not only damage the relevant customer data but also cause network system paralysis. in the context of big data cloud computing, external storage of the cloud computing platform can be realized through various distributed facilities. the service characteristics of the system are mainly evaluated in terms of efficiency, security and stability. storage security plays a very important role in the computer network system. computer network systems come in many kinds, store large volumes of data, and the data has diversified characteristics. traditional storage methods have been unable to meet the needs of social development, and merely optimizing data encryption methods cannot meet the demands of the network. the deployment and processing of cloud computing data requires data storage with a certain stability and security, so as to avoid economic losses to the user. in order to ensure data security, it is necessary to strengthen computer network management. all computer managers and application personnel are the main body of computer network security management. if the network management personnel do not have a comprehensive understanding of their responsibilities and adopt an unreasonable management method, data leakage will occur. network security management is especially important for enterprise, government and other information management. in practice, many computer operators do not pay enough attention to network security management, leading to computer intrusion crises and thus data exposure problems. 6 ways to achieve network security one of the main factors influencing the big data cloud storage system is data layout. at the present stage it is usually explored in combination with the characteristics of the data to implement a unified layout. management and preservation functions are carried out according to the distribution of data types, and the data is encrypted. the original data is stored in more than one cloud, and different data management levels have different abilities to resist attacks. for cloud computing, data storage, transmission and sharing can all apply encryption technology. during data transmission, the party receiving the data can decrypt the encrypted data, so as to prevent the data from being damaged or stolen during the transmission. the intelligent firewall can identify the data through statistics, decision-making, memory and other ways, and achieve the effect of access control. by using mathematical concepts, it can eliminate the large-scale computing methods applied in the matching verification process and realize the mining of the network's own characteristics, so as to achieve the effect of direct access control. intelligent firewall technology includes risk identification, data intrusion prevention and early warning against unauthorized personnel. compared with the original firewall technology, the intelligent firewall technology can further prevent the network system from being damaged by human factors and improve the security of network data. system encryption technology is generally divided into public-key and private-key schemes, using encryption algorithms to prevent the system from being attacked. meanwhile, service operators should pay full attention to monitoring the network operation and improving the overall security of the network.
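the symmetric encryption of data before it is transmitted to or stored in the cloud, as described above, can be sketched with the python cryptography library; this is a minimal illustration rather than the scheme used in the paper, and the key handling (who holds the shared key, how it is exchanged via a public/private-key mechanism) is deliberately left out.

from cryptography.fernet import Fernet

# a shared symmetric key; in practice it would be exchanged via a public/private-key scheme
key = Fernet.generate_key()
cipher = Fernet(key)

plaintext = b"customer record to be stored in the cloud"   # illustrative payload
token = cipher.encrypt(plaintext)     # only the ciphertext leaves the local system
restored = cipher.decrypt(token)      # the receiving party decrypts with the same key
assert restored == plaintext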
in addition, users should improve their operational management of data. when attacked by viruses, static and dynamic technologies are used; dynamic technologies are efficient in operation and can support multiple types of resources. the safety isolation system is usually called a virtualized distributed firewall (vdfw). it is made up of a centralized management center of the security isolation system and security service virtual machines (svm). the main role of this system is to achieve network security. the key functions of the system are as follows. the access control function analyzes source/destination ip addresses, mac address, port and protocol, time, application characteristics, virtual machine object, user and other dimensions based on stateful access control. meanwhile, it supports many functions, including access control policy grouping, search and conflict detection. the intrusion prevention module judges intrusion behavior by using protocol analysis and pattern recognition, statistical thresholds and comprehensive technical means such as abnormal traffic monitoring. it can accurately block eleven categories of more than 4000 kinds of network attacks, including overflow attacks, rpc attacks, webcgi attacks, denial of service, trojans, worms and system vulnerabilities. moreover, it supports custom rules to detect and alert on network attack traffic, abnormal messages in traffic, abnormal traffic, floods and other attacks. it can check and kill trojan, worm, macro, script and other malicious code contained in email bodies/attachments, web pages and downloaded files based on streaming and transparent proxy technology. it supports ftp, http, pop3, smtp and other protocols. it identifies the traffic of various application layers, recognizing over 2000 protocols with a built-in library of thousands of application recognition features. this paper studies cloud and big data technology. in the context of big data cloud computing, the computer network security problem is gradually becoming prominent; in this case, the operating condition of the computer network should be combined with modern network security framework technology, so as to ensure the security of network information and thus create a safe network operation environment for users.
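the stateful access control function described above can be illustrated with a minimal rule-matching sketch; the rule table, the field names and the first-match-wins policy are hypothetical simplifications and not the vdfw implementation.

from dataclasses import dataclass

@dataclass
class Packet:
    src_ip: str
    dst_ip: str
    dst_port: int
    protocol: str
    mac: str

# hypothetical policy: each rule lists allowed values per field; the first matching rule wins
RULES = [
    {"action": "deny",  "dst_port": {23},      "protocol": {"tcp"}},   # block telnet
    {"action": "allow", "dst_port": {80, 443}, "protocol": {"tcp"}},   # web traffic
    {"action": "allow", "dst_port": {53},      "protocol": {"udp"}},   # dns
]

def decide(packet: Packet, default: str = "deny") -> str:
    for rule in RULES:
        fields = {k: v for k, v in rule.items() if k != "action"}
        if all(getattr(packet, field) in allowed for field, allowed in fields.items()):
            return rule["action"]
    return default                      # deny anything not explicitly allowed

print(decide(Packet("10.0.0.2", "10.0.0.9", 443, "tcp", "aa:bb:cc:dd:ee:ff")))   # allow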
application and operation of computer network security prevention under the background of big data era
research on enterprise network information security technology system in the context of big data
self-optimised coordinated traffic shifting scheme for lte cellular systems
network security technology in big data environment
data mining for base station evaluation in lte cellular systems
user-vote assisted self-organizing load balancing for ofdma cellular systems
discussion on network information security in the context of big data
telecom big data based user offloading self-optimisation in heterogeneous relay cellular systems
application of cloud computing technology in computer secure storage
user perception aware telecom data mining and network management for lte/lte-advanced networks
self-optimised joint traffic offloading in heterogeneous cellular networks
network information security control mechanism and evaluation system in the context of big data
mobility load balancing aware radio resource allocation scheme for lte-advanced cellular networks
wcdma data based lte site selection scheme in lte deployment
key: cord-102776-2upbx2lp authors: niu, zhibin; cheng, dawei; zhang, liqing; zhang, jiawan title: visual analytics for networked-guarantee loans risk management date: 2017-04-06 journal: nan doi: 10.1109/pacificvis.2018.00028 sha: doc_id: 102776 cord_uid: 2upbx2lp groups of enterprises guarantee each other and form complex guarantee networks when they try to obtain loans from banks. such secured loans can enhance solvency and promote rapid growth during an economic upturn. however, potential systemic risk may arise within the risk-binding community; especially during an economic downturn, a crisis may spread through the guarantee network like dominoes. monitoring financial status and preventing or reducing systemic risk when a crisis happens is a major concern of the regulatory commission and banks. we propose a visual analytics approach for loan guarantee network risk management, and consolidate five analysis tasks with financial experts: i) visual analytics for enterprise default risk, whereby a hybrid representation is devised to predict the default risk and an interface is developed to visualize key indicators; ii) visual analytics for high default groups, whereby a community detection based interactive approach is presented; iii) visual analytics for high default patterns, whereby a motif detection based interactive approach is described, and we adopt a shneiderman mantra strategy to reduce the computation complexity; iv) visual analytics for the evolving guarantee network, whereby animation is used to help understand the guarantee dynamics; v) a visual analytics approach and interface for the default diffusion path. the temporal diffusion path analysis can be useful for the government and banks to monitor the default spread status. it also provides insight for taking precautionary measures to prevent and dissolve systemic financial risk. we implement the system with case studies on a real-world guarantee network. two financial experts were consulted and endorsed the developed tool. to the best of our knowledge, this is the first visual analytics tool to explore guarantee network risks in a systematic manner. financial safety is a main concern of the government and banks. the majority of small and medium enterprises (smes) find it difficult to get loans from banks because of their limited credit qualifications, so they often need to seek loan guarantees.
in fact, guaranteed loans are already an important way to raise money in addition to seeking a stock listing. in some developed economies like the us and uk, special government-backed banks are established to provide guarantee credit [22, 27, 30, 40, 55] ; while in emerging economies like korea [19] and china [31] , it is more common that corporations guarantee each other when they are trying to secure loans from lending institutions. it is reported that a quarter of the $13 trillion in total outstanding loans in china in 2014 were guaranteed loans [40] , with an 18% year-to-year increase [36] . this has led to a noticeable new phenomenon: a large number of corporations back each other and form complex guarantee networks. an appropriate guarantee union may reduce default risk, but contagious damage across the networked enterprises may happen in practice. in an economic downturn, large-scale breach of contract would seriously deteriorate banking asset quality and cause a systemic crisis. the loan guarantee network has existed for less than twenty years and is still not well understood. the financial academic community has published some qualitative analyses of small guarantee networks, and there is little quantitative research. in the banking industry, credit assessors evaluate an enterprise basically on the basis of the classic credit rating approach. such an approach is not well suited to the complex benefit relationships. risk management for the loan guarantee network is challenging. firstly, the loan guarantee network may consist of thousands of enterprises with complex guarantee relationships and intertwined risk factors, making it very difficult to analyze. fig. 1 illustrates a real guarantee network we constructed using ten years of bank loan records; it consists of more than 1000 enterprises, each of which has more than 3000 financial entries. monitoring the financial status is so difficult that usually only after a capital chain rupture can the regulators study the case in depth. secondly, small and medium enterprise business operations have inadequate transparency (for example, loan officers do not have access to enterprise net asset information), which makes loan risk evaluation more difficult. some borrowers fraudulently obtain loans by exploiting weaknesses in bank lending risk management. the understanding of risky loan guarantees, especially malicious guarantees, is still relatively limited. thirdly, thousands of guarantee networks of different complexities coexist over long periods and evolve over time; this requires adaptive strategies for preventing, identifying and dismantling systemic crises. against the complex background of the growth period, the painful period of structural adjustment and the early stage of the stimulus period, structural and deep-level contradictions emerge in economic development; all kinds of risk factors along the guarantee network accelerate risk transmission and amplification, and the guarantee network may degenerate from a "mutual aid group" into a source of "breach of contract". in this paper, we propose a visual analytics approach for loan guarantee network risk management. it includes visual analytics for i) enterprise default risk; ii) high default groups; iii) high default patterns; iv) the evolving guarantee network; and v) the default diffusion path. in a nutshell, the main contributions are: 1.
we consolidate with financial experts and identify five key research problems for loan guarantee network risk management, which is driven by emerging finance industry demands, and we believe this is an important research problem to the visual analytics science and technology community; 2. we propose intuitive visual analytics approaches for the tasks of i) enterprises default risk; ii) high default groups; iii) high default pattern; iv) evolving guarantee network; and v) default diffusion path. 3. we construct real loan guarantee network and perform empirical study on ten years of bank loan records. we highlight three high default patterns which are difficult to be discovered without visual analytic approach. we conduct interviews with two banking loan experts and got endorsed. the rest of the paper is organized as following: section 2 describes works involving different aspects related to our problem; section 3 details the five visual analytic tasks and our approaches; section 4 describe the data, case study; and we report user study results in section 5. conclusions and future works are described in section 6. to our best knowledge, this is the first work of visual analysis for the loan guarantee network risk management. we thus introduce several relevant work on network analytics in the financial domain; anomalous and significant subgraph detection in attributed networks; and works on financial security visualization. credit risk evaluation consumer credit risk evaluation is often technically addressed in a data-driven fashion and has been extensively investigated [5, 24] . since the seminal "partial credit" model [39] , numerous statistical approaches are introduced for credit scoring, including logistic regression [60] , k-nn [26] , neural network [18] , support vector machine [28] . more recently, [4] presents an in-depth analysis on how to interpret and visualize the learned knowledge embedded in the neural networks using explanatory rules. the authors in [32] combine debt-to-income ratio with consumer banking transactions, and use a linear regression model with timewindowed data set to predict the default rates in a short future. they claim a 85% default prediction accuracy and can save cost between 6% and 25%. financial network analytics financial crises and systemic risk have always been a major concern [9, 21] . networks or graph is a natural representation of the financial systems as they often bear complex interdependence and connections inside [2] . the relationship between network structure and financial system risk are carefully studied and several insights have been drawn: network structure has few impact for system welfare but plays an important role in determining systemic risk and welfare in short-term debt [3] . after the 2008 global financial crisis, network theory attracts more attention: the crisis brought by lehman brothers spreads on connected corporations in a similar infectious way as the epidemic of severe acute respiratory syndrome (sars) in 2002 -both are small damage that hits a networked system and causes serious events [8, 13] . the journal of nature physics organizes a special on how to understand some fundamental economic issues using network theory [1] . these publications suggest the applicability of network based financial model. for example, the dynamic network produced by bank overnight funds loan may be an alert of the crisis [13] . 
contrary to the conventional stereotype that large institutions are "too big to fail", the truth is the position of the institution in the network is equally and sometimes more important than its size [6] . more central the vertex is to the graph, more influential it is to the whole economic network when default occurs [13] . moreover, the research that aims to understand individual behavior and interactions in the social network, has also attracted extensive attention [7, 20, 46, 47, 61, 62, 67] . although preliminary efforts have been made using network theory to understand fundamental problems in financial systems [12, 17, 64] , there is little work on the system risk analysis in the loan guarantee network except for the preliminary work [41] . among them, may be the most important work is using k-shell decomposition to predict the default rate; positive correlation between the k-shell decomposition value of the network and default rates was reported [41] . anomalous and significant subgraph detection in network anomalous and significant subgraphs have been applied in many domains such as societal events in social media, new business discovery, auction fraud, fake reviews, email spams, false advertising [42, 54] . classic anomalous and significant subgraphs refer to subgraphs, in which the behaviors (attributes) of the nodes or edges are significantly different from the behaviors of those outside the subgraphs [48] . anomalous and significant subgraphs in social network can be used for early detection of emerging events such as civil unrest prediction, rare disease outbreak detection, and early detection of human rights events. the heterogeneous social network is modeled as a sensor network in which each node senses its local neighborhood, computes multiple features, and reports the overall degree of anomalousness. p-values of the subgraphs are used to represent the significance, and iterative subgraph expansion are used for the scaling problem [15] . emerging events such as crimes or disease cases are detected from spatial networks [34, 44] . a common challenge for the subgraph detection is the complexity. as many of the algorithms are turned into subgraph isomorphism problem which is n-p complete problem, it is computationally infeasible for naive search. algorithms are designed to optimize the performance. readers are referred to [43, 58, 59, 68] for more details. visualization in financial systems financial risk is a major concern of the government and the banks. visual analysis can enhance the understanding and communication of risk, help to analysis risks and prevent systemic risks. this is done by developing interpretable models, and and couple them with visual, interactive interfaces. in modern banking industry the business becomes more and more complex, the risk assessment and risk loan pattern detection have attracted a major concern. animation is used to visually analysis large amounts of time-dependent data [63] . in [29] , 3d tree map are introduced to monitor the real-time stock market performance and to identify a particular stock that produced an unusual trading patterns. interactive exploratory tool is designed to help the casual decision-maker quickly choose between various financial portfolios [50] . coordinated specific keywords visualization within wire transactions are used to detect suspicious behaviors [14] . 
the self-organizing map (som), a neural network based visualization tool is often used in financial risk visualization analysis, for monitoring the sovereign defaults occurrence in less developed countries [52] , visual analysis of the evolution of currency crises by comparing the clusters of crises between decades [51] , and discovering imbalances in financial networks [53] . self-organizing time map (sotm) are used to decompose and identify temporal structural changes in macro financial data around the global financial crisis in 2007 and 2009. readers are referred to [37] for more references on financial visualization. we consult with financial experts and consolidate five analysis tasks. in this section, we give an brief introduction in the first place before describing detailed algorithm, strategy and interactions. fig. 2 gives the overview of the system and tasks. we first construct the real loan guarantee networks from bank record, perform statistical analysis and employ machine learning based approach to predicate enterprise default risk. all these data are fit into the interface to finish the tasks proposed by financial experts. specifically, the tasks include: t1: visual analytics for enterprise default risk. the current internal loan credit rating system is based on the pure financial status of the individual borrower. credit assessor can usually access to the first layer of guarantee chain, and could not trustfully evaluate the entire guarantee network. in order to avoid inadequate risk assessment, it is necessary to carry out a systematic analysis of the enterprise. t2: visual analytics for high default group. identifying the high default groups helps the banking experts single out and tackle the principal default problem. visual analytics tools should be developed for thoroughly analyzing of the network, and recognize high defaults enterprises. t3: visual analytics for high default pattern. some known guarantee patterns may lead to default and diffusion, but there may exist more complex patterns which is difficult to be discovered. this task requires visualize the known risk guarantee pattern and able to explore other more complex risk guarantee patterns. t4: visual analytics for evolving guarantee network. like many other real networks, there are competitive decision making taking place in the guarantee network. understanding the network dynamic helps financial experts understand how the firms are connected together temporally. this task requires visualizing the guarantee network evolution based on history data. t5: visual analytics for default propagation path. before the crisis, forecasting the default diffusion path and monitoring the default spread status helps the government and bank take precautionary measures, conduct research, and take effective measures to prevent and dissolve risks, such that no regional or systemic financial risk occur. default risk predication the loan records reveal that the guarantee network and default rates are both increasing, and the network structures show strong correlation with the defaults. we construct feature vector consisting of hybrid information and employ supervised learning approach to train the prediction model. in what follows, we discuss the hybrid features used in our model. 
in order to build highly representative features which can reliably reflect the statistical relationships between customer information and repayment ability, we clean the data and construct the features as follows. basic profile: the essential company registration information, which reflects character, capital, collateral, capability, condition and stability [41] . we use business nature, registered capital, enterprise scale, employee number and others as a corporation's basic profile. most banks require a company to update its basic information when it makes a loan application, and we choose to use the latest information as the basic profile features of the loan. credit behavior: historical behavior, e.g. credit history, default records, default amount, total loan amount and loan count, total loan frequency (if any), and total default rates; they are calculated from all loan records before the active loan contract. active loan: the loan contract in its execution period; it contains active loan amount, active loan times, type of capital return and interest return, etc. network structure: network features such as centralities are extracted as ns. note that, as discussed above, the basic profile may not be completely trustworthy, as smes may provide out-of-date or even fake information to the bank. however, the guarantee network is trustworthy information, as the bank can build it from its own record systems. the prediction of default for a customer's loan guarantee can be modeled as a supervised learning problem. we use logistic regression based on gradient boosting trees [23] for the prediction. the tree ensemble model using k additive functions to predict the output can be represented as ŷ_i = Σ_k f_k(x_i) (eq. 1), where f_k is the k-th decision tree, x_i is the training feature and ŷ_i is the prediction result. finding the parameters of the tree model is turned into the problem of minimizing the objective function, and the model can be trained in an additive manner [16] : obj = Σ_i l(ŷ_i, y_i) + Σ_k Ω(f_k), where Σ_i l(ŷ_i, y_i) is a training loss function that measures the difference between the prediction and the target, and Ω(f) is a smoothing regularization term to avoid overfitting. specifically, we use a three-month window for training, observation, prediction and evaluation. as fig. 3 shows, the training stage covers all customers who obtained bank loans from 2013 q1 (first quarter). this sliding-window design is used for two reasons: 1. prediction shall be adapted to a dynamic setting with regularly updated forecasting results; in fact, using a sliding window is a typical way of rolling prediction, as commonly adopted in event prediction practices such as [65, 66] . 2. the business often runs on a quarterly basis; thus, from a business demand perspective, it would be helpful to know the borrowers who may default on a quarterly basis. default risk visualization. we design and implement a visual interface that enables viewing the network with multiple measurements. fig. 4 gives the interface, by which users can adjust the node size by the predicted default risk and by the following network centrality measurements: hub score and authority score, k-shell decomposition score, pagerank, eigenvector centrality score, betweenness centrality and closeness centrality. fig. 5 gives a partial visualization of a real guarantee network. in the graph, all defaulted enterprises are highlighted by red circles; node size is proportional to predicted risk (a), k-shell value (b), and authority score (c).
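a minimal sketch of this kind of gradient-boosted default classifier, using the xgboost library cited in the references as one possible implementation; the feature matrix, labels and hyperparameters below are placeholders rather than the ones used in the paper, and the real model would be fed the basic profile, credit behavior, active loan and network structure features on a rolling quarterly window.

import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# placeholder hybrid feature matrix (basic profile + credit behavior + active loan + network structure)
rng = np.random.default_rng(0)
X = rng.random((1000, 20))
y = rng.integers(0, 2, 1000)          # 1 = default observed in the evaluation window

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                      objective="binary:logistic")
model.fit(X_tr, y_tr)

risk = model.predict_proba(X_te)[:, 1]            # predicted default risk per enterprise
print("auc:", roc_auc_score(y_te, risk))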
through the interface, users can also observe the rolling prediction risk of an enterprise over month and highlight it on the whole network by choosing it on the heatmap. recognizing high default groups narrow down the risk guarantee relationship search scope and enable financial experts focus on firms with high-default crowd. usually, community detection divides the guarantee network into groups (communities) based on how the nodes are connected together. theoretically, community structure in graph is defined as the node set internally interacts with each other more frequently than with those outside it. identifying such sub-structures provides insight into understanding the structure of complex networks (both functions and topology affect each other) [57] . based on the conjecture that defaults occur in clusters, we first divide the whole network into several disjoint sets by community detection. fig. 6 (a) shows the results on a typical independent subgraph we constructed from the bank loan records. the communities are marked using separate color background and average default rates are labeled. there are 30 communities, but the default occurs on four of them with average 38% to 8.6% defaults rates, all other 9 communities have no default during the guarantee network existence. similar phenomenon are observed on random walks, edge betweenness, and spinglass community. in practice, we first use random walk algorithm [45, 49] to divide the whole guarantee networks into groups. we use a revised treemap interface to visualize the community detection results. the community label and default rates are displayed on the flat colored blocks.the treemap chart used for navigation here, thus the sum of area does not necessarily to be one. the larger blocks reveal the high default communities saliently. however, the evaluation of community detection is still an open question [35] , and the community detection algorithm only considers the link information and neglect node attribute information, the partition may not be consonant to the actual conditions. the basic rule for community detection is to minimize the number of links between communities and this uses pure network structure information. in financial practice, each node in the network comes with rich information such as enterprise sectors, changes in deposits, assets, loan amount, etc,. it would be unreliable discarding such attributes when dividing the network. by interaction, we enable the users to edit the communities into coherent ones by referring to relevant financial matric. we allow users to interactively perform the following manipulation actions. interactive community editing. we enable users to explore the financial information and interactively edit the communities by merging strong associated communities, reassign the community labels for the structural hole spanners, a key role in the information diffusion [11] or split a community into several disjoin smaller groups. the generated subgraph are noted as group of interest (goi), the high risk guarantee pattern are often hidden in the goi. reassign. the reassign operation allow to the change the community labels of the structure hole spanner. structure hole spanner is the bridge node which connect different communities in a network. fig. 7 is reproduced from [25] , and it illustrates a network with three communities and six structural hole spanners. 
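the random-walk community detection and the per-community default rates displayed in the treemap above can be sketched with the walktrap implementation in python-igraph; this is one possible choice and not necessarily the algorithm of [45, 49], and the toy edges and default set are hypothetical.

import igraph as ig

# hypothetical guarantee edges (treated as undirected for community detection) and default labels
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 3)]
defaulted = {2, 4, 5}

g = ig.Graph(edges=edges)
communities = g.community_walktrap(steps=4).as_clustering()

for cid, members in enumerate(communities):
    rate = sum(v in defaulted for v in members) / len(members)
    print(f"community {cid}: size={len(members)}, default rate={rate:.2f}")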
empirical study suggests that individuals would benefit from filling the "holes" (the nodes that fill them are called structural hole spanners) between people or groups that are otherwise disconnected [10] . a principled methodology to detect structural hole spanners in a given social network is still not clear [38] . in fact, we observed high default on structural hole spanners together with their neighbouring internal nodes. we enable users to investigate the financial metrics and reassign the community labels of the structural hole spanners. specifically, when the user wishes to merge two adjacent communities, he/she first double-clicks one block on the tree map, and all the other connected communities are highlighted. single-clicking the structural hole spanner node reassigns it to the opposite community. for example, when communities c 1 and c 2 are chosen, single-clicking node a merges both communities as c 1 , and vice versa. merge. neighbouring communities can be merged. as community detection divides a graph purely based on the links in the graph, the algorithm may generate too many communities, some of which share a common sector category or similar network structures. merging communities with reference to the financial metrics can produce medium-sized and more tractable subgraphs. specifically, when the user wishes to merge two adjacent communities, he first double-clicks one "tile" on the tree map, and all the other connected communities are highlighted. double-clicking the structural hole spanner node merges the two communities together, labeled as the clicked community. split. sometimes we need to split a community into several parts. this happens when the defaults are unevenly distributed. we can cut off the stable parts, and this may reduce the subsequent motif computation complexity. specifically, when the user wishes to split a community, he first double-clicks one "tile" on the tree map, and all the other connected communities are highlighted. double-clicking an edge splits the two opposite parts of the subgraph into two communities. financial information is useful. we use a financial radar chart to encode the key financial status under the tree map. specifically, the key indices include: defaults, the historic default behavior; la/rc, the ratio of loan amount to registered capital. it would be more insightful to use the ratio of loan amount to enterprise net assets; however, as the latter is not always available, we use registered capital instead. deposit loss, the percentage of deposit loss; the shortage of money and a rapid decrease of deposits should not be ignored. sector, the enterprise sector, which is also an important clue when editing communities. ga/rc, the ratio of guarantee amount to registered capital. as a loan guarantee is an obligation taken on for a borrower in case that borrower defaults, the ratio of guarantee amount to enterprise net assets is a crucial factor for financial systemic stability; similarly, we use registered capital instead. credit rating, the review rating of the bank expert, which is also a key clue when editing communities. usually, high default pattern discovery is not possible by observation, as a practical loan guarantee network may consist of several tens of thousands of nodes; nor is it possible via naive algorithms -naive subgraph mining from the network leads to the subgraph isomorphism problem, which is proved to be np-complete. we adopt a shneiderman mantra strategy to reduce the computation complexity. guarantee circle visualization.
the small and medium firms improve their borrowing capacity by a third party guarantor. empirical studies by bank risk control specialist suggest the guarantee circle is a source of default risk. the most frequently used guarantee circle patterns include mutual guarantee, joint liability guarantee, star shape guarantee loan,and revolving guarantee (see fig. 8 ). such interactions are legal in china currently. they can enhance the solvency level to some extent but may induce occurrence of risks and transmission of risk pointed by financial regulatory documents. often the specialists in the bank risk control department have only sql query capability to find relative simple guarantee pattern. in this work, we enable automatically guarantee circle detection and visualization -the common recognized risk loan patterns including mutual guarantee, co-guarantee, and revolving guarantee are highlighted on the network. fig. 9 gives an example of revolving loan guarantee detected from a real-world loan guarantee network. users are able to focus on the relevant firms and explore more details. besides, there are five firms default among the eleven firms in the three revolving structure, informing the banking experts to pay more attention on the firms involved in such patterns. new risk pattern discovery. as mentioned above, guarantee circles are relatively clearly understood by banking experts. however, they still can not quite understand does there exist more complicated guarantee patterns that may have implicit connections with high default phenomenon. we develop a visual analytics tool to help the experts discover and understand what have happened. the task is challenging: arbitrary guarantee pattern which has high default rate can be underneath the complex network structures. it is impossible to exhaustively compare all network patterns to determine whether it is in high default. based on the conjecture that defaults occur in clusters, we propose an interactive shneiderman mantra strategy [56] to narrow down the risk guarantee pattern searching space. fig. 2 gives the processing flow. because the goi are groups with high default rates, there may exist guarantee patterns which are prone to default. usually, the motifs are the most basic building blocks for a network and the number of structures are limited. motifs may reflect functional properties and provide a deep insight into the networks. a complex guarantee network is always connected by several smaller subgraphs bridged by the structural hole spanners. the sub-graphs inside the communities may reveal certain risk even fraud pattern. in this work, we obtain a set of motifs by first detecting motifs from the goi. the motifs are ranked by their default rates (eq. (4) ). among them, high default rate motifs are noted as pattern of interest (poi) and they may need be investigated by banking experts in priority. where m is a motif. all motifs are possible risk loan guarantee patterns. however, it is still computationally challenging to obtain all pois by the approach above for the following reasons. firstly, motif structures increases with the node number increase rapidly, for example, 4 node motif has over 3000 possibilities. it is impossible to enumerate all motif structures. secondly, motif matching is exhaustively searched from the query graph into the large network, and it is essences subgraph isomorphism problem. it still takes too much time for motifs with more nodes to be matched on the network. 
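computing the default rate of a candidate motif, in the spirit of eq. (4), can be sketched with networkx subgraph isomorphism; the mutual-guarantee motif and the toy network below are illustrative, and for larger motifs the exhaustive matching shown here runs into exactly the complexity problem discussed above.

import networkx as nx
from networkx.algorithms.isomorphism import DiGraphMatcher

def motif_default_rate(network: nx.DiGraph, motif: nx.DiGraph, defaulted: set) -> float:
    """share of defaulted firms among all firms covered by instances of the motif."""
    covered = set()
    for mapping in DiGraphMatcher(network, motif).subgraph_isomorphisms_iter():
        covered.update(mapping.keys())      # keys are nodes of the large network
    return len(covered & defaulted) / len(covered) if covered else 0.0

# toy guarantee network (edge a -> b: a guarantees b) and a mutual-guarantee motif
g = nx.DiGraph([("a", "b"), ("b", "a"), ("b", "c"), ("c", "d"), ("d", "b")])
mutual = nx.DiGraph([(0, 1), (1, 0)])
print(motif_default_rate(g, mutual, defaulted={"a"}))   # 0.5: one of the two matched firms defaulted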
in this work, we propose an interactive motif editing approach. users can further explore the financial information of adjacent nodes, add them to the motifs and generate pois. network evolution over time is observed from the guarantee network. the topology of the network keeps changing -some nodes are connected to the network or removed from it, and some communities are connected together through the guarantee of a structural hole spanner. like many other real networks, there is competitive decision making taking place in the guarantee network: when a firm lacks security to obtain a loan from the bank, it may resort to a guarantee corporation or third-party firms. to some extent, the new guarantors may improve the overall system rationality, but they may also introduce unstable factors as the network becomes even more complex. understanding the network dynamics helps financial experts understand how the firms are connected together over time. in this work, we use animation to visualize the evolution of the guarantee network. users can drag the time bar to backtrack how the network evolves over time. they are allowed to hover the mouse cursor over a node to view the company's financial information. this will help the financial experts understand what has happened historically. fig. 10 gives an example of how a real network evolves from july 2013 to april 2014. combining the enterprises' financial status at different times, financial experts are able to make their analysis. financial systemic risk is a top concern for the government and banks; however, as a new phenomenon, the understanding of the systemic risk of the loan guarantee network is still not sufficient. sophisticated guarantee relationships tend to cause credit granted by multiple lenders and excessive credit. in a loan guarantee, the guarantor has the debt obligation if the borrower defaults; if the guarantor cannot pay back the bank, it may resort to its own guarantors. in this case, the default may propagate like a virus. the default contagion increases the possibility of occurrence and transmission of risks. especially in an economic downturn, some enterprises face operating difficulties and the financial crisis will have a domino effect: the default phenomenon may spread rapidly in the network, and this will make a large number of enterprises fall into an unfavorable situation. the government and the banks always wish to monitor the default spread status and understand the complexity of the current risks, so that they can take precautionary measures, conduct research, and take effective measures to prevent and dissolve risks, to ensure that no regional or systemic financial risk occurs. based on the relevant knowledge and experience, we develop a visual analytics tool to aid default path discovery by visualization. a principle of the default diffusion can be described as: the vulnerable nodes are the guarantors. fig. 11 gives a diffusion path illustration. (a) is a guarantee network with eight nodes, where node e provides guarantees to five adjacent nodes and c, d provide guarantees to b and then to a; (b) is the possible diffusion path: the default of node a may lead to the default of b, c, d and even e. it is noted that nodes g, f, h are not connected with node e, as the default of e will not affect the repayment status of g, f, and h. in practice, there may be multiple possible propagation paths, as each node can serve as a guarantor or be guaranteed. it is difficult to outline the main propagation path from the entire network.
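a minimal sketch of enumerating potential propagation paths and counting how often each firm appears on them, which is the kind of computation described next; it assumes that an edge (g, b) means firm g guarantees a loan of firm b, so that a default of b can propagate to g and onward to g's own guarantors, and the toy network loosely follows the fig. 11 example.

import networkx as nx
from collections import Counter

# assumption: edge (g, b) means firm g guarantees a loan of firm b,
# so a default of b can propagate to g and then to g's own guarantors
G = nx.DiGraph([("e", "a"), ("e", "b"), ("c", "b"), ("d", "b"), ("b", "a")])

def propagation_paths(graph, start, path=None):
    """enumerate simple paths from a defaulting firm towards its direct and indirect guarantors."""
    path = path or [start]
    guarantors = [g for g, _ in graph.in_edges(path[-1]) if g not in path]
    if not guarantors:
        yield path
        return
    for g in guarantors:
        yield from propagation_paths(graph, start, path + [g])

counts = Counter(node for p in propagation_paths(G, "a") for node in p[1:])
print(counts.most_common())     # firms appearing on many paths are highlighted for monitoring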
we make the following assumption: the node on multiple propagation pathes is the key to prevent large scale default diffusion and thus should be highlighted. we compute all the propagation pathes, count occurrences and highlight the node on the network. we use the color to illustrate the propagation risk importance. we design the visual analytics tool which enables financial experts take into account of several factors on the judgment of defaults. the factors include the financial information of the corporation and guarantee contract amount information. the former information is plain listed when the user hovers the mouse pointer on the node, while a sankey diagram is used to represent the guarantee flow. the widths of the sankey diagram bands are directly proportional to the guarantee amount. fig. 12 (a) gives results on a real guarantee network, when we choose one node, for example, node 32, the whole potential propagation path is highlighted in (b), and (c) is the corresponding sankey diagram. it can be seen that upstream companies usually provides more guarantee than they received. for example, node 18 provides much more guarantee than it receives. the imbalance of guarantee amount and collateral amount provide clue for the credit line assessment. the real situation is even much more complex. the default may be diffused like a virus infection and the virion must identify and bind to its receptor (guarantor). as mentioned earlier, each enterprises has more than 3000 financial entries, it is difficult to quantify anti-infective ability for each enterprises. we enable users to look up multiple financial status and cut off the propagation path. we also note that the propagation model provides more insights to end users and we plan to perform in-depth study for the topic and provide simulation interface in the future. we collect loan records spanning ten years from a major commercial bank in china. the names of the customers in the records are encrypted and replaced by an id; we can access the basic profile like the enterprise scale, the loan information like the guarantee id and loan credit. we first introduce the loan process, and then explain how the information are extracted and cleaned. the banks need to collect as much fine-grained information as possible, concerning the repayment ability of the enterprise. the information falls into four categories: transaction information, customer information, asset information such as mortgage status, history loan approval bank side record, etc. the most relevant to the loan guarantees are eight data tables: customer profile, loan account information, repayment status, guarantee profile, customer credit, loan contract, guarantee relationship, guarantee contract, default status. there are often more than one guarantors for one loan transaction, and there may be several loan transactions for a single guarantor in a period. once the loan is approved, the smes usually can obtain the full size of loan immediately, and start to repay to the bank regularly by an installment plan until the end of the loan contract. in the record preprocess phase, by joining the nine tables, we obtain records related to the corporation id and loan contracts. we then construct the guarantee network and compute the network related measurements. we now report the observations derived from the data. overall statistics there are 11,000 loan customers, which span 60,948 mutual guarantee relationships derived from 36,618 loan contracts. 
there are 5,911 defaults during the past ten years, out of a total of 87,307 repayments. the overall default rate relative to the number of contracts is 6.77%. centrality indicators are helpful for identifying the relative importance of nodes in the network. fig. 13 gives histograms for several of the most complex subgraphs showing how the defaults are distributed over different centrality indicator values. it is noted that defaults happen more on nodes with large authority values and small hub values. this is consistent with intuition -an enterprise acting as a hub backs a large number of other corporations, and it is supposed to be relatively stable and to operate in good condition. in contrast, an enterprise acting as an authority accepts guarantees from many other corporations, which means it lacks funding security and has a higher risk of getting into trouble. the statistics indicate that the lender should watch the status of the high-"authority" nodes in the guarantee network. although the underlying assumption of pagerank is quite like the authority score, we did not observe a similar correlation between its values and default rates (see fig. 13 ). for the other centrality indicators, it is observed that the larger the centrality, the higher the default rate. the tasks were as follows: (1) visual analytics for high default groups; and (2) visual analytics for high default patterns. the first case study is to find high default groups. the random walk community detection algorithm divides the guarantee network into 36 communities. the statistics are given in table 1 . we edit the communities following basic guidelines: (1) consider default status, loan amount and other financial statistics comprehensively; (2) small communities can be either merged with their neighbouring large communities or pruned. for example, communities 35 and 34 both have 4 nodes and these firms never default; there is a low possibility that they will become high default groups in the future, while community 23 can be merged with its neighbouring communities. (3) structural hole spanner nodes should be paid special attention. usually, defaults happen on the structural hole spanners, and the adjacent communities can be merged. finally, we obtain ten communities, and seven of them have relatively high default rates, as shown in table 2 . these seven medium-sized groups of subgraphs can be efficiently processed for further tasks. it is noted that the merge and reassign operations are based on user expertise. as the user may choose various criteria, the final tree map can demonstrate different combinations and default rates. in this subsection, we explore high default patterns beyond guarantee circles. this includes (1) automatic motif detection from high default groups; specifically, we employ the gtriescanner (http://www.dcc.fc.up.pt/gtries/) approach. (2) matching the motifs with the entire network and calculating the ratio of default firms. (3) ranking the motifs in descending order of defaults; these are the high default patterns. (4) the user interactively edits the high default patterns by adding more nodes, and the system automatically matches the new subgraph with the entire network and produces the ratio of default firms. theoretically, there are 199 and 9,364 possible combinations for 4- and 5-vertex motifs [33] in a directed network, respectively. matching all those motifs on the whole network would be time-consuming. letting the user interactively edit motifs makes it more efficient to explore new patterns.
in practice, we choose to analyze community 3, which consists of 103 enterprises; 36% of them default, accounting for 85% of the loans from the bank, as table 2 shows. fig. 15 gives the twenty 4-vertex motifs the automatic algorithm detected from community 3, and table 3 shows the statistical information. although there are nearly 200 kinds of 4-vertex motif shapes, only 20 of them exist in the high default group. we thus perform the analysis on these 20 motifs instead of every shape. the detailed motif shapes are given in fig. 15 . most of them have rather complex structures; however, some of them are known to banking experts, for example, motif 6 is a joint liability loan. some others can be understood as combinations of smaller guarantee patterns; for example, motif 5 is a combination of a joint liability with a single guarantee. three of the motifs, motifs 15, 16, and 17, attracted our attention: (1) the patterns have high default rates (ranging from 61% to 90% in the ratio of default firms and 55% to 100% in the ratio of default amount); (2) a relatively small number of instances (4 or 5) are detected from the whole network; and (3) the top five risk motifs show single-input, single-output, feed-forward structures. fig. 16 gives all the pattern 15 instances detected from the entire network. some of the motif instances coincide with each other. these three patterns are interesting: for example, pattern 15 recurred five times in a group, and the bank lost all the money lent to the enterprises with such guarantee structures (see table 3 ). there is a high possibility that fraudulent loan guarantees happened several times and the local bank failed to recognize the fraud pattern. similar analysis implies that patterns 16 and 17 may also be guarantee patterns with high default. we then conduct interviews with two banking loan experts. the first one comes from the financial regulator. this expert has more than five years of guarantee network research experience and has published several important investigation reports and books on the status of chinese loan guarantee networks. the second one comes from the credit department of a major commercial bank and has ten years of loan approval experience. both experts were attracted by and understood the visualized guarantee relationships immediately. the first expert is rather interested in the community editing. he said that when they try to resolve the financial risks in a guarantee network, a major operation is to split the loan guarantee network into smaller ones with the risks isolated; in this case, healthy enterprises will not be affected by financially risky enterprises. the editing function of our tool provides them a powerful weapon to achieve this target. besides, the expert is also interested in the risk guarantee pattern discovery module, and he agrees on the significant value provided by the discovery of such risk patterns. there might exist illegal conveyance of benefits under the suggested high default patterns. the expert will also dive into the financial disclosures of the risky guarantee enterprises and examine whether fraudulent guarantees are happening. the second expert expressed that he had never grasped the whole set of interrelations between enterprises so clearly when assessing a loan. the expert claims the tree map gives an intuitive understanding of the guarantee groups. we present a visual analytics approach for loan guarantee network risk management in this paper. to the best of our knowledge, this is the first work using visual analysis approaches to address the guarantee network default risk issue.
we design and implement an interactive interface to analyze individual enterprise default risk, high default groups, patterns in the groups, network evolution and the default diffusion path. the analysis can help the government and banks monitor the default spread status and provides insight for taking precautionary measures to prevent and dissolve systemic financial risk. future work will include computational modeling of default diffusion and visual analytics for taking precautionary measures.
net gains
networks in finance
financial connections and systemic risk
using neural network rule extraction and decision tables for credit-risk evaluation
benchmarking state-of-the-art classification algorithms for credit scoring
debtrank: too central to fail? financial networks, the fed and systemic risk
network analysis in the social sciences
complex financial networks and systemic risk: a review
bubbles, financial crises, and systemic risk
structural holes and good ideas
secondhand brokerage: evidence on the importance of local structure for managers, bankers, and analysts
the making of a transnational capitalist class: corporate power in the twenty-first century. zed books
network opportunity
wirevis: visualization of categorical, time-varying data from financial transactions
non-parametric scan statistics for event detection and forecasting in heterogeneous social media graphs
xgboost: a scalable tree boosting system
social network, social trust and shared goals in organizational knowledge sharing. information & management
a comparison of neural networks and linear scoring models in the credit union environment
analysis of loan guarantees among the korean chaebol affiliates
social network sites: definition, history, and scholarship
rasch models: foundations, recent developments, and applications
the effect of credit scoring on small-business lending
greedy function approximation: a gradient boosting machine. annals of statistics
statistical classification methods in consumer credit scoring: a review
joint community and structural hole spanner detection via harmonic modularity
a k-nearest-neighbour classifier for assessing consumer credit risk. the statistician
hmrc, department for business innovation & skills. 2010 to 2015 government policy: business enterprise
credit scoring with a data mining approach based on support vector machines. expert systems with applications
a visualization approach for frauds detection in financial market
determinants of the guarantee circles: the case of chinese listed firms
determinants of the guarantee circles: the case of chinese listed firms
consumer credit-risk models via machine-learning algorithms
network motif detection: algorithms, parallel and cloud computing, and related tools
a spatial scan statistic
benchmark graphs for testing community detection algorithms
china faces default chain reaction as credit guarantees backfire
modelling dependence with copulas and applications to risk management
mining structural hole spanners through information diffusion in social networks
a rasch model for partial credit scoring
loan 'guarantee chains' in china prove flimsy
credit risk evaluation for loan guarantee chain in china
efficient anomaly detection in dynamic, attributed graphs: emerging phenomena and big data
fast subset scan for spatial pattern detection
detection of emerging space-time clusters
finding and evaluating community structure in networks
complex networks in the study of financial and social systems
the lifecycle and cascade of wechat social messaging groups
anomaly detection in dynamic networks: a survey
maps of random walks on complex networks reveal community structure
finvis: applied visual analytics for personal financial planning
clustering the changing nature of currency crises in emerging markets: an exploration with self-organising maps
sovereign debt monitor: a visual self-organizing maps approach
chance discovery with self-organizing maps: discovering imbalances in financial networks
anomaly detection in online social networks
the eyes have it: a task by data type taxonomy for information visualizations
general optimization technique for high-quality community detection in complex networks
scalable detection of anomalous patterns with connectivity constraints
penalized fast subset scanning
a credit scoring model for personal loans
relational learning via latent social dimensions
community detection and mining in social media
applying animation to the visual analysis of financial time-dependent data
using social network knowledge for detecting spider constructions in social security fraud
sales pipeline win propensity prediction: a regression approach
towards effective prioritizing water pipe replacement and rehabilitation
evaluation without ground truth in social media research
graph-structured sparse optimization for connected subgraph detection
figure 14: high default groups after interactive editing.
key: cord-028688-5uzl1jpu authors: li, peisen; wang, guoyin; hu, jun; li, yun title: multi-granularity complex network representation learning date: 2020-06-10 journal: rough sets doi: 10.1007/978-3-030-52705-1_18 sha: doc_id: 28688 cord_uid: 5uzl1jpu network representation learning aims to learn low-dimensional vectors for the nodes in a network while maintaining the inherent properties of the original information. existing algorithms focus on the single coarse-grained topology of nodes or on text information alone, which cannot describe complex information networks. however, node structure and attributes are interdependent and indecomposable. therefore, it is essential to learn the representation of a node based on both the topological structure and the additional node attributes.
in this paper, we propose a multi-granularity complex network representation learning model (mnrl), which integrates topological structure and additional information at the same time and projects the fused information into the same granularity semantic space, refining the complex network from fine to coarse. experiments show that our method can not only capture indecomposable multi-granularity information, but also retain various potential similarities of both topology and node attributes. it achieves effective results in the downstream tasks of node classification and link prediction on real-world datasets. complex networks describe the relationships between entities and carry various kinds of information in the real world; they have become an indispensable form of existence in, for example, medical systems, judicial networks, social networks and financial networks. mining knowledge in networks has drawn continuous attention in both academia and industry. how to accurately analyze and make decisions on these problems and tasks from different information networks is a vital research question. e.g., in the field of sociology, interactive social platforms such as weibo, wechat, facebook and twitter create a large number of social networks that include relationships between users and a sharply increasing amount of interactive review text. studies have shown that these large, sparse new social networks at different levels of cognition present the same small-world nature and community structure as the real world. analysis based on these interactive information networks [1] , such as the prediction of criminal associations and sensitive groups, can then be applied directly to the real world. network representation learning is an effective analysis method for the recognition and representation of complex networks at different granularity levels: it maps high-dimensional, sparse data to a low-dimensional, dense vector space while preserving the inherent properties, so that vector-based machine learning techniques can then handle tasks in different fields [2, 3] , for example, link prediction [4] , community discovery [5] , node classification [6] , recommendation systems [7] , etc. in recent years, various advanced network representation learning methods based on topological structure have been proposed, such as deepwalk [8] , node2vec [9] and line [10] , which have become classical algorithms for representation learning of complex networks and solve the problem of retaining the local topological structure. a series of deep learning-based network representation methods were then proposed to further solve the problems of global topological structure preservation and high-order nonlinearity of the data, and to increase efficiency, e.g., sdne [13] , gcn [14] and dane [12] . however, existing research has focused on a coarser level of granularity, that is, a single topological structure, without comprehensive consideration of granular information such as behaviors, attributes and features. the result is not interpretable, which makes many decision-making systems unusable. in addition, the structure of the entity itself and its attributes or behavioral characteristics in a network are indecomposable [18] . therefore, analyzing a single granularity of information alone will lose a lot of potential information.
1 , the anti-reconnaissance of criminal suspects leads to a sparse network than common social networks. the undiscovered edge does not really mean two nodes are not related like p2 and p3 or (p1 and p2), but in case detection, additional information of the suspect needs to be considered. the two without an explicit relationship were involved in the same criminal activity at a certain place (l1), they may have some potential connection. the suspect p4 and p7 are related by the attribute a4, the topology without attribute cannot recognize why the relation between them is generated. so these location attributes and activity information are inherently indecomposable and interdependence with the suspect, making the two nodes recognize at a finer granularity based on the additional information and relationship structure that the low-dimensional representation vectors learned have certain similarities. we can directly predict the hidden relationship between the two suspects based on these potential similarities. therefore, it is necessary to consider the network topology and additional information of nodes. the cognitive learning mode of information network is exactly in line with the multi-granularity thinking mechanism of human intelligence problem solving, data is taken as knowledge expressed in the lowest granularity level of a multiple granularity space, while knowledge as the abstraction of data in coarse granularity levels [15] . multi-granularity cognitive computing fuses data at different granularity levels to acquire knowledge [16] . similarly, network representation learning can represent data into lower-dimensional granularity levels and preserve underlying properties and knowledge. to summarize, complex network representation learning faces the following challenges: information complementarity: the node topology and attributes are essentially two different types of granular information, and the integration of these granular information to enrich the semantic information of the network is a new perspective. but how to deal with the complementarity of its multiple levels and represent it in the same space is an arduous task. in complex networks, the similarity between entities depends not only on the topology structure, but also on the attribute information attached to the nodes. they are indecomposable and highly non-linear, so how to represent potential proximity is still worth studying. in order to address the above challenges, this paper proposes a multigranularity complex network learning representation method (mnrl) based on the idea of multi-granularity cognitive computing. network representation learning can be traced back to the traditional graph embedding, which is regarded as a process of data from high-dimensional to lowdimensional. the main methods include principal component analysis (pca) [19] and multidimensional scaling (mds) [21] . all these methods can be understood as using an n × k matrix to represent the original n × m matrix, where k m. later, some researchers proposed isomap and lle to maintain the overall structure of the nonlinear manifold [20] . in general, these methods have shown good performance on small networks. however, the time complexity is extremely high, which makes them unable to work on large-scale networks. another popular class of dimensionality reduction techniques uses the spectral characteristics (e.g. feature vectors) of a matrix that can be derived from a graph to embed the nodes. 
laplacian eigenmaps [22] obtain low-dimensional vector representations of each node in the feature vector representation graph associated with its k smallest non-trivial feature values. recently, deepwalk was inspired by word2vec [24] , a certain node was selected as the starting point, and the sequence of the nodes was obtained by random walk. then the obtained sequence was regarded as a sentence and input to the word2vec model to learn the low-dimensional representation vector. deep-walk can obtain the local context information of the nodes in the graph through random walks, so the learned representation vector reflects the local structure of the point in the network [8] . the more neighboring points that two nodes share in the network, the shorter the distance between the corresponding two vectors. node2vec uses biased random walks to make a choose between breadthfirst (bfs) and depth-first (dfs) graph search, resulting in a higher quality and more informative node representation than deepwalk, which is more widely used in network representation learning. line [10] proposes first-order and secondorder approximations for network representation learning from a new perspective. harp [25] obtains a vector representation of the original network through graph coarsening aggregation and node hierarchy propagation. recently, graph convolutional network (gcn) [14] significantly improves the performance of network topological structure analysis, which aggregates each node and its neighbors in the network through a convolutional layer, and outputs the weighted average of the aggregation results instead of the original node's representation. through the continuous stacking of convolutional layers, nodes can aggregate high-order neighbor information well. however, when the convolutional layers are superimposed to a certain number, the new features learned will be over-smoothed, which will damage the network representation performance. multi-gs [23] combines the concept of multi-granularity cognitive computing, divides the network structure according to people's cognitive habits, and then uses gcn to convolve different particle layers to obtain low-dimensional feature vector representations. sdne [13] directly inputs the network adjacency matrix to the autoencoder [26] to solve the problem of preserving highly nonlinear first-order and second-order similarity. the above network representation learning methods use only network structure information to learn low-dimensional node vectors. but nodes and edges in real-world networks are often associated with additional information, and these features are called attributes. for example, in social networking sites such as weibo, text content posted by users (nodes) is available. therefore, the node representation in the network also needs to learn from the rich content of node attributes and edge attributes. tadw studies the case where nodes are associated with text features. the author of tadw first proved that deepwalk essentially decomposes the transition probability matrix into two low-dimensional matrices. inspired by this result, tadw low-dimensionally represents the text feature matrix and node features through a matrix decomposition process [27] . cene treats text content as a special type of node and uses node-node structure and node-content association for node representation [28] . more recently, dane [12] and can [34] uses deep learning methods [11] to preserve poten-tially non-linear node topology and node attribute information. 
these two kinds of information provide different views for each node, but their heterogeneity is not considered. anrl optimizes the network structure and attribute information separately, and uses the skip-gram model to skillfully handle the heterogeneity of the two different types of information [29] . nevertheless, the consistent and complementary information in the topology and attributes is lost and the sensitivity to noise is increased, resulting in a lower robustness. to process different types of information, wang put forward the concepts of "from coarse to fine cognition" and "fine to coarse" fusion learning in the study of multi-granularity cognitive machine learning [30] . people usually do cognition at a coarser level first, for example, when we meet a person, we first recognize who the person is from the face, then refine the features to see the freckles on the face. while computers obtain semantic information that humans understand by fusing fine-grained data to coarse-grained levels. refining the granularity of complex networks and the integration between different granular layers is still an area worthy of deepening research [17, 31] . inspired by this, divides complex networks into different levels of granularity: single node and attribute data are microstructures, meso-structures are role similarity and community similarity, global network characteristics are extremely macro-structured. the larger the granularity, the wider the range of data covered, the smaller the granularity, the narrower the data covered. our model learns the semantic information that humans can understand at above mentioned levels from the finest-grained attribute information and topological structure, finally saves it into low-dimensional vectors. let g = (v, e, a) be a complex network, where v represents the set of n nodes and e represents the set of edges, and a represents the set of attributes. in detail, a ∈ n×m is a matrix that encodes all node additional attributes information, and a i ∈ a describes the attributes associated with node represents an edge between v i and v j . we formally define the multi-granularity network representation learning as follows: , we represent each node v i and attribute a i as a low-dimensional vector y i by learning a functionf g : |v | and y i not only retains the topology of the nodes but also the node attribute information. definition 2. given network g = (v, e, a). semantic similarity indicates that two nodes have similar attributes and neighbor structure, and the lowdimensional vector obtained by the network representation learning maintains the same similarity with the original network. e.g., if v i ∼ v j through the mapping function f g to get the low-dimensional vectors y i = f g (v i ), y j = f g (v j ), y i and y j are still similar, y i ∼ y j . complex networks are composed of node and attribute granules (elementary granules), which can no longer be decomposed. learning these grains to get different levels of semantic information includes topological structure (micro), role acquaintance (meso) and global structure (macro). the complete low-dimensional representation of a complex network is the aggregation of these granular layers of information. 
in order to solve the problems mentioned above, inspired by multi-granularity cognitive computing, we propose a multi-granularity network representation learning method (mnrl), which refines the complex network representation learning from the topology level to the node's attribute characteristics and various attachments. the model not only fuses finer granular information but also preserves the node topology, which enriches the semantic information of the relational network to solve the problem of the indecomposable and interdependence of information. the algorithm framework is shown in fig. 2 . firstly, the topology and additional information are fused through the function h, then the variational encoder is used to learn network representation from fine to coarse. the output of the embedded layer are low-dimensional vectors, which combines the attribute information and the network topology. to better characterize multiple granularity complex networks and solve the problem of nodes with potential associations that cannot be processed through the relationship structure alone, we refine the granularity to additional attributes, and designed an information fusion method, which are defined as follows: where n (v i ) is the neighbors of node v i in the network, a i is the attributes associated with node v i . w ij > 0 for weighted networks and w ij = 1 for unweighted networks. d(v j ) is the degree of node v j . x i contains potential information of multiple granularity information, both the neighbor attribute information and the node itself. to capture complementarity of different granularity hierarchies and avoid the effects of various noises, our model in fig. 1 is a variational auto-encoder, which is a powerful unsupervised deep model for feature learning. it has been widely used for multi-granularity cognitive computing applications. in multi-granularity complex networks, auto-encoders fuse different granularity data to a unified granularity space from fine to coarse. the variational auto-encoder contains three layers, namely, the input layer, the hidden layer, and the output layer, which are defined as follows: here, k is the number of layers for the encoder and decoder. σ (·) represents the possible activation functions such as relu, sigmod or tanh. w k and b k are the transformation matrix and bias vector in the k-th layer, respectively. y k i is the unified vector representation that learning from model, which obeys the distribution function e, reducing the influence of noise. e ∼ (0, 1) is the standard normal distribution in this paper. in order to make the learned representation as similar as possible to the given distribution,it need to minimize the following loss function: to reduce potential information loss of original network, our goal is to minimize the following auto-encoder loss function: wherex i is the reconstruction output of decoder and x i incorporates prior knowledge into the model. to formulate the homogeneous network structure information, skip-gram model has been widely adopted in recent works and in the field of heterogeneous network research, skip-grams suitable for different types of nodes processing have also been proposed [32] . in our model, the context of a node is the low-dimensional potential information. 
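the fusion function and the variational encoder are described above only in prose because the display equations did not survive extraction. the following numpy sketch encodes one plausible reading of both pieces: the fusion h as "own attributes plus degree-normalized neighbour attributes", and the embedding as the reparameterised draw y = mu + sigma * e with e ~ n(0, 1) together with its kl penalty. the function names, the single-layer encoder, and the toy graph are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_attributes(A, W):
    # one plausible reading of the fusion function h(.):
    # x_i = a_i + sum_{j in N(i)} (w_ij / d(v_j)) * a_j,
    # i.e. a node keeps its own attributes and adds the degree-normalized
    # attributes of its neighbours. W is the n x n (weighted) adjacency
    # matrix, A is the n x m attribute matrix.
    deg = W.sum(axis=0)
    deg = np.where(deg == 0, 1.0, deg)   # guard isolated nodes
    return A + (W / deg) @ A             # column j of W is scaled by 1 / d(v_j)

def variational_embedding(X, d):
    # a toy single-layer variational encoder: mean and log-variance heads,
    # then the reparameterisation y = mu + sigma * e with e ~ N(0, 1),
    # plus the kl term that pulls the embedding toward the standard normal prior.
    n, m = X.shape
    W_mu  = rng.normal(scale=0.1, size=(m, d))
    W_sig = rng.normal(scale=0.1, size=(m, d))
    mu, log_var = X @ W_mu, X @ W_sig
    y = mu + np.exp(0.5 * log_var) * rng.standard_normal((n, d))
    kl = -0.5 * np.mean(1 + log_var - mu**2 - np.exp(log_var))
    return y, kl

W = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # toy 3-node path graph
A = rng.random((3, 5))                                        # toy 5-dim node attributes
X = fuse_attributes(A, W)
y, kl = variational_embedding(X, d=2)
print(X.shape, y.shape, round(float(kl), 4))
```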
given the node v i and the associated reconstruction information y i , we randomly walk c ∈ c by maximizing the loss function: where b is the size of the generation window and the conditional probability p (v i+j |y i ) is defined as the softmax function: in the above formula, v i is the node context representation of node v i , and y i is the result produced by the auto-encoder. directly optimizing eq. (6) is computationally expensive, which requires the summation over the entire set of nodes when computing the conditional probability of p (v i+j |y i ). we adopt the negative sampling approach proposed in metapath2vec++ that samples multiple negative samples according to some noisy distributions: where σ(·) = 1/(1 + exp(·)) is the sigmoid function and s is the number of negative samples. we set p n (v) ∝ d 3 4 v as suggested in wode2vec, where d v is the degree of node v i [24, 32] . through the above methods, the node's attribute information and the heterogeneity of the node's global structure are processed and the potential semantic similarity kept in a unified granularity space. multi-granularity complex network representation learning through the fusion of multiple kinds of granularity information, learning the basic granules through an autoencoder, and representing different levels of granularity in a unified low-dimensional vector solves the potential semantic similarity between nodes without direct edges. the model simultaneously optimizes the objective function of each module to make the final result robust and effective. the function is shown below: in detail, l re is the auto-encoder loss function of eq. (4), l kl has been stated in formula (3), and l hs is the loss function of the skip-gram model in eq. (5) . α, β, ψ, γ are the hyper parameters to balance each module. l v ae is the parameter optimization function, the formula is as follows: where w k ,ŵ k are weight matrices for encoder and decoder respectively in the kth layer, and b k ,b k are bias matrix. the complete objective function is expressed as follows: mnrl preserves multiple types of granular information include node attributes, local network structure and global network structure information in a unified framework. the model solves the problems of highly nonlinearity and complementarity of various granularity information, and retained the underlying semantics of topology and additional information at the same time. finally, we optimize the object function l in eq. (10) through stochastic gradient descent. to ensure the robustness and validity of the results, we iteratively optimize all components at the same time until the model converges. the learning algorithm is summarized in algorithm 1. algorithm1. the model of mnrl input: graph g = (v, e, a), window size b, times of walk p, walk length u, hyperparameter α, β, ψ, γ, embedding size d. output: node representations y k ∈ d . 1: generate node context starting p times with random walks with length u at each node. 2: multiple granularity information fusion for each node by function h (·) 3: initialize all parameters 4: while not converged do 5: sample a mini-batch of nodes with its context 6: compute the gradient of ∇l 7: update auto-encoder and skip-gram module parameters 8: end while 9: save representations y = y k datasets: in our experiments, we employ four benchmark datasets: facebook 1 , cora, citeseer and pubmed 2 . 
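the negative-sampling objective referenced above can be made concrete with a short numpy sketch. it is a minimal illustration, assuming one (node, context) pair, s negatives drawn proportionally to d_v^(3/4), and randomly initialised vectors; in the full model this term would be combined with the reconstruction, kl, and regularisation losses into the joint objective described in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(y_i, v_context, v_negatives):
    # skip-gram with negative sampling for one (node, context) pair:
    # maximise log sigma(v_c . y_i) + sum_neg log sigma(-v_n . y_i),
    # written here as a loss to be minimised.
    pos = -np.log(sigmoid(v_context @ y_i))
    neg = -np.sum(np.log(sigmoid(-(v_negatives @ y_i))))
    return pos + neg

def noise_distribution(degrees):
    # negatives sampled proportionally to d_v^(3/4), as in word2vec-style models
    p = np.asarray(degrees, dtype=float) ** 0.75
    return p / p.sum()

# toy numbers: an embedding y_i from the encoder, one context vector, s = 5 negatives
d, s = 8, 5
y_i = rng.normal(size=d)
V = rng.normal(size=(8, d))                      # context representations of 8 nodes
P_n = noise_distribution([10, 3, 1, 7, 2, 2, 5, 1])
neg_ids = rng.choice(len(P_n), size=s, p=P_n)
print(round(float(negative_sampling_loss(y_i, V[0], V[neg_ids])), 4))
```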
these datasets contain edge relations and various attribute information, which can verify that the social relations of nodes and individual attributes have strong dependence and indecomposability, and jointly determine the properties of entities in the social environment. the first three datasets are paper citation networks, and these datasets are consist of bibliography publication data. the edge represents that each paper may cite or be cited by other papers. the publications are classified into one of the following six classes: agents, ai, db, ir, ml, hci in citeseer and one of the three classes (i.e., "diabetes mellitus experimental", "diabetes mellitus type 1", "diabetes mellitus type 2") in pubmed. the cora dataset consists of machine learning papers which are classified into seven classes. facebook dataset is a typical social network. nodes represent users and edges represent friendship relations. we summarize the statistics of these benchmark datasets in table 1 . to evaluate the performance of our proposed mnrl, we compare it with 9 baseline methods, which can be divided into two groups. the former category of baselines leverage network structure information only and ignore the node attributes contains deepwalk, node2vec, grarep [33] , line and sdne. the other methods try to preserve node attribute and network structure proximity, which are competitive competitors. we consider tadw, gae, vgae, dane as our compared algorithms. for all baselines, we used the implementation released by the original authors. the parameters for baselines are tuned to be optimal. for deepwalk and node2vec, we set the window size as 10, the walk length as 80, the number of walks as 10. for grarep, the maximum transition step is set to 5. for line, we concatenate the first-order and second-order result together as the final embedding result. for the rest baseline methods, their parameters are set following the original papers. at last, the dimension of the node representation is set as 128. for mnrl, the number of layers and dimensions for each dataset are shown in table 2 . table 2 . detailed network layer structure information. citeseer 3703-1500-500-128-500-1500-3703 pubmed 500-200-128-200-500 cora 1433-500-128-500-1433 facebook 1238-500-128-500-1238 to show the performance of our proposed mnrl, we conduct node classification on the learned node representations. specifically, we employ svm as the classifier. to make a comprehensive evaluation, we randomly select 10%, 30%, 50% nodes as the training set and the rest as the testing set respectively. with these randomly chosen training sets, we use five-fold cross validation to train the classifier and then evaluate the classifier on the testing sets. to measure the classification result, we employ micro-f1 (mi-f1) and macro-f1 (ma-f1) as metrics. the classification results are shown in table 3 , 4, 5 respectively. from these four tables, we can find that our proposed mnrl achieves significant improvement compared with plain network embedding approaches, and beats other attributed network embedding approaches in most situations. experimental results show that the representation results of each comparison algorithm perform well in node classification in downstream tasks. in general, a model that considers node attribute information and node structure information performs better than structure alone. from these three tables, we can find that our proposed mnrl achieves significant improvement compared with single granularity network embedding approaches. 
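the node-classification protocol described above (svm classifier, random 10%/30%/50% training splits, micro- and macro-f1) can be sketched with scikit-learn as follows; the choice of LinearSVC, stratified splits, and five repeats per fraction are assumptions standing in for details the text leaves unspecified.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def evaluate_node_classification(embeddings, labels,
                                 train_fractions=(0.1, 0.3, 0.5),
                                 n_repeats=5, seed=0):
    # embeddings: (n_nodes, 128) matrix produced by any of the compared methods;
    # labels: (n_nodes,) class ids. for each training fraction, average the
    # micro-/macro-f1 over several random splits, mirroring the protocol above.
    results = {}
    for frac in train_fractions:
        mi, ma = [], []
        for rep in range(n_repeats):
            X_tr, X_te, y_tr, y_te = train_test_split(
                embeddings, labels, train_size=frac,
                random_state=seed + rep, stratify=labels)
            pred = LinearSVC().fit(X_tr, y_tr).predict(X_te)
            mi.append(f1_score(y_te, pred, average="micro"))
            ma.append(f1_score(y_te, pred, average="macro"))
        results[frac] = (float(np.mean(mi)), float(np.mean(ma)))
    return results
```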
for joint representation, our model performs more effectively than most similar types of algorithms, especially in the case of sparse data, because our model input is the fusion information of multiple nodes with extra information. when comparing dane, our experiments did not improve significantly but it achieved the expected results. dane uses two auto-encoders to learn and express the network structure and attribute information separately, since the increase of parameters makes the optimal selection in the learning process, the performance will be better with the increase of training data, but the demand for computing resources will also increase and the interpretability of the algorithm is weak. while mnrl uses a variational auto-encoder to learn the structure and attribute information at the same time, the interdependence of information is preserved, which handles heterogeneous information well and reduces the impact of noise. in this subsection, we evaluate the ability of node representations in reconstructing the network structure via link prediction, aiming at predicting if there exists an edge between two nodes, is a typical task in networks analysis. following other model works do, to evaluate the performance of our model, we randomly holds out 50% existing links as positive instances and sample an equal number of non-existing links. then, we use the residual network to train the embedding models. specifically, we rank both positive and negative instances according to the cosine similarity function. to judge the ranking quality, we employ the auc to evaluate the ranking list and a higher value indicates a better performance. we perform link prediction task on cora datasets and the results is shown in fig. 3 . compared with traditional algorithms that representation learning from a single granular structure information, the algorithms that both on structure and attribute information is more effective. tadw performs well, but the method based on matrix factorization has the disadvantage of high complexity in large networks. gae and vgae perform better in this experiment and are suitable for large networks. mnrl refines the input and retains potential semantic information. link prediction relies on additional information, so it performs better than other algorithms in this experiment. in this paper, we propose a multi-granularity complex network representation learning model (mnrl), which integrates topology structure and additional information, and presents these fused information learning into the same granularity semantic space that through fine-to-coarse to refine the complex network. the effectiveness has been verified by extensive experiments, shows that the relation of nodes and additional attributes are indecomposable and complementarity, which together jointly determine the properties of entities in the network. in practice, it will have a good application prospect in large information network. although the model saves a lot of calculation cost and well represents complex networks of various granularity, it needs to set different parameters in different application scenarios, which is troublesome and needs to be optimized in the future. the multi-granularity complex network representation learning also needs to consider the dynamic network and adapt to the changes of network nodes, so as to realize the real-time information network analysis. 
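the link-prediction evaluation used in this section (rank held-out edges against an equal number of sampled non-edges by cosine similarity of the node embeddings, then report auc) could be implemented along the lines below; the helper is illustrative, not the authors' code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def link_prediction_auc(embeddings, positive_edges, negative_edges):
    # score every candidate pair by the cosine similarity of its two node
    # embeddings, then compute the auc of ranking held-out (positive) edges
    # above the sampled non-edges.
    Z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    def score(pairs):
        pairs = np.asarray(pairs)
        return np.sum(Z[pairs[:, 0]] * Z[pairs[:, 1]], axis=1)

    y_true = np.concatenate([np.ones(len(positive_edges)),
                             np.zeros(len(negative_edges))])
    y_score = np.concatenate([score(positive_edges), score(negative_edges)])
    return roc_auc_score(y_true, y_score)
```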
social structure and network analysis network representation learning: a survey virtual network embedding: a survey the link-prediction problem for social networks community discovery using nonnegative matrix factorization node classification in social networks recommender systems deepwalk: online learning of social representations node2vec: scalable feature learning for networks line: large-scale information network embedding deep learning deep attributed network embedding structural deep network embedding semi-supervised classification with graph convolutional networks dgcc: data-driven granular cognitive computing granular computing data mining, rough sets and granular computing structural deep embedding for hypernetworks principal component analysis the isomap algorithm and topological stability laplacian eigenmaps for dimensionality reduction and data representation network representation learning based on multi-granularity structure word2vec explained: deriving mikolov et al'.s negativesampling word-embedding method harp: hierarchical representation learning for networks sparse autoencoder network representation learning with rich text information a general framework for content-enhanced network representation learning anrl: attributed network representation learning via deep neural networks granular computing with multiple granular layers for brain big data processing an approach for attribute reduction and rule generation based on rough set theory metapath2vec: scalable representation learning for heterogeneous networks grarep: learning graph representations with global structural information co-embedding attributed networks key: cord-017423-cxua1o5t authors: wang, rui; jin, yongsheng; li, feng title: a review of microblogging marketing based on the complex network theory date: 2011-11-12 journal: 2011 international conference in electrics, communication and automatic control proceedings doi: 10.1007/978-1-4419-8849-2_134 sha: doc_id: 17423 cord_uid: cxua1o5t microblogging marketing which is based on the online social network with both small-world and scale-free properties can be explained by the complex network theory. through systematically looking back at the complex network theory in different development stages, this chapter reviews literature from the microblogging marketing angle, then, extracts the analytical method and operational guide of microblogging marketing, finds the differences between microblog and other social network, and points out what the complex network theory cannot explain. in short, it provides a theoretical basis to effectively analyze microblogging marketing by the complex network theory. as a newly emerging marketing model, microblogging marketing has drawn the domestic academic interests in the recent years, but the relevant papers are scattered and inconvenient for a deep research. on the microblog, every id can be seen as a node, and the connection between the different nodes can be seen as an edge. these nodes, edges, and relationships inside form the social network on microblog which belongs to a typical complex network category. therefore, reviewing the literature from the microblogging marketing angle by the complex network theory can provide a systematic idea to the microblogging marketing research. in short, it provides a theoretical basis to effectively analyze microblogging marketing by the complex network theory. the start of the complex network theory dates from the birth of small-world and scale-free network model. 
these two models provide the network analysis tools and information dissemination interpretation to the microblogging marketing. "six degrees of separation" found by stanley milgram and other empirical studies show that the real network has a network structure of high clustering coefficient and short average path length [1] . watts and strogatz creatively built the smallworld network model with this network structure (short for ws model), reflecting human interpersonal circle focus on acquaintances to form the high clustering coefficient, but little exchange with strangers to form the short average path length [2] . every id in microblog has strong ties with acquaintance and weak ties with strangers, which matches the ws model, but individuals can have a large numbers of weak ties in the internet so that the online microblog has diversity with the real network. barabàsi and albert built a model by growth mechanism and preferential connection mechanism to reflect that the real network has degree distribution following the exponential distribution and power-law. because power-law has no degree distribution of the characteristic scale, this model is called the scale-free network model (short for ba model) [3] . exponential distribution exposes that most nodes have low degree and weak impact while a few nodes have high degree and strong impact, confirming "matthew effect" in sociology and satisfying the microblog structure that celebrities have much greater influence than grassroots, which the small-world model cannot describe. in brief, the complex network theory pioneered by the small-world and scalefree network model overcomes the constraints of the network size and structure of regular network and random network, describes the basic structural features of high clustering coefficient, short average path length, power-law degree distribution, and scale-free characteristics. the existing literature analyzing microblogging marketing by the complex network theory is less, which is worth further study. the complex network theory had been evoluted from the small-world scale-free model to some major models such as the epidemic model and game model. the diffusion behavior study on these evolutionary complex network models is valuable and can reveal the spread of microblogging marketing concept in depth. epidemic model divides the crowd into three basic types: susceptible (s), infected (i), and removed (r), and build models according to the relationship among different types during the disease spread in order to analyze the disease transmission rate, infection level, and infection threshold to control the disease. typical epidemic models are the sir model and the sis model. differences lie in that the infected (i) in the sir model becomes the removed (r) after recovery, so the sir model is used for immunizable diseases while the infected (i) in the sis model has no immunity and only becomes the susceptible (s) after recovery. therefore, the sis model is used for unimmunizable diseases. these two models developed other epidemic model: sir model changes to sirs model when the removed (r) has been the susceptible (s); sis model changes to si model presenting the disease outbreaks in a short time when the infected (i) is incurable. epidemic model can be widely seen in the complex network, such as the dissemination of computer virus [4] , information [5] , knowledge [6] . guimerà et al. finds the hierarchical and community structure in the social network [7] . 
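a quick way to see the two structural benchmarks discussed above is to generate them and measure the quantities the review leans on: clustering coefficient and average path length for the ws model, and the heavy-tailed degree distribution for the ba model. a minimal networkx sketch, with illustrative parameter choices, is:

```python
import networkx as nx
import numpy as np

# small-world (ws) model: high clustering coefficient, short average path length
ws = nx.connected_watts_strogatz_graph(n=1000, k=10, p=0.1, seed=42)
print("ws clustering:", round(nx.average_clustering(ws), 3))
print("ws average path length:", round(nx.average_shortest_path_length(ws), 2))

# scale-free (ba) model: growth + preferential connection gives a heavy-tailed
# degree distribution, i.e. a few highly connected hubs and many low-degree nodes
ba = nx.barabasi_albert_graph(n=1000, m=5, seed=42)
deg = np.array([d for _, d in ba.degree()])
print("ba max degree:", deg.max(), "| ba median degree:", int(np.median(deg)))
```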
due to the hierarchical structure, barthélemy et al. indicate that the disease outbreak follows hierarchical dissemination from the large-node-degree group to the small-node-degree group [8] . due to the community structure, liu et al. indicate that the community structure has a lower threshold and a greater steady-state density of infection, and is in favor of the infection [9] ; fu finds that the real interpersonal social network has a positive correlation of the node degree distribution, but the online interpersonal social network has a negative one [10] . the former expresses that circles can be formed among celebrities to the exclusion of grassroots, while the latter expresses that contacts can be formed between celebrities and grassroots on the microblog. game theory combined with the complex network theory can explain interpersonal microlevel interactions such as tweet release, reply, and retweet, because it can analyze the complex dynamic processes between individuals through the game learning model, the dynamic evolutionary game model, the local interaction model, etc. (1) game learning model: individuals make the best decision by learning from others in the network. learning is a critical point for decision-making and game behavior, and equilibrium is the long-term outcome of irrational individuals seeking optimal results [11] . bala and goyal draw the "neighbor effect" showing the optimal decision-making process based on the historical information from individuals and neighbors [12] . (2) dynamic evolutionary game model: the formation of the social network can be seen as a dynamic outcome of the strategic choice between edge-breaking and edge-connecting based on the individual evolutionary game [13] . fu et al. add reputation to the dynamic evolutionary game model and find that individuals are more inclined to cooperate with reputable individuals, forming a stable reputation-based network [14] . (3) local interaction model: a local network information dissemination model based on the strong interactivity within a local community is more practical for community microblogging marketing. li et al. restrict the preferential connection mechanism to a local world and propose the local-world evolutionary network model [15] . burke et al. construct a local interaction model and find that individual behavior presents the coexistence of local consistency and global decentrality [16] . generally speaking, the microblog has the characteristics of small-world and scale-free structure, high clustering coefficient, short average path length, hierarchical structure, community structure, and node degree distributions of positive and negative correlation. on one hand, the epidemic model offers viral marketing principles to microblogging marketing: the sirs model can be used for a long-term brand strategy and the si model for a short-term promotional activity; on the other hand, the game model tells microblogging marketing how to find opinion leaders in different social circles and develop strategies for a specific community, realizing the neighbor effect and local learning to form globally coordinated microblog interaction. rationally making use of these characteristics allows effective strategies and solutions to be preset for microblogging marketing. the complex network theory is applied to biological, technological, economic, management, social, and many other fields by domestic scholars. zhou hui proves that the spread of sars rumors has typical small-world network features [17] .
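the compartmental logic behind the sir/sis/sirs/si family referenced throughout this passage can be illustrated with the classic well-mixed mean-field equations; this is a sketch of the homogeneous-mixing case, not of the network models above, and the rates are arbitrary.

```python
import numpy as np
from scipy.integrate import odeint

def sir(y, t, beta, gamma):
    # classic well-mixed sir compartments: susceptible -> infected -> removed.
    # setting gamma = 0 gives si-like outbreak dynamics; returning recovered
    # individuals to s instead of r gives the sis variant discussed in the text.
    s, i, r = y
    ds = -beta * s * i
    di = beta * s * i - gamma * i
    dr = gamma * i
    return [ds, di, dr]

t = np.linspace(0, 60, 600)
beta, gamma = 0.5, 0.2            # infection and recovery rates (illustrative)
sol = odeint(sir, [0.99, 0.01, 0.0], t, args=(beta, gamma))
print("peak infected fraction:", round(sol[:, 1].max(), 3),
      "| beta/gamma =", beta / gamma)
```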
duan wenqi studies new products synergy diffusion in the internet economy by the complex network theory to promote innovation diffusion [18] . wanyangsong (2007) analyzes the dynamic network of banking crisis spread and proposes the interbank network immunization and optimization strategy [19] . although papers explaining microblogging marketing by the complex network theory have not been found, these studies have provided the heuristic method, such as the study about the online community. based on fu's study on xiao nei sns network [10] , hu haibo et al. carry out a case study on ruo lin sns network and conclude that the online interpersonal social network not only has almost the same network characteristics as the real interpersonal social network, but also has a negative correlation of the node degree distribution while the real interpersonal social network has positive. this is because the online interpersonal social network is more easier for strangers to establish relationships so that that small influence people can reach the big influence people and make weak ties in plenty through breaking the limited range of real world [20] . these studies can be used to effectively develop marketing strategies and control the scope and effectiveness of microblogging marketing. there will be a great potential to research on the emerging microblog network platform by the complex network theory. the complex network theory describes micro and macro models analyzing the marketing process to microblogging marketing. the complex network characteristics of the small-world, scale-free, high clustering coefficient, short average path length, hierarchical structure, community structure, node degree distribution of positive and negative correlation and its application in various industries provide theoretical and practical methods to conduct and implement microblogging marketing. the basic research idea is: extract the network topology of microblog by the complex network theory; then, analyze the marketing processes and dissemination mechanism by the epidemic model, game model, or other models while taking into account the impact of macro and micro factors; finally, find out measures for improving or limiting the marketing effect in order to promote the beneficial activities and control the impedimental activities for enterprizes' microblogging marketing. because the macro and micro complexity and uncertainty of online interpersonal social network, the previous static and dynamic marketing theory cannot give a reasonable explanation. based on the strong ties and weak ties that lie in individuals of the complex network, goldenberg et al. find: (1) after the external short-term promotion activity, strong ties and weak ties turn into the main force driving product diffusion; (2) strong ties have strong local impact and weak transmission ability, while weak ties have strong transmission ability and weak local impact [21] . therefore, the strong local impact of strong ties and strong transmission ability of weak ties are required to be rationally used for microblogging marketing. through system simulation and data mining, the complex network theory can provide explanation framework and mathematical tools to microblogging marketing as an operational guide. microblogging marketing is based on online interpersonal social network, having difference with the nonpersonal social network and real interpersonal social network. 
therefore, the corresponding study results cannot be simply mixed if involved with human factors. pastor-satorras et al. propose the target immunization solution to give protection priority to larger degree node according to sis scale-free network model [22] . this suggests the importance of cooperation with the large influential ids as opinion leaders in microblogging marketing. remarkably, the large influential ids are usually considered as large followers' ids on the microblog platform that can be seen from the microblog database. the trouble is, as scarce resources, the large influential ids have a higher cooperative cost, but the large followers' ids are not all large influential ids due to the online public relations behaviors such as follower purchasing and watering. this problem is more complicated than simply the epidemic model. the complex network theory can be applied in behavior dynamics, risk control, organizational behavior, financial markets, information management, etc.. microblogging marketing can learn the analytical method and operational guide from these applications, but the complex network theory cannot solve all the problems of microblogging marketing, mainly: 1. the complexity and diversity of microblogging marketing process cannot completely be explained by the complex network theory. unlike the natural life-like virus, individuals on microblog are bounded rational, therefore, the decisionmaking processes are impacted by not only the neighbor effect and external environment but also by individuals' own values, social experience, and other subjective factors. this creates a unique automatic filtering mechanism of microblogging information dissemination: information recipients reply and retweet the tweet or establish and cancel contact only dependent on their interests, leading to the complexity and diversity. therefore, interaction-worthy topics are needed in microblogging marketing, and the effective followers' number and not the total followers' number of id is valuable. this cannot be seen in disease infection. 2. there are differences in network characteristics between microblog network and the real interpersonal social network. on one hand, the interpersonal social network is different from the natural social network in six points: (1) social network has smaller network diameter and average path length; (2) social network has higher clustering coefficient than the same-scale er random network; (3) the degree distribution of social network has scale-free feature and follows power-law; (4) interpersonal social network has positive correlation of node degree distribution but natural social network has negative; (5) local clustering coefficient of the given node has negative correlation of the node degree in social network; (6) social network often has clear community structure [23] . therefore, the results of the natural social network are not all fit for the interpersonal social network. on the other hand, as the online interpersonal social network, microblog has negative correlation of the node degree distribution which is opposite to the real interpersonal social network. this means the results of the real interpersonal social network are not all fit for microblogging marketing. 3. there is still a conversion process from information dissemination to sales achievement in microblogging marketing. 
information dissemination on microblog can be explained by the complex network models such as the epidemic model, but the conversion process from information dissemination to sales achievement cannot be simply explained by the complex network theory, due to not only individual's external environment and neighborhood effect, but also consumer's psychology and willingness, payment capacity and convenience, etc.. according to the operational experience, conversion rate, retention rates, residence time, marketing topic design, target group selection, staged operation program, and other factors are needed to be analyzed by other theories. above all, microblogging marketing which attracts the booming social attention cannot be analyzed by regular research theories. however, the complex network theory can provide the analytical method and operational guide to microblogging marketing. it is believed that microblogging marketing on the complex network theory has a good study potential and prospect from both theoretical and practical point of view. the small world problem collective dynamics of 'small-world' networks emergence of scaling in random networks how viruses spread among computers and people information exchange and the robustness of organizational networks network structure and the diffusion of knowledge team assembly mechanisms determine collaboration network structure and team performance romualdo pastor-satorras, alessandro vespignani: velocity and hierarchical spread of epidemic outbreaks in scale-free networks epidemic spreading in community networks social dilemmas in an online social network: the structure and evolution of cooperation the theory of learning in games learning from neighbors a strategic model of social and economic networks reputation-based partner choice promotes cooperation in social networks a local-world evolving network model the emergence of local norms in networks research of the small-world character during rumor's propagation study on coordinated diffusion of new products in internet market doctoral dissertation of shanghai jiaotong university structural analysis of large online social network talk of the network: a complex systems look at the underlying process of word-of-mouth immunization of complex networks meeting strangers and friends of friends: how random are socially generated networks key: cord-024742-hc443akd authors: liu, quan-hui; xiong, xinyue; zhang, qian; perra, nicola title: epidemic spreading on time-varying multiplex networks date: 2018-12-03 journal: nan doi: 10.1103/physreve.98.062303 sha: doc_id: 24742 cord_uid: hc443akd social interactions are stratified in multiple contexts and are subject to complex temporal dynamics. the systematic study of these two features of social systems has started only very recently, mainly thanks to the development of multiplex and time-varying networks. however, these two advancements have progressed almost in parallel with very little overlap. thus, the interplay between multiplexity and the temporal nature of connectivity patterns is poorly understood. here, we aim to tackle this limitation by introducing a time-varying model of multiplex networks. we are interested in characterizing how these two properties affect contagion processes. to this end, we study susceptible-infected-susceptible epidemic models unfolding at comparable timescale with respect to the evolution of the multiplex network. 
we study both analytically and numerically the epidemic threshold as a function of the multiplexity and the features of each layer. we found that higher values of multiplexity significantly reduce the epidemic threshold especially when the temporal activation patterns of nodes present on multiple layers are positively correlated. furthermore, when the average connectivity across layers is very different, the contagion dynamics is driven by the features of the more densely connected layer. here, the epidemic threshold is equivalent to that of a single layered graph and the impact of the disease, in the layer driving the contagion, is independent of the multiplexity. however, this is not the case in the other layers where the spreading dynamics is sharply influenced by it. the results presented provide another step towards the characterization of the properties of real networks and their effects on contagion phenomena. social interactions take place in different contexts and modes of communication. on a daily basis we interact at work, in the family, and across a wide range of online platforms or tools, e.g., facebook, twitter, emails, mobile phones, etc. in the language of modern network science, social networks can be conveniently modeled and described as multilayer networks [1] [2] [3] [4] [5] . this is not a new idea. indeed, the intuition that social interactions are stratified in different layers dates back several decades [6] [7] [8] . however, the digitalization of our communications and the miniaturization of devices has just recently provided the data necessary to observe, at scale, and characterize the multilayer nature of social interactions. as in the study of single layered networks, the research on multilayer graphs is divided in two interconnected areas. the first deals with the characterization of the structural properties of such entities [1, 4] . one of the central observations is that the complex topology describing each type of interaction (i.e., each layer) might be different. indeed, the set and intensity of interactions in different contexts (e.g., work, family, etc.) or platforms (e.g., facebook, twitter, etc.) is not the same. nevertheless, layers are coupled by individuals active across two or more of them. the presence of such coupling as well as its * n.perra@greenwich.ac.uk degree is often referred to as multiplexity. another interesting feature of multilayer graphs is that the connectivity patterns in different layers might be topologically and temporally correlated [9] [10] [11] . the second area of research instead considers the function, such as sustaining diffusion or contagion processes, of multilayer networks [1, 12, 13] . a large fraction of this research aims at characterizing how the complex structural properties of multilayer graphs affect dynamical processes unfolding on their fabric. the first important observation is that disentangling connections in different layers gives rise to complex and highly nontrivial dynamics function of the interplay between interlayer and intralayer connections [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] . a complete summary of the main results in the literature is beyond the scope of the paper. we refer the interested reader to these recent resources for details [1] [2] [3] [4] [5] . 
despite the incredible growth of this area of network science over the last years, one particular aspect of multilayer networks is still largely unexplored: the interplay between multiplexity and the temporal nature of the connectivity patterns especially when dynamical processes unfolding on their fabric are concerned [13] . this should not come as a surprise. indeed, the systematic study of the temporal dynamics even in single layered graphs is very recent. in fact, the literature has been mostly focused on time-integrated properties of networks [27, 28] . as result, complex temporal dynamics acting at shorter timescales have been traditionally discarded. however, the recent technological advances in data storing and collection are providing unprecedented means to probe also the temporal dimension of real systems. the access to this feature is allowing to discover properties of social acts invisible in time aggregated data sets, and is helping characterize the microscopic mechanisms driving their dynamics at all timescales [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] . the advances in this arena are allowing to investigate the effects such temporal dynamics have on dynamical processes unfolding on time-varying networks. the study of the propagation of infectious diseases, ideas, rumors, or memes, etc., on temporal graphs shows a rich and nontrivial phenomenology radically different than what is observed on their static or annealed counterparts [29, . before going any further, it is important to notice how in their more general form, multilayer networks might be characterized by different types of nodes in each layer. for example, modern transportation systems in cities can be characterized as a multilayer network in which each layer captures a different transportation mode (tube, bus, public bikes, etc.) and the links between layers connect stations (nodes) where people can switch mode of transport [4, 12] . a particular version of multilayer networks, called multiplex, is typically used in social networks. here, the entities in each layer are of the same type (i.e., people). the interlayer links are drawn only to connect the same person in different layers. in this context, we introduce a model of time-varying multiplex networks. we aim to characterize the effects of temporal connectivity patterns and multiplexity on contagion processes. we model the intralayer evolution of connections using the activity-driven framework [29] . in this model of time-varying networks, nodes are assigned with an activity describing their propensity to engage in social interactions per unit time [29] . once active, a node selects a partner to interact. several selection mechanisms have been proposed, capturing different features of real social networks [30, 31, [62] [63] [64] . the simplest, that will be used here, is memoryless and random [29] . in these settings, active nodes do not have a preference towards individuals previously connected. despite such simplification, the mechanism allows capturing the heterogeneity of time-integrated connectivity patterns observed in real networks while guaranteeing mathematical tractability [29] . the multiplexity or coupling between layers is modulated by a probability p. if p = 1 all nodes are present in all layers. if p = 0, the multiplex is formed by m disconnected graphs. we consider p as a parameter and explore different regime of coupling between layers. furthermore, each layer is characterized by an activity distribution. 
we consider different scenarios in which the activity of coupling nodes, which are present in different layers (regulated by p), is uncorrelated as well as others in which it is instead positively or negatively correlated. in these settings, we study the unfolding of susceptible-infected-susceptible (sis) epidemic processes [65] [66] [67] . we derive analytically the epidemic threshold for two layers for any p and any distributions of activities. in the limit of p = 1 we find analytically the epidemic threshold for any number of layers. interestingly, the threshold is a not trivial generalization of the correspondent quantity in the monoplex (single layer network). in the general case 0 < p < 1 we found that the threshold is a decreasing function of p. positive correlations of coupling nodes push the threshold to smaller values with respect to the uncorrelated and negatively correlated cases. furthermore, when the average connectivity of two layers is very different, the critical behavior of the system is driven by the more densely connected layer. in such a scenario the epidemic threshold is not affected by the multiplexity, its value is equivalent to the case of a monoplex, and the coupling affects only the layer featuring the smaller average connectivity. the paper is organized as follows. in sec. ii we introduce the multiplex model. in sec. iii we study first both analytically and numerically the spreading of sis processes. finally, in sec. iv we discuss our conclusions. we first introduce the multiplex model. for simplicity of discussion, we consider the case in which the system is characterized by m = 2 layers a and b. however, the same approach can be used to create a multiplex with any number of layers. let us define n as the number of nodes in each layer. in general, we have three different categories of nodes: n a , n b , and n o . they describe, respectively, the number of nodes that are present only in layer a, b, or in both. the last category is defined by a parameter p: the coupling between layers (multiplexity). thus, on average, we have n a = n b = (1 − p)n and n o = pn. as mentioned in the introduction, the temporal dynamics in each layer is defined by the activity-driven framework [29] . thus, each noncoupling node is characterized by an activity extracted from a distribution f a (a) or f b (a) which captures its propensity to be engaged in a social interaction per unit time. observations in real networks show that the activity typically follows a heavy-tailed distribution [29] [30] [31] 41, 62, 68] . here, we assume that activities follow power laws, thus, f x (a) = c x a −γ x with x = [a, b] and a 1 to avoid divergences. coupling nodes instead are characterized by a joint activity distribution h(a a , a b ). as mentioned in the introduction, real multiplex networks are characterized by correlations across layers. in particular, the study of a wide range of real systems shows a complex and case dependent phenomenology in which the topological features (i.e., static connectivity patterns) of coupling nodes can be either positively or negatively correlated [9] . furthermore, researchers found evidence of positive temporal correlations between the activation patterns across layers [10, 11] . to account for such observations and explore their effects on spreading processes, we consider three simple prototypical cases in which the activities of coupling nodes in the two layers are (i) uncorrelated, or (ii) positively and (iii) negatively correlated. 
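a small numpy sketch of how the activities and the coupling-node correlations could be generated follows. the truncation of the power law to [eps, 1], the value of eps, the exponents, and p are illustrative assumptions (the text only specifies f_x(a) = c_x a^(-gamma_x) with a lower cutoff to avoid divergence); the rank-matching step anticipates the recipe spelled out in the next paragraph.

```python
import numpy as np

rng = np.random.default_rng(2)

def powerlaw_activities(n, gamma, eps=1e-3):
    # inverse-cdf sampling from f(a) ~ a^(-gamma) truncated to [eps, 1];
    # the cutoff eps and the exponent values below are illustrative assumptions
    u = rng.random(n)
    e = 1.0 - gamma
    return (eps**e + u * (1.0 - eps**e)) ** (1.0 / e)

N, p = 10_000, 0.4
gamma_a, gamma_b = 2.1, 2.1
n_coupled = int(p * N)              # nodes present in both layers
n_single  = N - n_coupled           # nodes present only in one layer

act_a_only = powerlaw_activities(n_single, gamma_a)    # layer-a-only nodes
act_b_only = powerlaw_activities(n_single, gamma_b)    # layer-b-only nodes
act_o_a    = powerlaw_activities(n_coupled, gamma_a)   # coupling nodes, activity in a
act_o_b    = powerlaw_activities(n_coupled, gamma_b)   # coupling nodes, activity in b

# coupling-node correlations via rank matching (the recipe is spelled out in the
# text that follows): sort both lists for the positively correlated case,
# reverse one of them for the negatively correlated case, shuffle for uncorrelated
act_o_a_sorted = np.sort(act_o_a)[::-1]
act_o_b_pos    = np.sort(act_o_b)[::-1]
act_o_b_neg    = np.sort(act_o_b)
print(round(float(np.corrcoef(act_o_a_sorted, act_o_b_pos)[0, 1]), 3),
      round(float(np.corrcoef(act_o_a_sorted, act_o_b_neg)[0, 1]), 3))
```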
to simplify the formulation and to avoid adding other parameters, in case of positive and negative correlations we adopt the following steps. we first extract n o activities from the two distributions f x (a). then, we order them. in the case of positive correlation, a node that has the rth activity in a will be assigned to the correspondent activity in b. in other words, the first node will be assigned to the highest activity extracted from f a (a) and the highest value extracted from f b (a). the second will be assigned to the second highest activity extracted from both distributions, etc. in the case of negative correlations instead, a node that has the rth activity in a will be assigned the (pn − r + 1)th in b. in other words, the first node will be assigned to the highest activity in a and to the lowest activity in b. the second node will be assigned to the second highest activity in a and to the second lowest in b, etc. in these settings, the temporal evolution of the multiplex network is defined as follows. for each realization, we randomly select pn nodes as coupling nodes between layers. at each time step t, note the following: (i) each node is active with a probability defined by its activity. (ii) each active node creates m x links with randomly selected nodes. multiple connections with the same node in the same layer within each time step are not allowed. (iii) coupling nodes can be active and create connections in both layers. (iv) at time step t + t all connections are deleted and the process restarts from the first point. all connections have the same duration of t. in the following, we set, without lack of generality, t = 1. at each time the topology within each layer is characterized, mostly, by a set of disconnected stars of size m x + 1. thus, at the minimal temporal resolution each network looks very different than the static or annealed graphs we are used to seeing in the literature [69] . however, it is possible to show that, integrating links over t time steps in the limit in which t n, the resulting network has a degree distribution that follows the activity [29, 31, 70] . in other words, the heterogeneities in the activity distribution translate in heterogenous timeaggregated connectivity patterns typically of real networks. thus, as observed in real temporal networks the topological features at different timescales are very different than the late (or time-integrated) characteristics [38] . at each time step the average degree in each layer can be computed as where e x t is the number of links generated in each layer at each time step. furthermore, a x = da f x (a)a and a x o = da a da b h(a a , a b )a x are the average activity of noncoupling and coupling nodes in each layer, respectively. similarly, the total average degree (often called overlapping degree [71] ), at each time step, is thus, the average connectivity, at each time step, is determined by the number of links created in each layer, and by the interplay between the average activity of coupling and noncoupling nodes. as shown in fig. 1 (a), eq. (2) describes quite well the behavior of the average overlapping degree which is an increasing function of the multiplexity p. indeed, the larger the fraction of coupling nodes, the larger the connectivity of such nodes across layers. as we will see in the next section, this feature affects significantly the unfolding on contagion processes. in fig. 1 (b) we show the integrated degree distribution of the overlapping degree for different p. 
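the two average-degree expressions referred to above as eqs. (1) and (2) did not survive extraction. a reconstruction consistent with the definitions given around them (per-time-step links e_t^x, average activities of non-coupling and coupling nodes, and the (1-p)n / pn split) would read as below; the normalisation of the overlapping degree by the per-layer size n is an assumption.

```latex
% hedged reconstruction of eqs. (1)-(2), written from the surrounding definitions
% rather than copied from the original article
\begin{equation}
  \langle k^x_t \rangle \;=\; \frac{2\,E^x_t}{N}
  \;=\; 2\,m_x\!\left[(1-p)\,\langle a\rangle_x + p\,\langle a\rangle^{o}_x\right],
  \qquad x \in \{A,B\},
  \tag{1}
\end{equation}

\begin{equation}
  \langle k_t \rangle \;=\; \frac{2\left(E^A_t + E^B_t\right)}{N}
  \;=\; \sum_{x \in \{A,B\}} 2\,m_x\!\left[(1-p)\,\langle a\rangle_x + p\,\langle a\rangle^{o}_x\right].
  \tag{2}
\end{equation}
```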
the plot clearly shows how the functional form is defined by the activity distributions of the two layers which in this case are equal. an increase in the fraction of coupling nodes does not change the distribution of the overlapping degree; it introduces a vertical shift which, however, is more visible for certain values of k. in order to understand how the interplay between multiplexity and temporal connectivity patterns affects dynamical processes, we consider sis contagion phenomena spreading on the multiplex model introduced in the previous section. in this prototypical epidemic model each node can be in one of two compartments. healthy nodes are susceptible to the disease and thus in the compartment s. infectious nodes instead join the compartment i . the natural history of the disease is defined as follows. a susceptible, in contact with an infected node, might get sick and infectious with probability λ. each infected node spontaneously recovers with rate μ, thus staying infectious for μ −1 time steps, on average. one crucial feature of epidemic models is the threshold which determines the conditions above which a disease is able to affect a macroscopic fraction of the population [65] [66] [67] . in case of sis models, below the threshold the disease dies out, reaching the so called disease-free equilibrium. above threshold instead, the epidemic reaches an endemic stationary state. this can be captured running the simulations for longer times and thus estimating the fraction of infected nodes for t → ∞: i ∞ . in general, in a multiplex network, such fraction might be different across layers. thus, we can define i x ∞ . to characterize the threshold we could study the behavior of such fraction(s) as function of λ/μ. indeed, the final number of infected nodes acts as order parameter in a second order phase transition, thus defining the critical conditions for the spreading [66] . however, due to the stochastic nature of the process, the numerical estimation of the endemic state, especially in proximity of the threshold, is not easy. thus, we adopt another method measuring the lifetime of the process l [72] . this quantity is defined as the average time the disease needs either to die out or to infect a macroscopic fraction y of the population. the lifetime acts as the susceptibility in phase transitions thus allows a more precise numerical estimation [72] . in the case of single layer activity-driven networks, in which partners of interactions are chosen at random and without memory of past connections, the threshold can be written as (see ref. [29] for details) thus, the conditions necessary for the spread of the disease are set by the interplay between the features of the disease (left side) and the dynamical properties of the time-varying networks where the contagion unfolds (right side). the latter are regulated by first and second moments of the activity distribution and by the number of connections created by each active node (i.e., m). it is important to notice that eq. (3) considers the case in which the timescale describing the evolution of the connectivity patterns and the epidemic process is comparable. the contagion process is unfolding on a time-varying network. in the case when links are integrated over time and the sis process spreads on a static or annealed version of the graph, the epidemic threshold will be much smaller [29, 73, 74] . this is due to the concurrency of connections which favors the spreading. 
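to make the lifetime estimator concrete, the sketch below runs a discrete-time sis process on a single activity-driven layer and returns the time needed either to die out or to reach a fraction y of infected nodes, and compares the observed behaviour with the analytical threshold quoted later in the text, lambda/mu > 1/(m(<a> + sqrt(<a^2>))). it is only a minimal, self-contained illustration: the single-layer case is used instead of the full multiplex, the update order within a time step is a simplification, and the function names and default parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def sis_lifetime(activities, m, lam, mu, y=0.3, seed_frac=0.01, t_max=20_000):
    """lifetime of a discrete-time sis process on one activity-driven layer:
    the time needed either to die out or to infect a fraction y of the nodes."""
    n = len(activities)
    infected = np.zeros(n, dtype=bool)
    infected[rng.choice(n, size=max(1, int(seed_frac * n)), replace=False)] = True
    for t in range(1, t_max + 1):
        active = np.nonzero(rng.random(n) < activities)[0]   # contacts of this step
        new_inf = []
        for i in active:
            partners = rng.choice(n - 1, size=m, replace=False)
            partners = np.where(partners >= i, partners + 1, partners)
            for j in partners:
                # transmission can occur in either direction along a contact
                if infected[i] and not infected[j] and rng.random() < lam:
                    new_inf.append(j)
                elif infected[j] and not infected[i] and rng.random() < lam:
                    new_inf.append(i)
        infected[rng.random(n) < mu] = False                 # spontaneous recovery
        infected[new_inf] = True
        frac = infected.mean()
        if frac == 0.0 or frac >= y:
            return t
    return t_max

def analytical_threshold(activities, m, mu):
    # single-layer activity-driven threshold as stated in the text:
    # lambda_c / mu = 1 / (m * (<a> + sqrt(<a^2>)))
    a1, a2 = activities.mean(), (activities ** 2).mean()
    return mu / (m * (a1 + np.sqrt(a2)))
```

scanning lam around the value returned by analytical_threshold and averaging sis_lifetime over independent runs reproduces, on a small scale, the peaked behaviour of the lifetime that the paper uses to locate the threshold numerically.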
in fact, by aggregating the connections over time, the degree of each node increases thus facilitating the unfolding of the disease. in this limit of timescale separation between the dynamics of and on networks, the evolution of the connectivity patterns is considered either much slower (static case) or much faster (annealed case) with respect to the epidemic process. in the following, we will only consider the case of comparable timescales. this is the regime of time-varying networks which is extremely relevant for a variety of spreading processes ranging from sexual transmitted diseases and influenzalike illnesses to rumors and information propagation [28] . what is the threshold in the case of our multiplex and timevarying network model? in the limit p = 0 the number of coupling nodes is zero. the two layers are disconnected, thus, the system is characterized by two independent thresholds regulated by the activity distributions of the two layers. the most interesting question is then what happens for p > 0. to find an answer to this conundrum, let us define i , respectively, as the number of noncoupling nodes of activity a and as the number of coupling nodes of activity a a and a b . the implicit assumption we are making by dividing nodes according to their activities is that of statistical equivalence within activity classes [66, 75] . in these settings, we can write the variation of the number of infected noncoupling nodes as function of time as where we omitted the dependence of time. the first term on the right-hand side considers nodes recovering, thus leaving the infectious compartment. the second and third terms account for the activation of susceptibles in activity class a (s x a = n x a − i x a ) that select infected nodes (noncoupling and coupling) as partners and get infected. the last two terms instead consider the opposite: infected nodes activate, select as partners noncoupling and coupling nodes in the activity class a infecting them as a result. similarly, we can write the expression for the variation of coupling nodes of activity classes a a and a b as the general structure of the equation is similar to the one we wrote above. the main difference is, however, that coupling nodes can be infected and can infect in both layers. the first term in the right-hand side accounts for the recovery process. the next four (two for each element in the sum in y) consider the activation of susceptible nodes that select as partners both noncoupling and coupling infected nodes and get infected. the last four terms account for the reverse process. in order to compute the epidemic threshold we need to define four auxiliary functions, thus defining a closed system of differential equations. in particular, we define x = da i x a a and o x = da a da b i o a a ,a b a x . for simplicity, we will skip the detailed derivation here (see the appendix for the details). by manipulating the previous three differential equations we can obtain four more, one for each auxiliary function. the condition for the spreading of the disease can be obtained by studying the spectral properties of the jacobian matrix of such system of seven differential equations. in particular, if the largest eigenvalue of the jacobian matrix is larger than zero, the system of equations will not be stable and, consequently, the number of infected nodes will increase. thus, the epidemic threshold can be obtained by studying the conditions for which this holds. as sanity check, let us consider first the limit p = 0. 
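the spectral criterion just described can be implemented generically: given the jacobian of the closed linearized mean-field system evaluated at the disease-free state, the critical transmission probability is the value of lambda at which its largest eigenvalue crosses zero. the sketch below performs this search by bisection and, as a check, applies it to the simpler single-layer closure (on the total infected density and its activity-weighted counterpart), recovering the single-layer threshold of eq. (3) that the p = 0 limit discussed next should reproduce layer by layer. the seven-equation jacobian of the two-layer model is derived in the appendix and is not reproduced here; the demo closure, function names, and parameter values are assumptions made for illustration.

```python
import numpy as np

def critical_lambda(jacobian, mu, lam_lo=1e-6, lam_hi=1.0, tol=1e-8):
    """value of lambda at which the largest eigenvalue of the linearized
    mean-field jacobian crosses zero (bisection); `jacobian(lam, mu)` must
    return the jacobian evaluated at the disease-free state."""
    def max_eig(lam):
        return np.linalg.eigvals(jacobian(lam, mu)).real.max()
    while lam_hi - lam_lo > tol:
        mid = 0.5 * (lam_lo + lam_hi)
        if max_eig(mid) > 0.0:
            lam_hi = mid          # above threshold: shrink from above
        else:
            lam_lo = mid
    return 0.5 * (lam_lo + lam_hi)

def single_layer_jacobian_factory(activities, m):
    # standard single-layer closure on I = \int I_a da and theta = \int a I_a da
    a1, a2 = activities.mean(), (activities ** 2).mean()
    def jac(lam, mu):
        return np.array([[-mu + lam * m * a1, lam * m],
                         [lam * m * a2,       -mu + lam * m * a1]])
    return jac

acts = np.random.default_rng(3).uniform(1e-3, 1.0, size=10_000)  # any activity sample
mu, m = 0.015, 1
lam_c = critical_lambda(single_layer_jacobian_factory(acts, m), mu)
# should agree with mu / (m * (<a> + sqrt(<a^2>)))
print(lam_c, mu / (m * (acts.mean() + np.sqrt((acts ** 2).mean()))))
```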
in this case, each layer acts independently and we expect the threshold of each to follow eq. (3). this is exactly what we find. in particular, the system of equations can be de-coupled in two different subsets (one per layer) which are governed by two jacobian matrices whose largest eigenvalues are where a n x = da f x (a)a n . thus, the spreading process will be able to affect a finite fraction of the total population in case either of these two eigenvalues is larger than zero, which implies λ μ > (m x a x + m x a 2 x ) −1 as expected. it is important to notice that in case of a multiplex network the disease might be able to spread in one layer but not in the other. however, in case the condition for the spreading is respected in both layers, they will experience the disease. let us consider the opposite limit: p = 1. as described in details in the appendix, the condition for the spreading of the disease reads as where a n interestingly, the threshold is a function of the first and second moments of the activity distributions of the coupling nodes which are modulated by the number of links each active node creates, plus a term which encodes the correlation of the activities of such nodes in the two layers. before showing the numerical simulations to validate the mathematical formulation, an important observation is in order. in this limit, effectively, we could think the multiplex as a multigraph: a single layer network with two types of edges. in case the joint probability distribution of activity is h(a a , a b ) = f (a a )δ(a b − a a ), thus two activities are exactly the same, and m a = m b the threshold reduces to eq. (3) (valid for a single layer network) in which the number of links created by active nodes is 2m. however, for a general form of the joint distribution and in case of different number of links created by each active node in different layers, this correspondence breaks down. in all the following simulations, we set n = 10 5 , = 10 −3 , μ = 0.015, y = 0.3, start the epidemic process from a 1% of nodes selected randomly as initial seeds, and show the averages of 10 2 independent simulations. in fig. 2 we show the first results considering a simple scenario in which m a = m b = 1 and the exponents for the distributions of activities are the same γ x = 2.1. the first observation is that in all three cases the analytical solutions (vertical dotted lines) agree with the results from simulations. the second observation is that in case of positive correlation between the activities of nodes in two layers, the threshold is significantly smaller than in the other two cases. this is not surprising as the nodes sustaining the spreading in both layers are the same. thus, effectively, active nodes are capable to infect the double number of other nodes. as we mentioned above, many real multiplexes are characterized by different types of positive correlations. when thinking at real outbreaks, the effect of such feature on the spreading process suggests quite a worrying scenario. however, it is important to remember that real multiplexes are sparse, thus characterized by values of multiplexity which are far from the limit p = 1 [9] . as we will see below, this aspect plays a crucial role in the more realistic cases of 0 < p < 1. the thresholds of the uncorrelated and negatively correlated cases are very similar. 
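the closed-form p = 1 threshold of eq. (7) is not reproduced in the extracted text, so the snippet below does not compute it; it only illustrates how the cross-layer term that distinguishes the three coupling schemes, presumably a cross moment such as <a_A a_B>, is ordered under the rank-matched, reverse-matched, and random pairings. the sampling routine, seed, and sample size are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

def powerlaw(n, gamma=2.1, eps=1e-3):
    u = rng.random(n)
    lo, hi = eps ** (1 - gamma), 1.0
    return (lo + u * (hi - lo)) ** (1 / (1 - gamma))

n_o = 100_000
a_raw, b_raw = powerlaw(n_o), powerlaw(n_o)
pairings = {
    "positive":     (np.sort(a_raw), np.sort(b_raw)),
    "negative":     (np.sort(a_raw), np.sort(b_raw)[::-1]),
    "uncorrelated": (a_raw, rng.permutation(b_raw)),
}
for name, (aa, ab) in pairings.items():
    # cross-layer activity correlation term entering the p = 1 threshold
    print(f"{name:>12}: <a_A a_B> = {np.mean(aa * ab):.4e}")
```

the positive pairing yields the largest cross moment and the negative pairing the smallest, while the uncorrelated pairing sits close to <a>^2, which is consistent with the observation above that positive correlations lower the threshold whereas the uncorrelated and negatively correlated cases are nearly indistinguishable.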
in fact, due to the heterogeneous nature of the activity distributions, except for few nodes in the tails, the effective difference between the activities matched in reverse or random order is not large, for the majority of nodes. in figs. 2(b) and 2(c) we show the behavior of the threshold as function of the activity exponents and the number of links created by active nodes in the two layers. for a given distribution of activity in a layer, increasing the exponent in the other (thus reducing the heterogeneity in the activity distribution) results in an increase of the threshold. this is due to the change of the first and second moments which decrease as a result of the reduced heterogeneity. in the settings considered here, if both exponents of activity distributions are larger than 2.6, the critical value of λ becomes larger than 1, as shown in fig. 2(b) . thus, in such region of parameters, the disease will not be able to spread. for a given number of links created in a layer by each active node, increasing the links created in the other layer results in a quite rapid reduction of the threshold. this is due to the increase of the connectivity and thus the spreading potential of active nodes. in the limit p = 1, we are able to obtain an expression for the threshold of an sis process unfolding on m layers. it is important to stress that this scenario is rather unrealistic. indeed, in real multiplexes the majority of nodes is present only in one or two layers [9] . nevertheless, we can argue that understanding the behavior of the threshold also in this case is of theoretical interest. with this observation in mind, the analytical condition for the spreading of the disease can be written as (see the appendix for details) where x = [a, b, . . . , z] and z > y implies an alphabetical ordering. the first observation is that in case h(a y , a z ) = f (a y )δ(a z − a y ) ∀ y, z ∈ x, thus the activity is the same for each node across each layer, eq. (8) reduces to which is the threshold for a single layer activity-driven network in which m → mm. this is the generalization of the correspondence between the two thresholds we discussed above for two layers. the second observation is that, in general, increasing the number of layers decreases the epidemic threshold. indeed, each new layer increases the connectivity potential of each node and thus the fragility of the system to the contagion process. figure 3(a) shows the analytical behavior of the epidemic threshold up to m = 10 for the simplest case of uncorrelated and positively correlated activities between layers confirming this result. in fig. 3(b) we show the comparison between the analytical results and the numerical simulations. the plot shows a perfect match between the two. furthermore, the two plots confirm the effects of positive correlations which facilitate the spreading of the disease. we now turn the attention to the most interesting and realistic cases which are different from the two limits of null and total coupling of nodes considered above. for a general value of p, we could not find a general closed expression for the epidemic threshold. however, the condition for the spreading can be obtained by investigating, numerically, the spectral properties of the jacobian (see the appendix for details). in fig. 4 we show the lifetime of sis spreading processes unfolding on a multiplex network for three different values of p. figure 4(a) shows the uncorrelated case and the dashed vertical lines describe the analytical predictions. 
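as a small numerical companion to eqs. (8)-(9) discussed above, the snippet below evaluates the p = 1, identical-activities special case, in which the threshold reduces to the single-layer expression with m replaced by m*M, and shows how it decreases with the number of layers. the formula is applied only under the stated assumptions (perfect coupling, identical activities and identical m across layers); the activity sample and parameter values are illustrative.

```python
import numpy as np

def multilayer_threshold(activities, m, n_layers, mu=1.0):
    """p = 1 threshold when every node has the same activity in every layer,
    i.e. the single-layer expression with m -> m * M (cf. eq. (9))."""
    a1 = activities.mean()
    sqrt_a2 = np.sqrt((activities ** 2).mean())
    return mu / (m * n_layers * (a1 + sqrt_a2))

acts = np.random.default_rng(5).uniform(1e-3, 1.0, 50_000)
for M in range(1, 11):
    print(M, multilayer_threshold(acts, m=1, n_layers=M))
```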
the first observation is that the larger the multiplexity between two layers, the smaller the threshold. this should not come as a surprise. in fact, as shown previously in fig. 1 , the average connectivity in the system increases as a function of p. thus, increasing the fraction of nodes active in both layers increases the spreading power of such nodes when they get infected. the second observation is that the analytical predictions match remarkably well the simulations. the bottom panel shows instead the case of positive correlation between the activities of coupling nodes in the two layers. also in this case, the larger the multiplexity the smaller the epidemic threshold. the comparison between the two panels highlights it is important to notice that also here our analytical predictions match remarkably well the numerical simulations. in fig. 5 we show the behavior of the (analytical) epidemic threshold as function of p for three types of correlations. the results confirm what is discussed above. the larger the multiplexity, the smaller the threshold. negative and null correlations of coupling nodes exhibit very similar thresholds. instead, positive correlations push the critical value to smaller values. furthermore, the smaller the multiplexity, the smaller the effect of positive correlations as the difference between the thresholds increases as function of p. this is a reassuring result. indeed, as mentioned above, among the properties of real multiplex systems we find the presence of positive topological and temporal correlations as well as low values of multiplexity. the first feature favors the spreading of diseases. luckily instead, the second property reduces the advantage of positive correlations pushing the threshold to higher values which are closer to the case of negative and null correlations. it is also important to notice how the threshold of a multiplex network (p > 0) is always smaller than the threshold of a monoplex (p = 0) with the same features. indeed, the presence of coupling nodes effectively increases the spreading potential of the disease, thus reducing the threshold. however, the presence of few coupling nodes (p ∼ 0) does not significantly change the threshold; this result and the effect of multiplexity on the spreading power of diseases is in line with what was already discussed in the literature for static multiplexes [1, 22] . in fig. 6 , we show how the epidemic threshold varies when the average connectivity of the two layers is progressively different and asymmetric. in other words, we investigate what happens when one layer has a much larger average connectivity than the other. this situation simulates individuals engaged in two different social contexts, one characterized by fewer interactions (e.g., close family interactions) and one instead by many more connections (e.g., work environment). in the figure, we consider a multiplex network in which the layer a is characterized by m a = 1. we then let m b vary from 1 to 10 and measure the impact of this variation on the epidemic threshold for different values of p. for simplicity, we considered the case of uncorrelated activities in the two layers, but the results qualitatively hold also for the other types of correlations. few observations are in order. as expected, the case p = 0 is the upper bound of the epidemic threshold. however, the larger the asymmetry between the two layers, thus, the larger the average connectivity in the layer b, the smaller the effect of the multiplexity on the threshold. 
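the coupled threshold for these asymmetric settings requires the numerical jacobian approach sketched earlier; as a rough, purely indicative companion to the sweep just described, the snippet below only lists the independent-layer (p = 0) thresholds of the two layers as m_b grows, which already shows how quickly the denser layer's own critical point drops and hence why it ends up driving the coupled system. the activity sample, mu, and the assumption that both layers share the same activity distribution are choices made for the example.

```python
import numpy as np

def layer_threshold(activities, m, mu):
    # independent-layer (p = 0) threshold of a single activity-driven layer
    return mu / (m * (activities.mean() + np.sqrt((activities ** 2).mean())))

rng = np.random.default_rng(6)
acts = rng.uniform(1e-3, 1.0, 50_000)     # same activity sample for both layers
mu, m_a = 0.015, 1
for m_b in range(1, 11):
    print(f"m_b = {m_b:2d}  layer-A bound = {layer_threshold(acts, m_a, mu):.4f}"
          f"  layer-B bound = {layer_threshold(acts, m_b, mu):.4f}")
```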
indeed, while systems characterized by m b = 1 and higher multiplexity feature a significantly smaller threshold with respect to the monoplex, for m b 3 such differences become progressively negligible and the effects of multiplexity vanish. in this regime, the layer with the largest average connectivity drives the spreading of the disease. the connectivity of layer b effectively determines the dynamics of the contagion, and thus the critical behavior is not influenced by coupling nodes. interestingly, this result generalizes to the case of time-varying systems which have been found in the case of static multiplexes. indeed, cozzo et al. [26] showed how the threshold of contact based processes is driven by the dominant layer which roughly corresponds to the layer featuring higher connectivity. in order to get a deeper understanding on this phenomenon, we show the asymptotic number of infected nodes in each layer for m a = 1 and m b = 10 in fig. 7 . for any λ above the threshold the fraction of infected nodes in layer b [ fig. 7(b) ] is larger than in layer a [ fig. 7(a) ] and is independent of the fraction of coupling nodes. as discussed above, in these settings the layer b is driving the contagion process and the imbalance between the connectivity patterns is large enough to behave as a monoplex. however, for layer a the contagion process is still highly influenced by p. indeed, as the fraction of coupling nodes increases, layer a is more and more influenced by the contagion process unfolding in b. overall, these results are qualitatively similar to the literature of spreading phenomena in static multilayer networks [14] . we presented a time-varying model of multiplex networks. the intralayer temporal dynamics follows the activity-driven framework which was developed for single layered networks (i.e., monoplexes). thus, nodes are endowed with an activity that describes their propensity, per unit time, to initiate social interactions. we define the multiplexity as a free parameter p regulating the fraction of coupling nodes between layers. the activities of such nodes are considered, in general, to be different but potentially correlated. in these settings, we studied how multiplexity and temporal connectivity patterns affect dynamical processes unfolding on such systems. to this end, we considered a prototypical model of infectious diseases: the sis model. we derived analytically the epidemic threshold of the process as function of p. in the limit p = 0, the system is constituted by disconnected networks that behave as monoplexes. in the opposite limit instead (i.e., p = 1) the epidemic threshold is a function of the first and second moments of the activity distributions as well as by their correlations across layers. we found that systems characterized by positive correlations are much more fragile to the spreading of the contagion process with respect to negative and null correlations. as several real multiplex systems feature positive topological and temporal correlations [9] [10] [11] , this result depicts a worrying scenario. luckily, real multiplexes are also sparse, thus characterized by multiplexity values far from limit p = 1 [9] . the threshold also varies as a function of the number of layers m. indeed, with perfect coupling each node is present and potentially active in each layer. thus, the larger m, the smaller the epidemic threshold as the spreading potential of each node increases. 
in the general and more realistic case 0 < p < 1, we could not find a closed expression for the epidemic threshold. however, the critical conditions for the spreading can be calculated from the theory by investigating numerically the spectral properties of the jacobian matrix describing the contagion dynamics. also in this case, positive correlations of activities across layers help the spreading by lowering the epidemic threshold, while negative and null correlations result in very similar thresholds. moreover, the lower the multiplexity, the larger the epidemic threshold. indeed, the case of disconnected monoplexes (i.e., p = 0) is the upper bound for the threshold. furthermore, the difference between the thresholds in the case of positive and the other two types of correlations decreases by lowering the multiplexity. considering the features of real multiplexes, this is a rather reassuring result. in fact, on one side the spreading is favored by positive correlations. on the other, the effect of such correlations is far less important for small values of multiplexity, which are typical of real systems. interestingly, the role of the multiplexity is drastically reduced in case the average connectivity in one layer is much larger than the other. in this scenario, which mimics the possible asymmetry in the contact patterns typical of different social contexts (e.g., family or work environment), one layer drives the contagion dynamics and the epidemic threshold is indistinguishable from the monoplex. however, the multiplexity is still significantly important in the other layer as the fraction of nodes present in both layers largely determines the spreading dynamics. some of these results are qualitatively in line with the literature of contagion processes unfolding on static and annealed multiplexes. however, as known in the case of single layered graphs, time-varying dynamics induces large quantitative differences [27, 28] . indeed, the concurrency and order of connections are crucial ingredients for the spreading and neglecting them, in favor of static and annealed representations, generally results in smaller thresholds. while the limits of timescale separation might be relevant to describe certain types of processes, they might lead to large overestimation of the spreading potential of contagion phenomena. the model presented here comes with several limitations. in fact, we considered the simplest version of the activitydriven framework in which, at each time step, links are created randomly. future work could explore the role of more realistic connectivity patterns in which nodes activate more likely a subset of (strong) ties and/or nodes are part of communities of tightly linked individuals. furthermore, we assumed that the activation process is poissonian and the activity of each node is not a function of time. future work could explore more realistic dynamics considering bursty activation and aging processes. all these features of real time-varying networks have been studied at length in the literature of single layered networks but their interplay with multiplexity when dynamical processes are concerned is still unexplored. thus, the results presented here are a step towards the understanding of the temporal properties of multiplex networks and their impact on contagion processes unfolding on their fabric. integrating over all activity spectrum of eq. (4), it obtains the following equation: thus, eq. 
(a1) can be further simplified as and integrating over all activity spectrum of eq. (5), it obtains the following equation: multiplying both sides of eq. (4) by a x , and integrating over all activity spectrum, we get the following equation: replacing x with a and b in eq. (a6), respectively, we have and in the same way, multiplying both sides of eq. (5) by a x and integrating over all activity spectrum, it obtains the following two equations: when x is replaced with a and b, respectively. when the system enters the steady state, we have . thus, the critical condition is determined by the following jacobian matrix: if the largest eigenvalue of j is larger than zero, the epidemic will outbreak. otherwise, the epidemic will die out. specifically, if p = 0, two layers are independent, thus regulated by two independent jacobians, and we can get the following two eigenvalues: for 0 < p < 1, we could not find a general analytical expression for the eigenvalues of j . however, the critical transmission rate can be determined by finding the value of λ leading the largest eigenvalue of j to zero. in other words, rather than solving explicitly the characteristic polynomial |j − i | = 0 and defining the condition for the spreading max > 0 as done above, we can determine the critical value of λ as the value corresponding to the largest eigenvalue to be zero [76, 77] . · · · · · · · · · · · · multilayer networks. structure and function multilayer networks mathematical formulation of multilayer networks the structure and dynamics of multilayer networks multiplex networks: basic formalism and structural properties social network analysis: methods and applications multiplexity in adult friendships social capital in the creation of human capital measuring and modeling correlations in multiplex networks quantifying dynamical spillover in co-evolving multiplex networks effects of temporal correlations in social multiplex networks the physics of spreading processes in multilayer networks spreading processes in multilayer networks diffusion dynamics on multiplex networks global stability for epidemic models on multiplex networks epidemic spreading and bond percolation on multilayer networks optimal percolation on multiplex networks resource control of epidemic spreading through a multilayer network dynamical interplay between awareness and epidemic spreading in multiplex networks epidemic spreading and risk perception in multiplex networks: a self-organized percolation method multiple routes transmitted epidemics on multiplex networks epidemics in partially overlapped multiplex networks immunization of epidemics in multiplex networks asymmetrically interacting spreading dynamics on complex layered networks conditions for viral influence spreading through multiplex correlated social networks contactbased social contagion in multiplex networks temporal networks modern temporal network theory: a colloquium activity driven modeling of time-varying networks time varying networks and the weakness of strong ties asymptotic theory of time-varying social networks with heterogeneous activity and tie allocation from calls to communities: a model for time-varying social networks the dynamical strength of social ties in information spreading persistence and periodicity in a dynamic proximity network what's in a crowd? 
analysis of face-to-face behavioral networks from seconds to months: an overview of multi-scale dynamics of mobile telephone calls persistence of social signatures in human communication fundamental structures of dynamic social networks face-to-face interactions random walks and search in time varying networks quantifying the effect of temporal resolution on time-varying networks controlling contagion processes in activity driven networks contagion dynamics in time-varying metapopulation networks epidemic spreading in time-varying community networks immunization strategies for epidemic processes in time-varying contact networks random walks on temporal networks analytical computation of the epidemic threshold on temporal networks causality driven slow-down and speed-up of diffusion in non-markovian temporal networks spatio-temporal networks: reachability, centrality and robustness random walk centrality for temporal networks importance of individual events in temporal networks bursts of vertex activation and epidemics in evolving networks attractiveness and activity in internet communities contrasting effects of strong ties on sir and sis processes in temporal networks committed activists and the reshaping of status-quo social consensus betweenness preference: quantifying correlations in the topological dynamics of temporal networks bursty communication patterns facilitate spreading in a threshold-based epidemic dynamics birth and death of links control disease spreading in empirical contact networks the basic reproduction number as a predictor for epidemic outbreaks in temporal networks statistical physics of vaccination social phenomena: from data analysis to models burstiness and tie reinforcement in time-varying social networks random walks on activity-driven networks with attractiveness epidemic spreading in modular time-varying networks modeling infectious disease in humans and animals modeling dynamical processes in complex socio-technical systems epidemic processes in complex networks the role of endogenous and exogenous mechanisms in the formation of r&d networks networks. an introduction topological properties of a time-integrated activity-driven network structural measures for multiplex networks nature of the epidemic threshold for the susceptible-infected-susceptible dynamics in networks telling tails explain the discrepancy in sexual partner reports concurrent partnerships and transmission dynamics in networks dynamical processes on complex networks a unified approach to percolation processes on multiplex networks clustering determines the dynamics of complex contagions in multiplex networks further, the maximum eigenvalue of matrix j m can be calculated asthus, the critical transmission rate isfurther, if the activities of the same node in each layer are the same, the above equation can be simplified as follows:(a20) key: cord-028685-b1eju2z7 authors: fuentes, ivett; pina, arian; nápoles, gonzalo; arco, leticia; vanhoof, koen title: rough net approach for community detection analysis in complex networks date: 2020-06-10 journal: rough sets doi: 10.1007/978-3-030-52705-1_30 sha: doc_id: 28685 cord_uid: b1eju2z7 rough set theory has many interesting applications in circumstances characterized by vagueness. in this paper, the applications of rough set theory in community detection analysis are discussed based on the rough net definition. we will focus the application of rough net on community detection validity in both monoplex and multiplex networks. 
also, the topological evolution estimation between adjacent layers in dynamic networks is discussed and a new community interaction visualization approach combining both complex network representation and rough net definition is adopted to interpret the community structure. we provide some examples that illustrate how the rough net definition can be used to analyze the properties of the community structure in real-world networks, including dynamic networks. complex networks have proved to be a useful tool to model a variety of complex systems in different domains including sociology, biology, ethology and computer science. most studies until recently have focused on analyzing simple static networks, named monoplex networks [7, 17, 18] . however, most of real-world complex networks are dynamics. for that reason, multiplex networks have been recently proposed as a mean to capture this high level complexity in real-world complex systems over time [19] . in both monoplex and multiplex networks the key feature of the analysis is the community structure detection [11, 19] . community detection (cd) analysis consists of identifying dense subgraphs whose nodes are densely connected within itself, but sparsely connected with the rest of the network [9] . cd in monoplex networks is a very similar task to classical clustering, with one main difference though. when considering complex networks, the objects of interest are nodes, and the information used to perform the partition is the network topology. in other words, instead of considering some individual information (attributes) like for clustering analysis, cd algorithms take advantage of the relational one (links). however, the result is the same in both: a partition of objects (nodes), which is called community structure [9] . several cd methods have been proposed for monoplex networks [7, 8, 12, [16] [17] [18] . also, different approaches have been recently emerged to cope with this problem in the context of multiplex networks [10, 11] with the purpose of obtaining a unique community structure involving all interactions throughout the layers. we can classify latter existing approaches into two broad classes: (i) by transforming into a problem of cd in simple networks [6, 9] or (ii) by extending existing algorithms to deal directly with multiplex networks [3, 10] . however, the high-level complexity in real-world networks in terms of the number of nodes, links and layers, and the unknown reference of classification in real domain convert the evaluation of cd in a very difficult task. to solve this problem, several quality measures (internal and external) have emerged [2, 13] . due to the performance may be judged differently depending on which measure is used, several measures should be used to be more confident in results. although, the modularity is the most widely used, it suffers the resolution limit problem [9] . another goal of the cd analysis is the understanding of the structure evolution in dynamic networks, which is a special type of multiplex that requires not only discovering the structure but also offering interpretability about the structure changes. rough set theory (rst), introduced by pawlak [15] , has often proved to be an excellent tool for analyzing the quality of information, which means inconsistency or ambiguity that follows from information granulation in a knowledge system [14] . to apply the advantages of rst in some fields of cd analysis, the goal of our research is to define the new rough net concept. 
rough net is defined starting from a community structure discovered by cd algorithms applied to monoplex or multiplex networks. this concept allows us obtaining the upper and lower approximations of each community, as well as, their accuracy and quality. in this paper, we will focus the application of the rough net concept on cd validity and topological evolution estimation in dynamic networks. also, this concept supports visualizing the interactions of the detected communities. this paper is organized as follows. section 2 presents the general concepts about the extended rst and its measures for evaluating decision systems. we propose the definition of rough net in sect. 3. section 4 explains the applications of rough net in the community detection analysis in complex networks. besides, a new approach for visualizing the interactions between communities based on rough net is provided in sect. 4. in sect. 5, we illustrate how the rough net definition can be used to analyze the properties of the community structure in real-world networks, including dynamic networks. finally, sect. 6 concludes the paper and discusses future research. the rough sets philosophy is based on the assumption that with every object of the universe u there is associated a certain amount of knowledge expressed through some attributes a used for object description. objects having the same description are indiscernible with respect to the available information. the indiscernibility relation r induces a partition of the universe into blocks of indiscernible objects resulting in information granulation, that can be used to build knowledge. the extended rst considers that objects which are not indiscernible but similar can be grouped in the same class [14] . the aim is to construct a similarity relation r from the relation r by relaxing the original indiscernibility conditions. this relaxation can be performed in many ways, thus giving many possible definitions for similarity. due to that r is not imposed to be symmetric and transitive, an object may belong to different similarity classes simultaneously. it means that r induces a covering on u instead of a partition. however, any similarity relation is reflexive. the rough approximation of a set x ⊆ u , using the similarity relation r , has been introduced as a pair of sets called rlower (r * ) and r -upper (r * ) approximations of x. a general definition of these approximations which can handle any reflexive r are defined respectively by eqs. (1) and (2). the extended rst offers some measures to analyze decision systems, such as the accuracy and quality of approximation and quality of classification measures. the accuracy of approximation of a rough set x, where |x| denotes the cardinality of x = ∅, offers a numerical characterization of x. equation (3) formalizes this measure such that 0 ≤ α(x) ≤ 1. if α(x) = 1, x is crisp (exact) with respect to the set of attributes, if α(x) < 1, x is rough (vague) with respect to the set of attributes. the quality of approximation formalized in eq. (4) expresses the percentage of objects which can be correctly classified into the class x. [14] . quality of classification expresses the proportion of objects which can be correctly classified in the system; equation (5) formalizes this coefficient where c 1 , · · · , c m correspond to the decision classes of the decision system ds. notice that if the quality of classification value is equal to 1, then ds is consistent, otherwise is inconsistent [14] . 
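the bodies of eqs. (1)-(5) are not reproduced in the extracted text, so the sketch below follows one standard formulation of the similarity-based approximations that is consistent with the surrounding description: the lower approximation keeps the objects whose whole similarity class falls inside the set, the upper approximation collects the similarity classes of the set's members, the accuracy is the ratio of their sizes (so it lies in [0, 1] and equals 1 exactly when the set is crisp), the quality is the fraction of the set that can be classified with certainty, and the quality of classification aggregates the lower approximations over all decision classes. the function names and the dictionary representation of the similarity classes are assumptions.

```python
from itertools import chain

def lower_upper(similarity_classes, X):
    """one common formulation of the lower and upper approximations of a set X
    under a reflexive similarity relation; `similarity_classes` maps each object
    of the universe to the set of objects similar to it (itself included)."""
    X = set(X)
    lower = {x for x, cls in similarity_classes.items() if cls <= X}
    upper = set(chain.from_iterable(similarity_classes[x] for x in X))
    return lower, upper

def accuracy(similarity_classes, X):
    lower, upper = lower_upper(similarity_classes, X)
    return len(lower) / len(upper) if upper else 1.0   # cf. eq. (3)

def quality(similarity_classes, X):
    lower, _ = lower_upper(similarity_classes, X)
    return len(lower) / len(X)                         # cf. eq. (4)

def quality_of_classification(similarity_classes, classes, universe):
    # proportion of objects of the universe classified with certainty, cf. eq. (5)
    return sum(len(lower_upper(similarity_classes, c)[0]) for c in classes) / len(universe)
```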
equation (6) shows the accuracy of classification, which averages the accuracy over the classes; its weighted version, in which classes can have different importance levels, is formalized in eq. (7) [4] . monoplex (simple) networks can be represented as graphs g = (v, e), where v represents the vertices (nodes) and e represents the edges (interactions) between these nodes in the network. multiplex networks have multiple layers, where each layer is a monoplex network. formally, a multiplex network can be defined as a triplet < v, e, l >, where e = {e i } is the collection of edge sets, e i corresponds to the interactions in the i-th layer, and l is the number of layers. this extension of the graph model is powerful enough to allow modeling different types of networks, including dynamic and attributed networks [9] . cd algorithms exploit the topological structure to discover a collection of dense subgraphs (communities). several multiplex cd approaches emphasize how to obtain a unique community structure throughout all layers, by considering as similar those nodes that behave similarly in most of the layers [3, 10] . in the context of dynamic networks, the goal is to detect the conformation by layers in order to characterize the evolutionary or stationary properties of the cd structures. because the quality of the community structure may be judged differently depending on which measure is used, several measures should be used to be more confident in the results [9] . in this section, we recall some basic notions related to the extension of rst to complex networks. we also focus on introducing the rough net concept by extrapolating these notions to the analysis of the consistency of the detected communities in complex networks. this concept supports validating, visualizing, interpreting and understanding the communities and also their evolution. besides, it has a potential application in labeling and refining the detected communities. as was mentioned, it is necessary to start from the definition of the decision system, the similarity relation, and the basic concepts of lower and upper approximations. we use a similarity relation r' in our definition of rough net because two nodes of v can be similar but not equal. the similarity class of the node x is denoted by r'(x), as shown in eq. (8) . the r'-lower and r'-upper approximations of each similarity class are computed by eqs. (1) and (2), respectively. there is a variety of distances and similarities for comparing nodes [1] , such as salton, the hub depressed index (hdi), the hub promoted index (hpi) and other similarities based on the topological structure, as well as the dice and cosine coefficients, which capture the attribute relations. in this paper, we use the jaccard similarity for computing the similarities based on the topological structure because it has the attraction of simplicity and normalization. the jaccard similarity, which also allows us to emphasize the network topology necessary to apply rst in complex networks, is defined in eq. (9), where γ(x) denotes the neighborhood of the node x, including x itself. an adjacency tensor for a monoplex (i.e., single layer) network can be reduced to an adjacency matrix. the topological relation between nodes comprises an |v| × |v| adjacency matrix m, in which each entry m i,j indicates the relationship between nodes i and j, weighted or not.
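a minimal sketch of the jaccard-based similarity classes of eqs. (8)-(9) is given below: two nodes are placed in the same similarity class when the jaccard coefficient of their closed neighbourhoods (the node itself included) reaches the threshold ξ. the function name and the boolean-matrix representation of the adjacency structure are assumptions; the snippet is meant to be used together with the rough-set helpers sketched earlier.

```python
import numpy as np

def jaccard_similarity_classes(adj, xi):
    """similarity classes r'(x) built from the jaccard coefficient of closed
    neighbourhoods; nodes x and y are considered similar when the coefficient
    is at least xi (the relation is reflexive because jaccard(x, x) = 1)."""
    n = adj.shape[0]
    closed = adj.astype(bool) | np.eye(n, dtype=bool)   # gamma(x): neighbourhood incl. x
    classes = {}
    for x in range(n):
        inter = (closed[x] & closed).sum(axis=1)
        union = (closed[x] | closed).sum(axis=1)
        classes[x] = set(np.nonzero(inter / union >= xi)[0].tolist())
    return classes
```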
the weight can be obtained as a result of the application of both a flattening process in a multi-relational network or a network construction schema when we want to apply network-based learning methods to vector-based datasets. if we apply some cd algorithm to this adjacency matrix, then we can consider the combination of the topological structure and the cd results as a decision system where a is a finite set of topological or non-topological features and d / ∈ a is the decision attribute resulting from the detected communities over the network. m ultiplex are powerful enough though to allow modeling different types of networks including multi-relational, attributed and dynamic networks [11] . note that multiplex networks explicitly incorporate multiple channels of connectivity in which entities can have a different set of neighbors in each layer. in a dynamic network each layer corresponds to the network state at a given time-stamp (or each layer represents a snapshot). like a time-series analysis, if attributes are captured in each time, a complex network can be represented as a dynamic network [19] . an adjacency tensor for a dynamic network with dimension l, which corresponds to the number of layers, represents a collection of adjacency matrices. the topological interaction between nodes within each layer k-th of a multiplex network comprises an |v | × |v | adjacency matrix m k , in which each entry m k ij indicates the relationships between nodes i and j in the k-th layer. if we apply a cd algorithm to the whole multiplex network topology by considering multiplex cd approaches [10, 19] in order to compute the unique final community structure, then we can consider the application of rst concepts over the multiplex network as the aggregation of the application of the rst concepts over each layer k-th. consequently, the decision system for the k-th layer is the combination of the topological structure m k and the cd results, formalized as where a k is a finite set of topological or non-topological features in the k-th layer and d / ∈ a is the decision attribute resulting from the detected communities in the multiplex topology (i.e., each node and their counterpart in each layer represent a unique node that belongs to a specific community). besides, it is possible to transform a multiplex into a monoplex network by a flattening process. the main flatten approaches are the binary flatten, the weighted flatten and another based on deep learning [10] . taking into account these variants, we can consider the combination of the topological structure of the transformed network and the cd results as a decision system ds monoplex = (v, a ∪ d), where a = k∈l a k is a finite set of topological or non-topological features that characterize the networks and d / ∈ a is the decision attribute resulting from the detected communities. the multiple instance or ensemble similarity measures are powerful for computing the similarity between nodes taking into account the similarity per layers (contexts). in this section, we describe the application of rough net in important tasks of the cd analysis: the validation and visualization of detected communities and their interactions, and the evolutionary estimation in dynamic networks. a community can be defined as a subgraph whose nodes are densely connected within itself, but sparsely connected with the rest of the network, though other patterns are possible. 
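before detailing these tasks, a compact sketch of the kind of computation involved in the validation is given below (the formal procedure is given as algorithm 1 in what follows): for each layer's adjacency matrix, the similarity classes are built from the topology, each detected community is approximated from below and above, and the per-community accuracy and quality are aggregated into classification-level measures. the snippet reuses the `jaccard_similarity_classes` and `lower_upper` helpers sketched above, and the data-structure choices (a list of adjacency matrices, a dictionary from community labels to node sets) are assumptions, not the authors' interface.

```python
import numpy as np

def community_validity(adj_layers, communities, xi):
    """rough-net style validation sketch: `adj_layers` is a list of adjacency
    matrices (a single matrix for a monoplex); `communities` maps each community
    label to the set of node identifiers assigned to it (the decision attribute)."""
    results = []
    n = adj_layers[0].shape[0]
    for adj in adj_layers:
        classes = jaccard_similarity_classes(adj, xi)     # helper sketched earlier
        acc, qual, lower_total = {}, {}, 0
        for label, members in communities.items():
            lower, upper = lower_upper(classes, members)  # helper sketched earlier
            acc[label] = len(lower) / len(upper) if upper else 1.0
            qual[label] = len(lower) / len(members)
            lower_total += len(lower)
        results.append({
            "accuracy_per_community": acc,
            "quality_per_community": qual,
            "quality_of_classification": lower_total / n,
            "accuracy_of_classification": float(np.mean(list(acc.values()))),
        })
    return results
```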
the existence of communities implies that nodes interact more strongly with the other members of their community than they do with nodes of the other communities. consequently, there is a preferential linking pattern between nodes of the same community (being modularity [13] one of the most used internal measures [9] ). this is the reason why link densities end up being higher within communities than between them. although the modularity is the most widely quality measure used in complex networks, it suffers the resolution limit problem [9] and, therefore, it is unable to judge in a correct way community structure of the networks with small communities or where communities may be very heterogeneous in size, especially if the network is large. several methods and measures have been proposed to detect and evaluate communities in both monoplex and multiplex networks [2, 3, 13] . as well as modularity, normalized mutual information (nmi), adjusted rand (ar), rand, variation of information (vi) measures [2] are widely used, but the latter ones need an obtain the similarity class r k (x) based on equation (8) 12: for x in c[k] do 13: calculate r k * (x) and r * k (x) approximations (see equations (1)-(2)) 14: calculate α(x) and γ(x) approximation measures (see equations (3) external reference classification to produce a result. however, it is very difficult to evaluate a community result because the major of complex networks occur in real world situations since reference classifications are usually not available. we propose to use quality, accuracy and weighted accuracy of classification measures described in sect. 2 to validate community results, taking into account the application of accuracy and quality of approximation measures to validate each community structure. aiming at providing more insights about the validation, we provide a general procedure based on rough net. notice that r k (x) is computed by considering the attributes or topological features of networks in the k-th layer, by using eq. (8). algorithm 1 allows us to measure the quality of the community structure using rough net, by considering the quality and precision of each community. rough net allows judging the quality of the cd by measuring the vagueness of each community. for that reason, if boundary regions are smaller, then we will obtain better results of quality, accuracy and weighted accuracy of classification measures. a huge of real-world complex networks are dynamic in nature and change over time. the change can be usually observed in the birth or death of interactions within the network over time. in a dynamic network is expected that nodes of the same community have a higher probability to form links with their partners than input: two-consecutive layers cl of g, a threshold ξ and a similarity s output: the evolutionary estimations obtain the similarity class r k−i (x) based on equation (8) 8: calculate r k−i * (x) and r * k−i (x) approximations (see equations (1)-(2)) 9: calculate α k−i (x) and γ k−i (x) measures (see equations (3) with other nodes [19] . for that reason, the key feature of the community detection analysis in dynamic networks is the evolution of communities over time. several methods have been proposed to detect these communities over time for specific time-stamp windows [3, 10] . often more than one community structure is required to judge if the network topology has suffered transformation over time for specific window size. 
to the best of our knowledge, there is no measure able which captures this aspect. for that reason, in this paper, we propose measures based on the average of quality, accuracy and weighted accuracy of classification for estimating in a real number the change level during a specific window timestamp. we need to consider two-consecutive layers for computing the quality, accuracy and weighted accuracy of classification measures in the evolutionary estimation (see algorithm 2) . for that reason, we need to apply twice the rough net concept for each pair of layers. the former rough net application is based on the decision system ds = (v, a k ∪ d k−1 ), where a k is a set of topological attributes in the layer k and d k−1 / ∈ a k is the result of the community detection algorithm in the layer k − 1 (decision attribute). the latter rough net application is based on the decision system ds = (v, a k−1 ∪ d k ), where a k−1 is a set of topological attributes in the layer k − 1 and d k / ∈ a k−1 is the result of the community detection algorithm in the layer k (decision attribute). the measures can be applied over a window size k by considering the aggregation of the quality classification between all pairs of consecutive (adjacency) layers. values nearer to 0 express the topology is evolving over time. in many applications more than a unique real value that expresses the quality of the community conformation is required for the understanding of the interactions throughout the networks. besides, real-world complex networks usually are input: a complex network g, detected communities, a threshold ξ and a similarity s output: community network representation 1: create an empty network g (v , e ) 2: for x in v do 3: obtain the similarity class r (x) based on equation (8) 4: end for 5: for x in communities(g, d) do 6: calculate r * (x) and r * (x) approximations (see equations (1)-(2)) 7: calculate α(x) and γ(x) approximation measures (see equations (3)-(4)) 8: add a new node x where the size corresponds to quality or accuracy 9: end for 10: for x, y in communities(g), x = y do 11: calculate the similarity sbn between communities x-th and y -th 12: add a new edge (i, y, wxy ) where the weighted wij = sbn (x, y ) 13: end for composed by many nodes, edges, and communities, making difficult to interpret the obtained results. thus, we propose a new approach for visualizing the interactions between communities taking into account the quality of the community structure by using the combination of the rough net definition and the complex network representation. our proposal, formalized in algorithm 3, allows us to represent the quality of the community structure in an interpretable way. the similarity measure used for weighted the interactions between communities in the network representation is formalized in eq. (10). the s bn (x, y ) captures the proportion of nodes members of the community x, which cannot be unambiguously classified into this community but belong to the community y and vice-versa. the above idea is computed based on the boundary region bn of both communities x and y . the rough net approach allows us to evaluate the interaction between the communities and its visualization facilitates interpretability. in turn, it helps experts redistribute communities and change granularity based on the application domain requirements. 
for illustrating the performance of the rough net definition in the community detection analysis, we apply it to three networks, two known to have monoplex topology and the third multiplex one. to be more confident in results, we should use several measures for judging the performance of a cd algorithm [2, 5] .thus, we compare our approach to validate detected communities (i.e., accuracy and quality of classification) with the most popular internal and external measures used for community detection validity: modularity, ar, nmi, rand, vi [2] . modularity [13] quantifies when the division is a good one, in the sense of having many within-community edges. it takes its largest value (1) in the trivial case where all nodes belong to a single community. a value near to 1 indicates strong community structure in the network. all other mentioned measures need external references for operating. all measures except vi, express the best result though values near to 1. for that reason, we use the notation vic for denoting the complement of vi measure (i.e., v ic = 1 − v i). zachary is the much-discussed network 1 of friendships between 34 members of a karate-club at a us university. figure 1 shows the community structures reported by the application of the standard cd algorithms label propagation (lp), multilevel louvain (lv), fast greedy optimization (fgo), leading eigenvector (ev), infomap (im) and walktrap (wt) to the zachary network. each community has been identified with a different colour. these algorithms detect communities, which mostly not correspond perfectly to the reference communities, except the lp algorithm which identically matches. for that reason, we can affirm that the lp algorithm reported the best division. however, in fig. 2 we can observe that the modularity values not distinguish the lp as the best conformation of nodes into communities, while the proposed accuracy and quality of classification measures based on the rough net definition, assign the higher value to the lp conformation regardless of the used threshold. on the other hand, our measures grant the lowest quality results for the community structure obtained by the ev algorithm as expected. notice that fgo and ev assign the orange node with high centrality in the orange community structure in a wrong manner. we can notice that most neighbors of this node are in another community. indeed, the fgo and wt are the following lowest results reported by our measures. figure 3 shows the performance reported by the application of the standard community detection algorithms before mentioned by using the proposed quality measures and the external ones. all measures exhibit the same monotony behaviors with independence of the selected similarity threshold ξ. our measures have the advantage that are internal and behave similarly to external measures. the jazz network 2 represents the collaboration between jazz musicians, where each node represents a jazz musician and interactions denote that two musicians are playing together in a band. six cd algorithms were applied to this network with the objective of subsequently exploring the behavior of validity measures. figure 4 displays that lp obtains a partition in which the number of interactions shared between nodes of different communities is smaller than the number of interactions shared between the communities obtained by the fgo algorithm. 
however, this behavior is not reflected in the estimation of the modularity values, while it manages to be captured by the proposed quality measures, as shown in fig. 5 . besides, the number of interactions shared between the communities detected by the algorithms lv, fgo, and ev is much greater than the number of interactions shared between the communities detected by the algorithms lp, wt, and im. therefore, this behavior was expected to be captured through the rough net definition. figure 5 shows that the results reported by our measures coincide with the expected results. on the one hand, we can observe that our quality measures exhibit a better performance than the modularity measure in this example. our measures also capture the presence of outliers, this is the reason why the community structure reported by the wt algorithm is higher than the obtained by the lp algorithm. caenorhabditis elegans connectome (celegans) is a multiplex network 3 that consists of layers corresponding to different synaptic junctions: electric (elec-trj), chemical monadic (monosyn), and polyadic (polysyn). figure 6 shows the mapping of the community structure in each network layer, which has been obtained by the application of the muxlod cd algorithm [10] . notice that a strong community structure result must correspond to a structure of densely connected subgraphs in each network layer. this reflexion property is not evident for these communities in the celegans network. for that reason, both the modularity and the proposed quality community detection measures obtain low results (modularity = 0.07, α(ξ = 0.25) = 0.24 and γ(ξ = 0.25) = 0.14). figure 7 shows the interactions between the communities in each layer by considering the muxlod community structure and the algorithm described in sect. 4.3. the community networks show high interconnections and as expected, the results of the quality measures are low. figure 7 shows that the topologies of the polysyn and electrj layers do not match exactly. in this sense, let us suppose without loss of generalization, that we want to estimate if there has been a change in the topology considering these layers as consecutive. to estimate these results, we apply the algorithm described in sect. 4.2. figure 8 shows the modularity, accuracy and quality of classification obtained values, which reflect that the community structure between layers does not completely match, so it can be concluded that the topology has evolved (changed). in this paper, we have described new quality measures for exploratory analysis of community structure in both monoplex and multiplex networks based on the rough net definition. the applications of rough net in community detection analysis demonstrate the potential of the proposed measures for judging the community detection quality. rough net allows us to asses the detected communities without requiring the referenced structure. besides, the proposed evolutionary estimation and the new approach for discovering the interactions between communities allows to the experts a deep understanding of complex real systems mainly based on the visualization of interactions. for the future work, we propose to extend the applications of rough net to the estimation of the community structure in the next time-stamp based on the refinement between adjacent layers in dynamic networks. 
a new scalable leader-community detection approach for community detection in social networks
surprise maximization reveals the community structure of complex networks
community detection in multidimensional networks
rough text assisting text mining: focus on document clustering validity
a novel community detection algorithm based on simplification of complex networks
abacus: frequent pattern mining-based community discovery in multidimensional networks
fast unfolding of communities in large networks
finding community structure in very large networks
mathematical formulation of multilayer networks
muma: a multiplex network analysis library
multiplex network mining: a brief survey
finding community structure in networks using the eigenvectors of matrices
mixture models and exploratory analysis in networks
incomplete information: rough set analysis
rough set theory and its applications to data analysis
computing communities in large networks using random walks
near linear time algorithm to detect community structures in large-scale networks
maps of information flow reveal community structure in complex networks (arxiv preprint physics)
complex network approaches to nonlinear time series analysis
key: cord-024552-hgowgq41 authors: zhang, ruixi; zen, remmy; xing, jifang; arsa, dewa made sri; saha, abhishek; bressan, stéphane title: hydrological process surrogate modelling and simulation with neural networks date: 2020-04-17 journal: advances in knowledge discovery and data mining doi: 10.1007/978-3-030-47436-2_34 sha: doc_id: 24552 cord_uid: hgowgq41
environmental sustainability is a major concern for urban and rural development. actors and stakeholders need economic, effective and efficient simulations in order to predict and evaluate the impact of development on the environment and the constraints that the environment imposes on development. numerical simulation models are usually computationally expensive and require expert knowledge. we consider the problem of hydrological modelling and simulation. with a training set consisting of pairs of inputs and outputs from an off-the-shelf simulator, we show that a neural network can learn a surrogate model effectively and efficiently and thus can be used as a surrogate simulation model. moreover, we argue that the neural network model, although trained on some example terrains, is generally capable of simulating terrains of different sizes and spatial characteristics. an article in the nikkei asian review dated 13 september 2019 warns that both the cities of jakarta and bangkok are sinking fast. these iconic examples are far from being the only human developments under threat. the united nations office for disaster risk reduction reports that the lives of millions were affected by the devastating floods in south asia and that around 1,200 people died in bangladesh, india and nepal [30]. climate change, increasing population density, weak infrastructure and poor urban planning are the factors that increase the risk of floods and aggravate their consequences in those areas. under such scenarios, urban and rural development stakeholders are increasingly concerned with the interactions between the environment and urban and rural development. in order to study such complex interactions, stakeholders need effective and efficient simulation tools. a flood occurs with a significant temporary increase in discharge of a body of water. among the variety of factors leading to floods, heavy rain is one of the most prevalent [17].
when heavy rain falls, water overflows from river channels and spills onto the adjacent floodplains [8]. the hydrological process from rainfall to flood is complex [13]. it involves nonlinear, time-varying interactions between rain, topography, soil types and other components associated with the physical process. several physics-based hydrological numerical simulation models, such as hec-ras [26], lisflood [32] and lisflood-fp [6], are commonly used to simulate floods. however, such models are usually computationally expensive, and expert knowledge is required both for their design and for accurate parameter tuning. we consider the problem of hydrological modelling and simulation. neural network models are known for their flexibility, efficient computation and capacity to deal with nonlinear correlations inside data. we propose to learn a flood surrogate model by training a neural network with pairs of inputs and outputs from the numerical model. we empirically demonstrate that the neural network can be used as a surrogate model to effectively and efficiently simulate the flood. the neural network that we train learns a general model: once trained on a given data set, it is capable of directly simulating spatially different terrains. moreover, while a neural network is generally constrained to a fixed input size, the model that we propose is able to simulate terrains of different sizes and spatial characteristics. this paper is structured as follows. section 2 summarises the main related works regarding physics-based hydrological and flood models as well as statistical machine learning models for flood simulation and prediction. section 3 presents our methodology. section 4 presents the data set, parameter settings and evaluation metrics. section 5 describes and evaluates the performance of the proposed models. section 6 presents the overall conclusions and outlines future directions for this work. current flood models simulate the fluid movement by solving equations derived from physical laws under many assumptions about the hydrological process. these models can be classified into one-dimensional (1d), two-dimensional (2d) and three-dimensional (3d) models, depending on the spatial representation of the flow. the 1d models treat the flow as one-dimensional along the river and solve the 1d saint-venant equations; examples are hec-ras [1] and swmm [25]. the 2d models receive the most attention and are perhaps the most widely used models for floods [28]. these models solve different approximations of the 2d saint-venant equations. two-dimensional models such as hec-ras 2d [9] have been implemented to simulate floods in the assiut plateau in southwestern egypt [12] and in the bolivian amazonia [23]. another 2d flow model, lisflood-fp, solves a dynamic wave model that neglects the advection term to reduce the computational complexity [7]. the 3d models are more complex and mostly unnecessary, as 2d models are adequate [28]. therefore, we focus our work on 2d flow models. instead of conceptual physics-based models, several statistical machine learning based models have been utilised [4, 21]. one state-of-the-art machine learning model is the neural network [27]. tompson [29] uses a combination of neural network models to accelerate the simulation of fluid flow. bar-sinai [5] uses neural network models to study the numerical solution of the partial differential equations of fluid flow in two dimensions.
raissi [24] developed physics-informed neural networks for solving general partial differential equations and tested them on the scenario of incompressible fluid movement. dwivedi [11] proposes a distributed version of physics-informed neural networks and studies the case of the navier-stokes equations for fluid movement. besides the idea of accelerating the computation of partial differential equations, some neural networks have been developed in an entirely data-driven manner. ghalkhani [14] develops a neural network for a flood forecasting and warning system in the madarsoo river basin in iran. khac-tien [16] combines a neural network with a fuzzy inference system for daily water level forecasting. other authors [31, 34] apply neural network models to predict floods from collected gauge measurements. those models, implementing neural networks in one dimension, do not take the spatial correlations into account. the authors of [18, 35] use combinations of convolutional and recurrent neural networks as surrogate models of navier-stokes-based fluid models in higher dimensions. the recent work [22] develops a convolutional neural network model to predict floods in two dimensions by taking the spatial correlations into account. the authors focus on one specific region of the colorado river. they use a convolutional neural network and a conditional generative adversarial network to predict the water level at the next time step, and conclude that neural networks can achieve high approximation accuracy while being a few orders of magnitude faster. instead of focusing on one specific region and learning a model specific to the corresponding terrain, our work focuses on learning a general surrogate model applicable to terrains of different sizes and spatial characteristics with a data-driven machine learning approach. we propose to train a neural network with pairs of inputs and outputs from an existing flood simulator; the output provides the necessary supervision. we choose the open-source python library landlab, which is lisflood-fp based. we first define our problem in subsect. 3.1. then, we introduce the general ideas of the numerical flood simulation model and landlab in subsect. 3.2. finally, we present our solution in subsect. 3.3. we first introduce the representation of the three hydrological parameters that we use in the two-dimensional flood model. a digital elevation model (dem) d is a w × l matrix representing the elevation of a terrain surface. a water level h is a w × l matrix representing the water elevation on the corresponding dem. a rainfall intensity i generally varies spatially and should be a matrix; however, the current simulator assumes that the rainfall does not vary spatially, so in our case i is a constant scalar. our work intends to find a model that can represent the flood process: the flood happens because the rain drives the water level to change over the terrain region. the model receives three inputs: a dem d, the water level h_t and the rainfall intensity i_t at the current time step t. the model outputs the water level h_{t+1} as the result of the rainfall i_t on the dem d. the learning process can be formulated as learning the function l such that h_{t+1} = l(d, h_t, i_t). physics-driven hydrology models for floods in two dimensions are usually based on the two-dimensional shallow water equations, a simplified version of the navier-stokes equations averaged over the depth direction [28].
by ignoring the diffusion of momentum due to viscosity, turbulence, wind effects and coriolis terms [10], the two-dimensional shallow water equations comprise two parts, conservation of mass and conservation of momentum, shown in eqs. 1 and 2, where h is the water depth, g is the gravity acceleration, (u, v) are the velocities in the x and y directions, z(x, y) is the topography elevation function and s_fx, s_fy are the friction slopes [33], which are estimated with the friction coefficient η. for the two-dimensional shallow water equations, there are no analytical solutions; therefore, many numerical approximations are used. lisflood-fp is a simplified approximation of the shallow water equations, which reduces the computational cost by ignoring the convective acceleration term (the second and third terms of the two equations in eq. 2) and utilising an explicit finite difference numerical scheme. lisflood-fp first calculates the flow between pixels from the mass balance [20]. for simplicity, we use the 1d version of the equations in the x-direction, shown in eq. 3; the 1d result is directly transferable to 2d due to the uncoupled nature of those equations [3]. then, for each pixel, its water level h is updated as in eq. 4. to sum up, for each pixel at location i, j, the solution derived from lisflood-fp can be written in the form shown in eq. 5, where h_{i,j}^t is the water level at location i, j at time step t, or in general as h_{t+1} = θ(d, h_t, i_t). however, the numerical solution θ is computationally expensive and includes assumptions about the hydrological process in a flood. there is also an enormous demand for parameter tuning of the numerical solution θ once high-resolution two-dimensional water level measurements are available, as mentioned in [36]. therefore, we use such a numerical model to generate pairs of inputs and outputs for the surrogate model. we choose the lisflood-fp based open-source python library landlab [2], since it is a popular simulator in regional two-dimensional flood studies. landlab includes tools and process components that can be used to create hydrological models over a range of temporal and spatial scales. in landlab, the rainfall and friction coefficients are considered to be spatially constant, and evaporation and infiltration are both temporally and spatially constant. the inputs of landlab are a dem and a time series of rainfall intensity; the output is a time series of water levels. we propose here that a neural network model can provide an alternative solution for such a complex hydrological dynamic process. neural networks are well known as collections of nonlinear connected units, which are flexible enough to model the complex nonlinear mechanisms behind the data [19]. moreover, a neural network can be easily implemented on general purpose graphics processing units (gpus) to boost its speed. in the numerical solution of the shallow water equations shown in subsect. 3.2, the two-dimensional spatial correlation is important to predict the water level in a flood. therefore, inspired by the capacity of neural networks to extract spatial correlation features, we intend to investigate whether a neural network model can learn the flood model l effectively and efficiently.
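since eqs. 3-5 are not reproduced here, the following toy numpy update only illustrates the general shape of such an explicit per-pixel scheme h_{t+1} = θ(d, h_t, i_t): water is exchanged between neighbouring cells according to the water-surface slope and rainfall is added everywhere. the conductance k, the step sizes and the clipping at zero depth are our own simplifications and do not correspond to the actual lisflood-fp or landlab scheme.

```python
import numpy as np

def simplified_step(dem, h, rain, dt=10.0, dx=30.0, k=1e-3):
    """One crude explicit mass-balance update of the water level h.

    Water flows between neighbouring cells towards the lower water-surface
    elevation, then every cell receives rainfall. This is a toy sketch of
    h_{t+1} = theta(d, h_t, i_t), not the lisflood-fp/landlab scheme; k is
    an arbitrary conductance chosen only to keep the toy update stable.
    """
    dem = np.asarray(dem, dtype=float)
    h = np.asarray(h, dtype=float)
    eta = dem + h                           # water-surface elevation
    dh = np.full_like(h, rain * dt)         # rainfall added everywhere
    for axis in (0, 1):                     # x- then y-direction
        grad = np.diff(eta, axis=axis) / dx          # eta[i+1] - eta[i]
        flow = -k * grad * dt                        # >0 means flow i -> i+1
        src = [slice(None)] * 2
        dst = [slice(None)] * 2
        src[axis], dst[axis] = slice(None, -1), slice(1, None)
        dh[tuple(src)] -= flow                       # upstream cell loses water
        dh[tuple(dst)] += flow                       # downstream cell gains it
    return np.maximum(h + dh, 0.0)          # water depth cannot be negative
```

the clipping at zero means the toy scheme is not strictly mass-conserving; the real simulator limits the inter-cell flux instead.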
we propose a small and flexible neural network architecture. in the numerical solution of eq. 5, the water level of each pixel at the next time step is only correlated with the surrounding pixels. therefore, we use, as input, a 3 × 3 sliding window on the dem together with the corresponding water levels and the rain at each time step t. the output is the corresponding 3 × 3 water level at the next time step t + 1. the pixels at the boundary follow different hydrological dynamic processes; therefore, we pad both the water level and the dem with zero values and expect the neural network model to learn the different hydrological dynamic processes at the boundaries. one advantage of our proposed architecture is that the neural network is not restricted by the input size of the terrain, for both training and testing; therefore, it is a general model that can be used for any terrain size. figure 1 illustrates the proposed architecture on a region of size 6 × 6. in this section, we empirically evaluate the performance of the proposed model. in subsect. 4.1, we describe how to generate synthetic dems. subsect. 4.2 presents the experimental setup to test our method on synthetic dems as a micro-evaluation. subsect. 4.3 presents the experimental setup for the case of the onkaparinga catchment. subsect. 4.4 presents the details of our proposed neural network. subsect. 4.5 presents the evaluation metrics for our proposed model. in order to generate synthetic dems, we modify alexandre delahaye's work. we arbitrarily set the size of the dems to 64 × 64 and their resolution to 30 metres. we generate three types of dems in our data set that resemble real-world terrain surfaces, as shown in fig. 2a: a river in a plain, a river with a mountain on one side and a plain on the other, and a river in a valley with mountains on both sides. we evaluate the performance in two cases. in case 1, the network is trained and tested with one dem; this dem has a river in a valley with mountains on both sides, as shown in fig. 2a (right). in case 2, the network is trained and tested with 200 different synthetic dems. the data set is generated with landlab. for all the flood simulations in landlab, the boundary condition is set to be closed on all four sides, which means that rainfall is the only source of water in the whole region. the roughness coefficient is set to 0.003. we control the initial process, rainfall intensity and duration time for each sample; the different initial processes ensure different initial water levels in the whole region. after the initial process, the system runs for 40 h with no rain for stabilisation. we then run the simulation for 12 h and record the water levels every 10 min; therefore, for one sample, we record a total of 72 time steps of water levels. table 1 summarises the parameters for generating samples in both case 1 and case 2. the onkaparinga catchment, located on the lower onkaparinga river, south of adelaide, south australia, has experienced many notable floods, especially in 1935 and 1951, and much research and many reports have addressed this region [15]. we obtained two dems with sizes 64 × 64 and 128 × 128 from the australian intergovernmental committee on surveying and mapping's elevation information system. figure 2b shows the dem of the lower onkaparinga river. we implement the neural network model for three further cases. in case 3, we train and test on the 64 × 64 onkaparinga river dem. in case 4, we test the 64 × 64 onkaparinga river dem directly with the model trained in case 2. in case 5, we test the 128 × 128 onkaparinga river dem directly with the model trained in case 2. we generate the data sets for both the 64 × 64 and the 128 × 128 dem from landlab; the initial process, rainfall intensity and rain duration time for both dems are controlled in the same way as in case 1. the architecture of the neural network model is visualised in fig. 1.
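as a concrete illustration of how training pairs for the 3 × 3 sliding-window model can be assembled from one simulator step, the sketch below pads the dem and water levels with zeros and extracts neighbourhoods; the stride of 1 and the exact channel layout are our assumptions, since the text does not spell them out.

```python
import numpy as np

def make_patches(dem, h_t, h_next, rain):
    """Build (input, target) pairs from one simulator step.

    Inputs are 3x3 neighbourhoods of the zero-padded DEM and water level
    plus the (spatially constant) rainfall; targets are the 3x3 water
    levels at t+1. Shapes and stride are illustrative, not the authors'
    exact pipeline.
    """
    d_pad = np.pad(dem, 1)          # zero padding, as described above
    h_pad = np.pad(h_t, 1)
    y_pad = np.pad(h_next, 1)
    xs, ys = [], []
    w, l = dem.shape
    for i in range(w):
        for j in range(l):
            d_win = d_pad[i:i + 3, j:j + 3]
            h_win = h_pad[i:i + 3, j:j + 3]
            xs.append(np.stack([d_win, h_win, np.full((3, 3), rain)]))
            ys.append(y_pad[i:i + 3, j:j + 3])
    return np.array(xs), np.array(ys)   # (w*l, 3, 3, 3) and (w*l, 3, 3)
```

because every sample is a fixed 3 × 3 neighbourhood, the same trained network can be applied to terrains of any size, which is the property exploited in cases 4 and 5 below.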
it first upsamples the rain input to 3 × 3 and concatenates it with the 3 × 3 water level input. this is followed by several batch normalisation and convolutional layers. the activation functions are relu, and all convolutional layers use same-size padding. the neural network has 169 parameters in total. the model is trained with adam with a learning rate of 10^-4 and a batch size of 8. the data set is split with a ratio of 8:1:1 into training, validation and testing sets. the number of training epochs is 10 for case 1 and case 3, and 5 for case 2. we train the neural network model on a machine with a 3 ghz amd ryzen 7 1700 8-core processor, 64 gb of ddr4 memory and an nvidia gtx 1080ti gpu card with 3584 cuda cores and 11 gb of memory; the operating system is ubuntu 18.04. in order to evaluate the performance of our neural network model, we use global metrics for the overall flood in the whole region, including the global mean squared error and the global mean absolute percentage error (mape). case 5 tests the scalability of our model to a dem of a different size. in table 2b, for global performance, the mape of case 5 is around 50% less than in both case 3 and case 4, and for local performance, the mape of case 5 is 34.45%. similarly, without retraining the existing model, the neural network trained in case 2 can be applied directly to a dem of a different size with good global performance. we present the time needed for the flood simulation of one sample in landlab and in our neural network model (without the training time) in table 3. the average time of the neural network model for a 64 × 64 dem is around 1.6 s, while it takes 47 s in landlab. furthermore, for a 128 × 128 dem, landlab takes 110 times more time than the neural network model. though the training of the neural network model is time consuming, the model can be reused, without further training or tuning, on terrains of different sizes and spatial characteristics, and it remains effective and efficient (fig. 4). we propose a neural network model, which is trained with pairs of inputs and outputs of an off-the-shelf numerical flood simulator, as an efficient and effective general surrogate model for the simulator. the trained network yields a mean absolute percentage error of around 20%, while being at least 30 times faster than the numerical simulator that is used to train it. moreover, it is able to simulate floods on terrains of different sizes and spatial characteristics not directly represented in the training. we are currently extending our work to take into account other meaningful environmental elements such as land coverage, geology and weather.
hec-ras river analysis system, user's manual, version 2
the landlab v1.0 overlandflow component: a python tool for computing shallow-water flow across watersheds
improving the stability of a simple formulation of the shallow water equations for 2-d flood modeling
a review of surrogate models and their application to groundwater modeling
learning data-driven discretizations for partial differential equations
a simple raster-based model for flood inundation simulation
a simple inertial formulation of the shallow water equations for efficient two-dimensional flood inundation modelling
rainfall-runoff modelling: the primer
hec-ras river analysis system hydraulic user's manual
numerical solution of the two-dimensional shallow water equations by the application of relaxation methods
distributed physics informed neural network for data-efficient solution to partial differential equations
integrating gis and hec-ras to model assiut plateau runoff
flood hydrology processes and their variabilities
application of surrogate artificial intelligent models for real-time flood routing
extreme flood estimation-guesses at big floods? water down under 94: surface hydrology and water resources papers
the data-driven approach as an operational real-time flood forecasting model
analysis of flood causes and associated socio-economic damages in the hindukush region
deep fluids: a generative network for parameterized fluid simulations
fully convolutional networks for semantic segmentation
optimisation of the two-dimensional hydraulic model lisflood-fp for cpu architecture
neural network modeling of hydrological systems: a review of implementation techniques
physics informed data driven model for flood prediction: application of deep learning in prediction of urban flood development
application of 2d numerical simulation for the analysis of the february 2014 bolivian amazonia flood: application of the new hec-ras version 5
physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations
storm water management model user's manual v. 5.0. us environmental protection agency
hydrologic engineering center hydrologic modeling system, hec-hms: interior flood modeling
decentralized flood forecasting using deep neural networks
flood inundation modelling: a review of methods, recent advances and uncertainty analysis
accelerating eulerian fluid simulation with convolutional networks
comparison of the arma, arima, and the autoregressive artificial neural network models in forecasting the monthly inflow of dez dam reservoir
lisflood: a gis-based distributed model for river basin scale water balance and flood simulation
real-time water-level forecasting using dilated causal convolutional neural networks
latent space physics: towards learning the temporal evolution of fluid flow
in-situ water level measurement using nir-imaging video camera
acknowledgment. this work is supported by the national university of singapore institute for data science project watcha: water challenges analytics. abhishek saha is supported by national research foundation grant number nrf2017vsg-at3dcm001-021.
key: cord-338127-et09wi82 authors: qin, bosheng; li, dongxiao title: identifying facemask-wearing condition using image super-resolution with classification network to prevent covid-19 date: 2020-09-14 journal: sensors (basel) doi: 10.3390/s20185236 sha: doc_id: 338127 cord_uid: et09wi82
the rapid worldwide spread of coronavirus disease 2019 (covid-19) has resulted in a global pandemic.
correct facemask wearing is valuable for infectious disease control, but the effectiveness of facemasks has been diminished, mostly due to improper wearing. however, there have not been any published reports on the automatic identification of facemask-wearing conditions. in this study, we develop a new facemask-wearing condition identification method by combining image super-resolution and classification networks (srcnet), which addresses a three-category classification problem based on unconstrained 2d facial images. the proposed algorithm contains four main steps: image pre-processing, facial detection and cropping, image super-resolution, and facemask-wearing condition identification. our method was trained and evaluated on the public medical masks dataset containing 3835 images, with 671 images of no facemask-wearing, 134 images of incorrect facemask-wearing, and 3030 images of correct facemask-wearing. the proposed srcnet achieved 98.70% accuracy and outperformed traditional end-to-end deep learning image classification methods without image super-resolution by over 1.5% in kappa. our findings indicate that the proposed srcnet can achieve high-accuracy identification of facemask-wearing conditions, thus having potential applications in epidemic prevention involving covid-19. coronavirus disease 2019 (covid-19) is an emerging respiratory infectious disease caused by severe acute respiratory syndrome coronavirus 2 (sars-cov-2) [1]. at present, covid-19 has quickly spread to the majority of countries worldwide, affecting more than 14.9 million individuals, and has caused 618,017 deaths, according to the report from the world health organization (who) on 23 july 2020 (https://covid19.who.int/). to avoid global tragedy, a practical and straightforward approach to preventing the spread of the virus is urgently desired worldwide. previous studies have found that facemask-wearing is valuable in preventing the spread of respiratory viruses [2] [3] [4]. for instance, the efficiencies of n95 and surgical masks in blocking the transmission of sars are 91% and 68%, respectively [5]. facemask-wearing can interrupt airborne viruses and particles effectively, such that these pathogens cannot enter the respiratory system of another person [6]. as a non-pharmaceutical intervention, facemask-wearing is a non-invasive and cheap method to reduce mortality and morbidity from respiratory infections. since the outbreak of covid-19, facemasks have been routinely used by the general public to reduce exposure to airborne pathogens in many countries [7]. in addition to patients suspected of actual infection with covid-19 being required to wear facemasks to prevent virus spreading, healthy persons also need to wear facemasks in order to protect themselves from infection [1]. facemasks, when fitted properly, effectively disrupt the forward momentum of particles expelled from a cough or sneeze, preventing disease transmission [5]. however, the effectiveness of facemasks in containing the spread of airborne diseases in the general public has been diminished, mostly due to improper wearing [8]. therefore, it is necessary to develop an automatic detection approach for facemask-wearing conditions, which can contribute to personal protection and public epidemic prevention. the distinctive facial characteristics in different facemask-wearing conditions provide an opportunity for automatic identification.
recent rapid technological innovations in deep learning and computer vision have presented opportunities for development in many fields [9, 10]. as the main component of deep learning methods, deep neural networks (dnns) have demonstrated superior performance in many fields, including object detection, image classification, image segmentation, and distancing detection [11] [12] [13] [14] [15] [16]. one primary model of dnns is the convolutional neural network (cnn), which has been widely used in computer vision tasks. after training, cnns can recognize and classify facial images, even with slight differences, due to their powerful feature extraction capability. as one type of cnn, image super-resolution (sr) networks can restore image details. recently, sr networks have become deeper, and the ideas of auto-encoders and residual learning have been integrated for performance improvement [17, 18]. sr networks have also been applied for image processing before image segmentation or classification, reconstructing images at higher resolution and restoring details [19] [20] [21] [22] [23]. moreover, sr networks can improve classification accuracy significantly, especially on datasets with low-quality images, and provide a feasible solution to improve facemask-wearing condition identification performance. therefore, the combination of an sr network with a classification network (srcnet) could be utilized in facial image classification for accuracy improvement. to our knowledge, there have not been any published reports related to sr networks combined with classification networks for accuracy improvement in facial image classification, especially regarding the automatic detection of facemask-wearing conditions. therefore, we intend to develop a novel method combining an sr network with a classification network (srcnet) to identify facemask-wearing conditions, in order to improve classification accuracy with low-quality facial images. our main contributions can be summarized as follows. (1) development of a new face accessory identification method that combines an sr network with a classification network (srcnet) for facial image classification. (2) utilization of a deep learning method for automatic identification of facemask-wearing conditions; to our knowledge, this is the first time a deep learning method has been applied to identifying facemask-wearing conditions. (3) improvement of the sr network structure, including the activation functions and the density of skip connections, which outperforms previous state-of-the-art methods. the idea of reconstructing high-quality images from low-resolution images has a long history. bicubic interpolation was one of the most widely used methods, which up-samples low-resolution images by cubic interpolation along both the x-axis and the y-axis. yang, et al. [24] presented an sr method based on sparse representation, which used sparse representations for each patch of the low-resolution input and then calculated the coefficients to generate a high-resolution output. the example-based sr method was introduced by timofte, et al. [25], which reconstructs images based on a dictionary of low-resolution and high-resolution exemplars. recently, deep learning methods have also been introduced for sr [26] [27] [28] [29] [30] [31] [32] [33] [34]. dong, et al.
[35] first presented the srcnn, which utilized a three-layer cnn for image super-resolution, after which more high-performance network structures were introduced for sr, such as vdsr [29] and red [17]; vdsr increases the depth of the cnn used for sr and proposes residual learning for fast training, while red introduces symmetric convolutional and deconvolutional layers with skip connections for better performance. deep learning methods have outperformed traditional image classification methods in many aspects, especially using the cnn algorithm. tuning cnns for better accuracy has been an area of intensive research over the past several years, and several high-performance cnn architectures (e.g., alexnet [13], vggnet [36], googlenet [37], and resnet [38]) have been introduced. recently, the tuning of cnns has progressed in two separate ways: one line of research draws representational power from deeper or wider architectures by increasing the number of trainable parameters (e.g., inception-v4 [39], xception [40], and densenet [41]), while other research has focused on building small and efficient cnns due to limitations in computational power (e.g., mobilenet [42], mobilenet-v2 [43], shufflenet [44], and squeezenet [45]). all of these network structures outperformed traditional machine learning methods, such as histogram of oriented gradients (hog)-based support vector machines (svm) [46] and k-nearest neighbors (knn), in classification tasks on either the imagenet classification dataset [47] or the cifar-10 classification dataset [48]. as cnns have become deeper and wider, overfitting problems have arisen, mainly due to the limitations of datasets, which are detrimental to the generalization of networks. one way to prevent overfitting is to change the architecture of the neural network, for example, by adding dropout layers [49]. some studies have focused on the hyper-parameters in the training options and on adding regularization terms [38, 50, 51]. data augmentations such as random rotation, random cropping, and random reflections have also been widely applied to prevent overfitting [13, 38, 41, 42, 52]. upscaling low-quality (low-resolution, blurred) input images to produce high-quality feature maps is one of the most popular ways to improve low-quality image classification or object detection [53] [54] [55] [56] [57]. na, et al. [56] introduced an sr method on cropped regions or candidates to improve object detection and classification performance. cai, et al. [57] developed a resolution-aware convolutional deep model combining super-resolution and classification. sr has also been applied in facial recognition. zou, et al. [58] adopted sr to improve facial recognition performance on low-resolution images, showing that combining sr with a facial recognition model increases recognition performance. uiboupin, et al. [59] adopted sr using sparse representation for improving face recognition in surveillance monitoring. however, these sr methods for improving face recognition accuracy are based either on facial features or on high-level representations. there have not been any published reports related to deep-learning-based sr networks combined with classification networks for accuracy improvement in facial image classification, especially regarding the automatic detection of facemask-wearing conditions.
hence, srcnet, combining an sr network and a classification network, is proposed and utilized in facial recognition. there is plenty of research using image features, machine learning, or deep learning methods for face accessory detection, especially in the area of glasses and hat detection. jing, et al. [60] used image edge information in a small area between the eyes for glasses detection. machine learning methods like svm and knn have also been widely applied in face accessory detection [61] [62] [63]. recently, deep learning methods have become more prevalent in face accessory detection, where high-level and abstract information can be extracted through cnns [64, 65]. however, although facemasks are one of the most common face accessories, there is a paucity of work on automatic facemask-wearing condition identification, especially using deep learning methods. hence, srcnet is proposed to identify facemask-wearing conditions, which has application value, especially in epidemic prevention involving covid-19. this section describes the technology behind srcnet and facemask-wearing condition identification, including the proposed algorithm, image pre-processing, facial detection and cropping, the sr network, the facemask-wearing condition identification network, the datasets, and the training details. facemask-wearing condition identification is a three-category classification problem, with the categories no facemask-wearing (nfw), incorrect facemask-wearing (ifw), and correct facemask-wearing (cfw). our goal is to form a facemask-wearing condition identification function, fwi(x), which takes an unprocessed image as input and outputs the facemask-wearing condition of all faces in the image. figure 1 offers a diagram of the proposed algorithm, which contains three main steps: image pre-processing, facial detection and cropping, and srcnet for sr and facemask-wearing condition identification. after the pre-processing of raw images, all facial areas in an image are detected using a multitask cascaded convolutional neural network [12]. the facial areas are then cropped, and the sizes of the cropped images vary. all cropped images are then sent to srcnet for facemask-wearing condition identification. in srcnet, each image is first judged for the need of sr: as the input size of the facemask-wearing condition identification network is 224 × 224, cropped images with a size no larger than 150 × 150 (i.e., width or length no more than 150) are sent to the sr network first and then for facemask-wearing condition identification; otherwise, the cropped images are sent directly for facemask-wearing condition identification. the output is the probabilities of the input image with respect to the three categories nfw, ifw, and cfw. after passing through the classifier, the pipeline outputs the final facemask-wearing condition results.
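the decision flow just described can be summarized by the schematic below; preprocess, detect_faces, crop, resize, sr_net, and classifier are placeholders for the components detailed in the following subsections, so this is a sketch of the control flow rather than runnable production code.

```python
def identify_facemask_conditions(image):
    """Schematic of the srcnet pipeline: pre-process, detect faces,
    route small crops through super-resolution, then classify."""
    labels = ("nfw", "ifw", "cfw")
    results = []
    image = preprocess(image)                    # contrast adjustment
    for box in detect_faces(image):              # multitask cascaded cnn
        face = crop(image, box)
        h, w = face.shape[:2]
        if w <= 150 or h <= 150:                 # "width or length no more than 150"
            face = sr_net(resize(face, (224, 224)))
        else:
            face = resize(face, (224, 224))
        probs = classifier(face)                 # mobilenet-v2 + softmax
        results.append((box, labels[int(probs.argmax())]))
    return results
```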
the goal of image pre-processing is to improve the accuracy of the subsequent facial detection and facemask-wearing condition identification steps. srcnet is designed to be applied in public settings for classification, taking uncontrolled 2d images as input. raw images taken in real life have considerable variance in contrast and exposure, so image pre-processing is needed to ensure the accuracy of facial detection and facemask-wearing condition identification [66]. from our experiments, the face detector is likely to make errors when images are underexposed. the raw images were therefore adjusted, using the matlab image processing toolbox, by mapping the values of the input intensity image to new values such that 1% of the values are saturated at the low and high intensities of the input data. the image pre-processing diagram and the corresponding histogram are illustrated in figure 2.
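a rough python equivalent of this adjustment, assuming the 1% saturation is applied per colour channel, could look as follows; it approximates the matlab imadjust/stretchlim behaviour described above rather than reproducing it exactly.

```python
import numpy as np

def stretch_contrast(img, low_pct=1.0, high_pct=99.0):
    """Percentile-based intensity rescaling: values below the 1st and above
    the 99th percentile are saturated, the rest are linearly mapped to [0, 1].
    An approximation of the toolbox call described above, not a reproduction."""
    img = img.astype(np.float32)
    out = np.empty_like(img)
    for c in range(img.shape[2]):                       # per colour channel
        lo, hi = np.percentile(img[..., c], (low_pct, high_pct))
        chan = (img[..., c] - lo) / max(hi - lo, 1e-6)
        out[..., c] = np.clip(chan, 0.0, 1.0)
    return out                                          # values in [0, 1]
```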
as srcnet needs to concentrate on the information from faces rather than the background in order to improve accuracy, a face detector is needed to detect faces and crop the facial areas. the uncontrolled 2d images have differences in face size, expression, and background; hence, a robust and highly accurate face detector is needed. the multitask cascaded convolutional neural network was adopted for facial detection, as it has been shown to perform well in obtaining facial areas in real environments [12]. after obtaining the position of a face, the face is cropped from the pre-processed image to serve as the input of the sr network or of the facemask-wearing condition identification network, depending on the image size. images no larger than 150 × 150 (width or length no more than 150) were first input to the sr network and then passed on for facemask-wearing condition identification; other cropped facial images were sent directly to the facemask-wearing condition identification network. examples of cropped images are shown in figure 3.
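for illustration, a minimal detection-and-cropping step with one publicly available mtcnn implementation (the pip mtcnn package, used here as a stand-in for the detector of [12]) might look as follows; the size-based routing itself is left to the caller.

```python
import cv2
from mtcnn import MTCNN   # one publicly available mtcnn implementation

detector = MTCNN()

def crop_faces(image_bgr):
    """Detect faces and return the cropped regions together with their sizes,
    so the caller can route small crops through the sr network first.
    This is an illustrative stand-in, not the authors' detection code."""
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    crops = []
    for det in detector.detect_faces(rgb):
        x, y, w, h = det["box"]
        x, y = max(x, 0), max(y, 0)            # boxes can start off-image
        face = rgb[y:y + h, x:x + w]
        crops.append((face, (w, h)))
    return crops
```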
hence, the convolutional layers act as an auto-encoder and extract features from the input image. in the deconvolutional layers, the number of output feature maps is symmetric to the corresponding convolutional layers, in order to satisfy the skip connections. the number of kernels in every deconvolutional layer decreases by a factor of ½ (except for the final deconvolutional layer), while, with kernels size 4 × 4 and stride 2, the size of feature maps increases by a factor of 2. after information combination in the final deconvolutional layer, the output is an image with the same size as the input image. the deconvolutional layers act as a decoder, which take the output of the encoder as input and up-sample them to obtain a super-resolution image. the first stage of srcnet is the sr network. the cropped facial images have a huge variance in size, which could possibly damage the final identification accuracy of srcnet. hence, sr is applied before classification. the structure of the sr network was inspired by red [17] , which uses convolutional layers as an auto-encoder and deconvolutional layers for image up-sampling. symmetric skip connections were also applied to preserve image details. the detailed architectural information of the sr network is shown in figure 4 . the first stage of srcnet is the sr network. the cropped facial images have a huge variance in size, which could possibly damage the final identification accuracy of srcnet. hence, sr is applied before classification. the structure of the sr network was inspired by red [17] , which uses convolutional layers as an auto-encoder and deconvolutional layers for image up-sampling. symmetric skip connections were also applied to preserve image details. the detailed architectural information of the sr network is shown in figure 4 . the sr network has five convolutional layers and six deconvolutional layers. except for the final deconvolutional layer, all other convolutional layers are connected to their corresponding convolutional layers by skip connections. with skip connections, the information is propagated from convolutional feature maps to the corresponding deconvolutional layers and from input to output. the network is then fitted by solving the residual of the problem, which is denoted as where is the ground truth, is the input image, and ( ) is the function of the sr network for the ℎ image. in convolutional layers, the number of kernels was designed to increase by a factor of 2. with kernels size 4 × 4 and stride 2, after passing through the first convolutional layer for feature extraction, every time the image passes through a convolutional layer, the size of the feature maps decreases by a factor of ½ . hence, the convolutional layers act as an auto-encoder and extract features from the input image. in the deconvolutional layers, the number of output feature maps is symmetric to the corresponding convolutional layers, in order to satisfy the skip connections. the number of kernels in every deconvolutional layer decreases by a factor of ½ (except for the final deconvolutional layer), while, with kernels size 4 × 4 and stride 2, the size of feature maps increases by a factor of 2. after information combination in the final deconvolutional layer, the output is an image with the same size as the input image. the deconvolutional layers act as a decoder, which take the output of the encoder as input and up-sample them to obtain a super-resolution image. the sr network has five convolutional layers and six deconvolutional layers. 
it is worth mentioning that the function used for down- and up-sampling is the stride in the convolutional and deconvolutional layers, rather than pooling and un-pooling layers, as the aim of the sr network is to restore image details rather than to learn abstractions (pooling and un-pooling layers damage the details of images and deteriorate the restoration performance [17]). the function of the final deconvolutional layer is to combine all the information from the previous deconvolutional layer and the input image and to normalize all pixels to [0, 1] as the output. the stride of the final deconvolutional layer was set to 1, for information combination without up-sampling. the activation function of the final deconvolutional layer is the clipped rectified linear unit, which forces normalization of the output and avoids errors in computing the loss. the clipped rectified linear unit is defined as clippedrelu(x) = min(max(0, x), 1), where x is the input value. one main difference between our model and red is the improvement in the activation functions, which were changed from the rectified linear unit (relu) to the leaky rectified linear unit (leakyrelu) for all convolutional and deconvolutional layers except the final deconvolutional layer, which uses the clipped rectified linear unit as its activation function to limit values to the range [0, 1]. previous studies have shown that different activation functions have an impact on the final performance of a cnn [67, 68]; hence, the improvement in the activation functions contributes to better image restoration by the sr network. the relu and leakyrelu are defined as relu(x) = max(0, x) and leakyrelu(x) = x for x > 0 and αx for x ≤ 0,
the leakyrelu function, however, can activate neurons with negative values, thus improving the performance of the network. another difference is the density of skip connections. rather than using skip connections every few (e.g., two in red) layers from convolutional layers to their symmetrical deconvolutional feature maps, the density of skip connections increased, and all convolutional layers were connected to their mirrored deconvolutional layers. the reason for this was to cause all layers to learn to solve the residual problem, which reduced the loss of information between layers while not significantly increasing the network parameters. the goal of sr network training is to update all learnable parameters to minimize the loss. for sr networks, the mean squared error (mse) is widely used as the loss function [17, 27, 34, 35] . a regularization term (weight decay) for the weights is added to the mse loss to reduce overfitting. the mes with l 2 regularization was applied as the loss function loss(w), which is defined as where gt i is the ground truth, o i is the output image, and loss(w) is the loss for collections of given w. it is worth mentioning that the size of the input image can be arbitrary and that the output image has the same size as the input image. the convolutional and deconvolutional layers are symmetric for the sr network. furthermore, the network is predicted pixel-wise. for better detail enhancement, we chose a dedicated image input size for sr network training and, so, the input images were resized to 224 × 224 × 3 with bicubic interpolation (which was the same as the input image size of the facemask-wearing condition identification network). the output of the sr network is enhanced images with the same size of the inputs (224 × 224 × 3), and the enhanced images will be sent directly to the facemask-wearing condition identification network for classification. the second stage of srcnet is facemask-wearing condition identification. as cnns are one of the most common types of network for image classification, which perform well in facial recognition, a cnn was adopted for the facemask-wearing condition identification network in the second stage of srcnet. the goal was to form a function g(fi), where fi is the input face image, which outputs the probabilities of the three categories (i.e., nfw, ifw, and cfw). the classifier then outputs the classification result based on the output possibilities. mobilenet-v2 was applied as the facemask-wearing condition identification network, which is a lightweight cnn that can achieve high accuracy in image classification. the main features of mobilenet-v2 are residual blocks and depthwise separable convolution [42, 43] . the residual blocks contribute to the training of the deep network, addressing the gradient vanishing problem and achieving benefits by back-propagating the gradient to the bottom layers. as for facemask-wearing condition identification, there are slight differences between ifw and cfw. hence, the capability of feature extraction or the depth of the network are essential, contributing to the final identification accuracy. depthwise separable convolution is applied for the reduction of computation and model size while maintaining the final classification accuracy, which separable convolution splits into two layers: one layer for filtering and another layer for combining. transfer learning is applied in the network training procedure, which is a kind of knowledge migration between the source and target domains. 
the network is trained in three steps: initialization, forming a general facial recognition model, and knowledge transfer to facemask-wearing condition identification. the first step is initialization, which contributes to the final identification accuracy and the training speed [38, 69]. then, a general facial recognition model is formed using a large facial image dataset, from which the network gains the capability of facial feature extraction. after seeing millions of faces, the network concentrates on facial information rather than on interference from backgrounds and on differences caused by image shooting parameters. the final step is knowledge transfer between facial recognition and facemask-wearing condition identification, where the final fully connected layer is modified to meet the category requirements of facemask-wearing condition identification. the reason for adopting transfer learning is the considerable difference in data volumes and its consequences: the facemask-wearing condition identification dataset is relatively small compared to general facial recognition datasets, which may cause overfitting problems and a reduction in identification accuracy during the training process. hence, the network gains knowledge about faces in the general facial recognition model training process, which reduces overfitting and improves accuracy. the final stage of the classifier is the softmax function, which calculates the probabilities of all classes using the outputs of its direct ancestor (i.e., the fully connected layer neurons) [70]. it is defined as p_i = exp(x_i) / Σ_j exp(x_j), where x_i is the total input received by unit i and p_i is the predicted probability that the image belongs to class i. the training goal of the facemask-wearing condition identification network is to minimize the cross-entropy loss with weight decay. for image classification, cross-entropy is widely used as the loss function [71, 72], and a regularization term (weight decay) can help to significantly reduce overfitting. hence, cross-entropy with l2 regularization was applied as the loss function loss_r, defined as loss_r = −(1/n) Σ_{i=1..n} Σ_{j=1..k} t_ij log(y_ij) + λ ||w||_2^2. for the cross-entropy term, n is the number of samples, k is the number of classes, t_ij is the indicator that the i-th sample belongs to the j-th class (which is 1 when the labels correspond and 0 when they differ), and y_ij is the output for sample i and class j, which is the output value of the softmax layer. for the regularization term, w denotes the learned parameters in every learned layer and λ is the regularization factor (coefficient). different facial image data sets were used for the different network training stages to improve the generalization ability of srcnet. the public facial image dataset celeba was processed and used for sr network training [73]. as the goal of the sr network is detail enhancement, a large and high-resolution facial image data set was needed, and celeba meets these requirements. the processing of celeba included three steps: image pre-processing, facial detection and cropping, and image selection. all raw images were pre-processed as mentioned above. the facial areas were then detected by the multitask cascaded convolutional neural network and cropped for training, as the sr network is designed to restore detailed information of faces rather than of the background. cropped images that were smaller than 224 × 224 (i.e., the input size of the facemask-wearing condition identification network) or non-rgb images were discarded automatically.
all other cropped facial images were inspected manually, and images with blur or dense noise were also discarded. finally, 70,534 high-resolution facial images were split into a training set (90%) and a testing set (10%) and adopted for sr network training and testing. training of the facemask-wearing condition identification network comprised three steps, each using a different dataset. for initialization, the goal was generalization; a large-scale classification dataset was needed, so the imagenet dataset was adopted for network initialization [13]. during this procedure, non-zero values were assigned to the parameters, which increased the generalization ability; furthermore, proper initialization significantly improves the training speed and provides a better starting point for the general facial recognition model. the general facial recognition model was trained with a large-scale facial recognition database, the casia webface facial dataset [74]. all images were screened manually, and those containing insufficient subjects or with poor image quality were discarded [74]. finally, the large-scale facial recognition dataset contained 493,750 images of 10,562 subjects, which were split into a training set (90%) and a testing set (10%); the training set was used for general facial recognition model training. the public facemask-wearing condition dataset medical masks dataset (https://www.kaggle.com/vtech6/medical-masks-dataset) was used for fine-tuning the network, in order to transfer knowledge from general facial recognition to facemask-wearing condition identification. its 2d rgb images were taken in uncontrolled environments, and all faces in the dataset are annotated with position coordinates and facemask-wearing condition labels. the medical masks dataset was processed in four steps: facial cropping and labeling, label confirmation, image pre-processing, and sr. all faces were cropped and labeled using the given position coordinates and labels. all cropped facial images were then screened manually, and those with incorrect labels were discarded. the facial images were then confirmed and pre-processed using the methods mentioned in section 3.2. to improve the final accuracy of srcnet, the dataset was expanded with additional images for the not-wearing-facemask case. the resolution of the pre-processed images varied, as shown in table 1. to improve the accuracy of the facemask-wearing condition identification network, a facial image must contain enough detail; hence, the sr network was applied to add detail to low-quality images. images no larger than 150 × 150 (i.e., width or length no more than 150) were processed using the sr network. finally, the dataset contained 671 images of nfw, 134 images of ifw, and 3030 images of cfw. the whole dataset was separated into a training set (80%) and a testing set (20%) for facemask-wearing condition identification network training and testing. the training of srcnet contained two main steps: sr network training and facemask-wearing condition identification network training. for sr network training, the goal was to restore facial details, for which we used the training set of celeba. based on the characteristics of the medical masks dataset, the input images were pre-processed to imitate the low-quality images in the medical masks dataset: the high-resolution images in celeba were first filtered with a gaussian filter with a kernel size of 5 × 5 and a standard deviation of 10.
then, they were down-sampled to 112 × 112. as the sizes of the input and output are the same, the down-sampled images were then up-sampled to 224 × 224 with bicubic interpolation to serve as inputs, matching the input size of the facemask-wearing condition identification network. adam was adopted as the optimizer, with β1 = 0.9, β2 = 0.999, and ϵ = 10^-8 [75]. the network was trained for 200 epochs with an initial learning rate of 10^-4 and a learning rate dropping factor of 0.9 every 20 epochs. the mini-batch size was 48. the first step of facemask-wearing condition identification network training was initialization: the network was trained on the imagenet dataset, with the training parameters proposed in [43]. the second step was to form a general facial recognition model. the output classes were modified to match the number of classes (10,562). for initialization, the weight and bias in the final modified fully connected layer were initialized using a normal distribution with zero mean and 0.01 standard deviation. the network was trained for 50 epochs, with the training dataset shuffled in every epoch. to increase the training speed, the learning rate was dropped by a factor of 0.9 every 6 epochs from an initial learning rate of 10^-4, which alleviated the problem of the loss plateauing. the network was trained using adam as the optimizer, with β1 = 0.9, β2 = 0.999, ϵ = 10^-8, and a 10^-4 weight decay for l2 regularization, in order to avoid overfitting [75]. transfer learning was applied for fine-tuning the facemask-wearing condition identification network, where the final fully connected layer and classifier were modified to match the classes (nfw, ifw, and cfw). the weights and biases in the final modified layer were initialized by independently sampling from a normal distribution with zero mean and 0.01 standard deviation, which produced superior results compared to other initializers. adam was chosen as the optimizer, with the learning rate set to 10^-4. to avoid overfitting, a 10^-4 weight decay for l2 regularization was also applied [75]. the batch size was set to 16 and the network was trained for 8 epochs in total. the grid search method was applied to search for the best combination of all the parameters mentioned above, in order to improve the performance of the facemask-wearing condition identification network. data augmentation can reduce overfitting and contributes to the final accuracy of the network [36, 52, 76]. to train the general facial recognition network, the training images were randomly rotated within a range of 10° (normally distributed), shifted vertically and horizontally by up to 8 pixels, and horizontally flipped in every epoch. during the fine-tuning stage, the augmentation was milder, with rotation within 6° (normally distributed), shifting by up to 4 pixels (vertically and horizontally), and a random horizontal flip in every epoch. srcnet was implemented in matlab with the deep learning and image processing toolboxes for network training and image processing. a single nvidia graphics processing unit (gpu) with the nvidia cuda deep neural network library (cudnn) and compute unified device architecture (cuda) was used to implement srcnet. for sr networks, the most widely used full-reference quality metrics are peak signal-to-noise ratio (psnr) and structural similarity index (ssim) [77].
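the degradation used to synthesize low-quality training inputs from celeba can be summarized in a few lines; the sketch below uses opencv and is an illustrative reimplementation of the described steps (gaussian blur with a 5 × 5 kernel and σ = 10, down-sampling to 112 × 112, bicubic up-sampling back to 224 × 224), not the authors' matlab code, and the file names are placeholders.

```python
import cv2

def degrade(image_224):
    """Simulate a low-quality input from a 224x224x3 high-resolution face crop."""
    # blur with a 5x5 Gaussian kernel and standard deviation 10
    blurred = cv2.GaussianBlur(image_224, (5, 5), 10)
    # lose detail by down-sampling to 112x112
    low_res = cv2.resize(blurred, (112, 112), interpolation=cv2.INTER_AREA)
    # return to the network input size with bicubic interpolation
    return cv2.resize(low_res, (224, 224), interpolation=cv2.INTER_CUBIC)

# usage: pairs (degraded, original) form the SR training samples
img = cv2.imread("face_crop.jpg")  # hypothetical 224x224x3 crop
if img is not None:
    cv2.imwrite("face_crop_degraded.jpg", degrade(img))
```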
the psnr was used as the metric for quantitatively evaluating image restoration quality, while ssim compares local patterns of pixel intensities in terms of luminance and contrast. comparisons with previous state-of-the-art methods, including red [17], srcnn [35], vdsr [29], lanczos [78], and bicubic interpolation, were made to illustrate the performance of the proposed sr network. all learning-based methods were trained on the training set of celeba, and all methods were tested on the testing set. as in real applications the quality of images varies, and low-quality images mainly differ in resolution and blur, different-quality images were simulated for testing the performance of the sr network by changing the standard deviation σ and the kernel size of the gaussian filters, as well as the resolution. for testing with different standard deviations, the testing set was first filtered with gaussian filters with a kernel size of 5 × 5 and standard deviations of 5, 10, 15, and 20, and then down-sampled to 112 × 112 for evaluation. for testing with different kernel sizes, the testing set was first filtered with gaussian filters with kernel sizes of 3 × 3, 5 × 5, 7 × 7, and 9 × 9 and a standard deviation of 10, then down-sampled to 112 × 112 for evaluation. for testing with different image resolutions, the testing set was first filtered with a gaussian filter with a kernel size of 5 × 5 and a standard deviation of 10, then down-sampled to 64 × 64, 96 × 96, 112 × 112, and 150 × 150 for evaluation. the sizes of the input images were the same as the outputs of the sr network. to evaluate the effect of the sr network on the facemask-wearing condition identification network, which takes 224 × 224 images as input, all down-sampled testing sets were up-sampled to 224 × 224 as input to the sr network. the evaluation results are shown in tables 2-4. compared to previous state-of-the-art methods, the proposed sr network performed better, especially in terms of ssim. as can be observed from table 4, once the image size reached 150 × 150, the performance of the network decreased. the reason is that the network was trained to restore blurred images with low resolution; as the image resolution increases, the resolution and detail of the facial images increase, which no longer matches the conditions the network was trained for. hence, only images no larger than 150 × 150 (width or length no more than 150) were processed with the sr network; in this range, the sr network significantly outperformed bicubic interpolation. as the images in the medical masks dataset vary considerably in resolution, the sr network had to perform well under different resolutions. hence, different sr methods were compared and visualized on small and blurred images of different resolutions [79]. the testing image was first blurred with a gaussian filter with a kernel size of 5 × 5 and a standard deviation of 10, then down-sampled to 64 × 64, 96 × 96, 112 × 112, and 150 × 150, respectively, before restoration. the visualized results are shown in figure 5. although all sr methods enhanced facial details, the proposed sr network outperformed the other methods at all resolutions, and the images it restored were closer to the ground truth, consistent with its higher psnr and ssim values.
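the two full-reference metrics can be computed in a few lines; the sketch below is an illustration (psnr implemented directly, ssim delegated to scikit-image, assuming a recent version that accepts channel_axis), not the evaluation code used in the paper.

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_value ** 2) / mse)

def ssim(reference, restored):
    """Structural similarity; channel_axis=-1 assumes scikit-image >= 0.19."""
    return structural_similarity(reference, restored, channel_axis=-1, data_range=255)

# usage (illustrative): compare a restored face crop against its ground truth
rng = np.random.default_rng(0)
gt = rng.integers(0, 256, (224, 224, 3), dtype=np.uint8)
out = np.clip(gt.astype(int) + rng.integers(-5, 6, gt.shape), 0, 255).astype(np.uint8)
print(psnr(gt, out), ssim(gt, out))
```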
to illustrate the advantages of, and the reason for, using mobilenet-v2 as the facemask-wearing condition identification network, comparisons with other cnns, including inception-v3 [80], densenet201 [41], resnet50 [38], darknet19 [81], xception [40], and vgg19 [36], were conducted in terms of network parameters and running time for a single image, as shown in table 5. generally, the performance of a network increases with its depth. mobilenet-v2 showed great performance for real-time identification, with low storage space and running time, and it is also relatively deep, which contributed to its final performance in identifying facemask-wearing condition. in our experiments, mobilenet-v2 did not show a performance decrease compared to the other networks: with the sr network, the gap in final facemask-wearing condition identification accuracy was less than 1%. all experiments were conducted with matlab 2020a, an i7 cpu, and a p600 gpu with 4 gb of memory. running time: average time consumed to classify a single image, i.e., the image processing time of the neural network; image and network loading times are not taken into account. all networks performed the image classification task 20 times on the same batch of images, and the measured times were averaged. after training, srcnet was tested using the testing set of the medical masks dataset. the proposed algorithm was also tested using an ablation experiment. the comparison in accuracy and the confusion matrix of srcnet are reported below. the ablation experiment was designed to illustrate the importance of transfer learning and the sr network: the performance of srcnet with and without transfer learning and the proposed sr network was compared, as shown in table 6. transfer learning and the sr network increased the identification accuracy considerably, by reducing overfitting and increasing facial detail, respectively. finally, srcnet reached an accuracy of 98.70% and outperformed mobilenet-v2 without transfer learning or the sr network by over 1.5% in kappa. 1: proposed srcnet. 2: proposed srcnet without the sr network, i.e., an end-to-end facemask-wearing condition identification network with transfer learning. 3: proposed srcnet without transfer learning. 4: proposed srcnet without transfer learning or the sr network, i.e., an end-to-end facemask-wearing condition identification network without transfer learning.
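the running-time measurement described above (classify the same batch 20 times and average, excluding loading time) can be sketched as follows; the predict function and the image batch are placeholders, not the paper's matlab code.

```python
import time

def average_inference_time(predict, images, repeats=20):
    """Average per-image classification time over `repeats` passes.

    predict: callable that classifies a whole batch of images
    images: batch reused for every repetition (loading happens beforehand)
    """
    predict(images)                      # warm-up pass, excluded from timing
    start = time.perf_counter()
    for _ in range(repeats):
        predict(images)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(images))

# usage (illustrative): dummy predictor over a list of pre-loaded images
dummy_images = [object()] * 32
print(average_inference_time(lambda batch: [0 for _ in batch], dummy_images))
```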
all other settings, including hyperparameters, dataset, and implementation details, remained the same. accuracy: accuracy in the three-category classification (nfw, ifw, and cfw). facemask-wearing accuracy: accuracy in detecting whether a facemask is worn (facemask-wearing vs. nfw). personal protection accuracy: accuracy in the two-category classification of having good personal protection (failing to have personal protection, i.e., nfw and ifw, vs. having personal protection). κ: kappa in the three-category classification. the confusion matrices were measured and are shown in figure 6. the testing dataset contained facial images of nfw, ifw, and cfw. the proposed method correctly classified 767 images (with only 10 prediction errors), thus outperforming the variants without transfer learning or the sr network in every category. the identification results for different facemasks are illustrated in table 7. there are generally two types of facemasks: medical surgical masks and basic cloth face masks, which sit close to the face; and folded facemasks and the n95 type, which leave some space between the face and the facemask. an example of these two types of facemasks is shown in figure 7. the results show that srcnet identifies the wearing conditions of different types of facemasks with high accuracy. 1: folded facemasks and n95 type. 2: medical surgical masks and basic cloth face masks. accuracy: accuracy in the three-category classification (nfw, ifw, and cfw). facemask-wearing accuracy: accuracy in detecting whether a facemask is worn (facemask-wearing vs. nfw). personal protection accuracy: accuracy in the two-category classification of having good personal protection (failing to have personal protection, i.e., nfw and ifw, vs. having personal protection). κ: kappa in the three-category classification. the performance of srcnet on facemasks of different colors was also measured, as shown in table 8. blue, white, and black are the three most common facemask colors, while some masks have other colors, such as green or gray, or are patterned. srcnet can identify facemask-wearing condition across different facemask colors, which indicates that srcnet is robust. examples of identification results are shown in figures 7 and 8. although the face positions and types of facemasks vary, srcnet correctly identified all facemask-wearing conditions with high confidence. as analyzed from the failed cases, critical states (facemask wearing between cfw and ifw), image quality, and blocked faces were the three main reasons for identification errors.
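for readers unfamiliar with the kappa statistic reported throughout, the sketch below shows how overall accuracy and cohen's kappa can be derived from a 3 × 3 confusion matrix over (nfw, ifw, cfw); the example matrix is made up for illustration and is not the data of figure 6.

```python
import numpy as np

def accuracy_and_kappa(confusion):
    """Overall accuracy and Cohen's kappa from a square confusion matrix.

    confusion[i, j] = number of samples with true class i predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=np.float64)
    total = confusion.sum()
    observed = np.trace(confusion) / total                               # p_o
    expected = (confusion.sum(0) * confusion.sum(1)).sum() / total ** 2  # p_e
    return observed, (observed - expected) / (1.0 - expected)

# illustrative 3x3 matrix in the order (nfw, ifw, cfw)
cm = [[130, 2, 2],
      [1, 24, 2],
      [0, 3, 613]]
acc, kappa = accuracy_and_kappa(cm)
print(f"accuracy={acc:.4f}, kappa={kappa:.4f}")
```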
the ways of wearing a facemask form a continuum, while the classification results are discrete; hence, critical states were one of the main causes of misidentification. besides, when the image quality was low (e.g., low resolution, blocking artifacts, ringing effects, and blurring) or when the faces were partly occluded by objects or other faces, srcnet had a higher error rate. in addition, srcnet was likely to make biased errors when the color of a facemask was close to the facial skin color. the average prediction time of srcnet was also measured: 0.03 s for a single face (0.013 s for the sr network and 0.017 s for the facemask-wearing condition identification network) when implemented with matlab 2020a, an i7 cpu, and a p600 gpu with 4 gb of memory. although the sr process is time-consuming, it improved srcnet's performance, especially in extreme situations. the processing time for a whole image depends on the number of faces it contains, but it is generally much shorter than the sum of the per-face times, as parallel tools can be used to shorten it; the average processing time for a single image with around six faces was about 0.1 s. the comparison results with non-deep-learning methods, including svm with hog features and knn [46, 82], are shown in table 9. srcnet outperformed these methods in both accuracy and kappa. in addition, svm and knn performed better on images processed by the sr network, which is consistent with the results of the srcnet ablation experiment. our study presented a novel algorithm to identify facemask-wearing condition, which involves four main steps: image pre-processing, facial detection and cropping, sr, and facemask-wearing condition identification.
we proposed srcnet with a refined sr network to improve its performance on low-quality images. the results indicate that, by using sr before classification, cnns can achieve higher accuracy. besides, our experiments showed that deep learning methods can be used to identify facemask-wearing conditions, and thus have potential applications in epidemic prevention, including for covid-19. this study was mainly based on large-scale facial image datasets and the medical masks dataset. for the sr network, we proposed a new network architecture, including improvements in the activation functions and in the density of skip connections. these innovations led to considerable performance gains in detail enhancement and image restoration compared to previous state-of-the-art methods, as evaluated by psnr and ssim. the performance of the sr network was also visualized using images of different resolutions, where the proposed sr network restored more detail and contributed to the performance of identifying facemask-wearing condition. for facemask-wearing condition identification, the proposed srcnet innovatively combined the sr network with a facial identification cnn to improve performance. image pre-processing was utilized in srcnet for better performance, eliminating irrelevant variables in the images, such as background and differences in cameras, exposure, and contrast; in addition, superior detection of the facial area could be achieved with pre-processed images. transfer learning was also applied during facemask-wearing condition identification network training. finally, srcnet achieved 98.70% accuracy in the three-category classification (nfw, ifw, and cfw) and outperformed traditional end-to-end image classification methods without the sr network by over 1.5% in kappa. an ablation experiment was also conducted to illustrate the effects of transfer learning and the sr network, both of which were shown to contribute to the final network performance.
identification results for different types and colors of facemasks were also presented to demonstrate the robustness of srcnet. in addition, non-deep-learning approaches, including svm and knn, were compared and analyzed, and they also performed better on images processed by the sr network. our findings indicate that, by using an sr network and a pre-trained general facial recognition model, srcnet can achieve highly accurate results in identifying facemask-wearing condition. the identification of facemask-wearing conditions has many similarities to facial recognition. however, developing a facemask-wearing condition identification network is challenging for several reasons. the limited datasets are one main challenge: facemask-wearing condition datasets are generally small, and their image quality is not high enough compared to general facial recognition datasets. furthermore, the many different ways of wearing a facemask incorrectly largely increase the difficulty of identification. to overcome these challenges, srcnet was introduced, which utilizes both an sr network and transfer learning before classification. the sr network addresses the low-quality image problem, while transfer learning addresses the challenge of using a small dataset with varied incorrectly-worn-facemask examples; with these methods, the performance improved considerably. to our knowledge, there have not been any previous studies on facemask-wearing condition identification using deep learning. in our study, facemask-wearing condition was identified with 98.70% accuracy, indicating that srcnet has great potential to support automatic facemask-wearing condition identification applications. the design of srcnet also considers network complexity, being based on lightweight and efficient cnns for real-time facemask-wearing condition identification. the low computing resource requirements of srcnet mean that it can be deployed in public spaces using internet of things (iot) technologies, which is meaningful for urging the public to wear facemasks correctly for epidemic prevention. in summary, a new facemask-wearing condition identification method was proposed, which combines an sr network with a classification network (srcnet) for facial image classification. to identify facemask-wearing condition, the input images are processed with image pre-processing, facial detection and cropping, sr, and facemask-wearing condition identification. finally, srcnet achieved 98.70% accuracy and outperformed traditional end-to-end image classification methods by over 1.5% in kappa. our findings indicate that the proposed srcnet can achieve high accuracy in facemask-wearing condition identification, which is meaningful for the prevention of epidemic diseases, including covid-19, in public. there are a few limitations to our study. firstly, the medical masks dataset we used for facemask-wearing condition identification is relatively small and cannot cover all postures or environments. in addition, the dataset does not contain video, so the identification performance on a video stream could not be tested. as for the proposed algorithm, the identification time for a single image is somewhat long: an average of 10 images can be identified per second, which does not meet the basic video frame rate of 24 frames per second (fps). in future studies, a more extensive facemask-wearing dataset, including images and videos, will be collected and labelled with more detail, in order to improve the performance of srcnet.
the dataset shall contain faces with different postures, environments, and lighting conditions. in addition, srcnet will be improved, based on either single images or video with iot technologies, and a more efficient and accurate algorithm will be explored, which can contribute to the practical application of identifying facemask-wearing condition.
coronavirus disease (covid-19): a primer for emergency physicians
facemasks and hand hygiene to prevent influenza transmission in households: a cluster randomized trial
mathematical modeling of the effectiveness of facemasks in reducing the spread of novel influenza a (h1n1)
physical interventions to interrupt or reduce the spread of respiratory viruses
the use of facemasks to prevent respiratory infection: a literature review in the context of the health belief model
effectiveness of facemasks to reduce exposure hazards for airborne infections among general populations
covid-19: facemask use prevalence in international airports in asia
physical interventions to interrupt or reduce the spread of respiratory viruses: systematic review
convolutional networks and applications in vision
automatic identification of down syndrome using facial images with deep convolutional neural network
the visual social distancing problem
joint face detection and alignment using multitask cascaded convolutional networks
imagenet classification with deep convolutional neural networks
toward automatic phenotyping of developing embryos from videos
cnn-based multimodal human recognition in surveillance environments
an identity authentication method combining liveness detection and face recognition
image restoration using convolutional auto-encoders with symmetric skip connections. arxiv 2016
residual dense network for image super-resolution
detail-sensitive panoramic annular semantic segmentation through swaftnet for surrounding sensing. arxiv 2019
fusion of aerial images with mean shift-based upsampled elevation data for improved building block classification
fractional bat and multi-kernel-based spherical svm for low resolution face recognition
deep multiscale spectral-spatial feature fusion for hyperspectral images classification
attention-aware perceptual enhancement nets for low-resolution image classification
image super-resolution via sparse representation
a+: adjusted anchored neighborhood regression for fast super-resolution
deeply-recursive convolutional network for image super-resolution. arxiv 2015
accelerating the super-resolution convolutional neural network
perceptual losses for real-time style transfer and super-resolution
accurate image super-resolution using very deep convolutional networks
real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network
end-to-end learning face super-resolution with facial priors
deep laplacian pyramid networks for fast and accurate super-resolution
enhanced deep residual networks for single image super-resolution
image super-resolution using deep convolutional networks
learning a deep convolutional network for image super-resolution
very deep convolutional networks for large-scale image recognition. arxiv
going deeper with convolutions
deep residual learning for image recognition
inception-v4
xception: deep learning with depthwise separable convolutions
densely connected convolutional networks. arxiv 2016
efficient convolutional neural networks for mobile vision applications. arxiv 2017
mobilenetv2: inverted residuals and linear bottlenecks. arxiv 2018
shufflenet: an extremely efficient convolutional neural network for mobile devices
alexnet-level accuracy with 50x fewer parameters and <0.5mb model size. arxiv 2016
support vector machines
imagenet: a large-scale hierarchical image database
learning multiple layers of features from tiny images
dropout: a simple way to prevent neural networks from overfitting
polosukhin, i. attention is all you need
comparison of regularization methods for imagenet classification with deep convolutional neural networks
cmnet: a connect-and-merge convolutional neural network for fast vehicle detection in urban traffic surveillance
a deep network architecture for super-resolution-aided hyperspectral image classification with classwise loss
highly accurate facial nerve segmentation refinement from cbct/ct imaging using a super-resolution classification approach
object detection by a super-resolution method and a convolutional neural networks
convolutional low-resolution fine-grained classification
very low resolution face recognition problem
facial image super resolution using sparse representation for improving face recognition in surveillance monitoring
glasses detection and extraction by deformable contour
a real-time big data architecture for glasses detection using computer vision techniques
glasses detection on real images based on robust alignment
precise glasses detection algorithm for face with in-plane rotation
glasses detection using convolutional neural networks
a convolutional neural network based approach towards real-time hard hat detection
illuminant and device invariant colour using histogram equalisation
a comparison of deep networks with relu activation function and linear spline-type methods
searching for activation functions. arxiv 2017
identifying facial phenotypes of genetic disorders using deep learning
a fast learning algorithm for deep belief nets
reducing the dimensionality of data with neural networks
weakly supervised building segmentation by combining superpixel pooling and multi-scale feature fusion. remote sens. 2020
from facial parts responses to face detection: a deep learning approach
learning face representation from scratch
a method for stochastic optimization. arxiv
improving neural networks by preventing co-adaptation of feature detectors
image quality assessment: from error visibility to structural similarity
filters for common resampling tasks
memnet: a persistent memory network for image restoration
rethinking the inception architecture for computer vision
open source neural networks in c. 2013-2016
histograms of oriented gradients for human detection
this article is an open access article distributed under the terms and conditions of the creative commons attribution (cc by) license.
the authors declare no conflict of interest.
code availability: code used in the present work, including image processing and network structures, is available at https://github.com/brightqin/srcnet. the public dataset medical masks dataset is available at https://www.kaggle.com/vtech6/medical-masks-dataset.
key: cord-031663-i71w0es7 authors: giacobbe, mirco; henzinger, thomas a.; lechner, mathias title: how many bits does it take to quantize your neural network?
date: 2020-03-13 journal: tools and algorithms for the construction and analysis of systems doi: 10.1007/978-3-030-45237-7_5 sha: doc_id: 31663 cord_uid: i71w0es7 quantization converts neural networks into low-bit fixed-point computations which can be carried out by efficient integer-only hardware, and is standard practice for the deployment of neural networks on real-time embedded devices. however, like their real-numbered counterpart, quantized networks are not immune to malicious misclassification caused by adversarial attacks. we investigate how quantization affects a network's robustness to adversarial attacks, which is a formal verification question. we show that neither robustness nor non-robustness are monotonic with changing the number of bits for the representation and, also, neither are preserved by quantization from a real-numbered network. for this reason, we introduce a verification method for quantized neural networks which, using smt solving over bit-vectors, accounts for their exact, bit-precise semantics. we built a tool and analyzed the effect of quantization on a classifier for the mnist dataset. we demonstrate that, compared to our method, existing methods for the analysis of real-numbered networks often derive false conclusions about their quantizations, both when determining robustness and when detecting attacks, and that existing methods for quantized networks often miss attacks. furthermore, we applied our method beyond robustness, showing how the number of bits in quantization enlarges the gender bias of a predictor for students' grades. deep neural networks are powerful machine learning models and are becoming increasingly popular in software development. in recent years, they have pervaded our lives: think of the language recognition system of a voice assistant, the computer vision employed in face recognition or self-driving, not to mention the many decision-making tasks that are hidden under the hood. however, this also subjects them to the resource limits that real-time embedded devices impose. mainly, the requirements are low energy consumption, as such devices often run on batteries, and low latency, both to maintain user engagement and to effectively interact with the physical world. this translates into specializing the computation by reducing the memory footprint and the instruction set, to minimize cache misses and avoid costly hardware operations. for this purpose, quantization compresses neural networks, which are traditionally run over 32-bit floating-point arithmetic, into computations that require bit-wise and integer-only arithmetic over small words, e.g., 8 bits. quantization is the standard technique for the deployment of neural networks on mobile and embedded devices and is implemented in tensorflow lite [13]. in this work, we investigate the robustness of quantized networks to adversarial attacks and, more generally, formal verification questions for quantized neural networks. adversarial attacks are a well-known vulnerability of neural networks [24]. for instance, a self-driving car can be tricked into confusing a stop sign with a speed limit sign [9], or a home automation system can be commanded to deactivate the security camera by a voice reciting poetry [22]. the attack is carried out by superposing the innocuous input with a crafted perturbation that is imperceptible to humans. formally, the attack lies within the neighborhood of a known-to-be-innocuous input, according to some notion of distance.
the fraction of samples (from a large set of test inputs) that do not admit attacks determines the robustness of the network. we ask ourselves how quantization affects a network's robustness or, dually, how many bits it takes to ensure robustness above some specific threshold. this amounts to proving that, for a set of given quantizations and inputs, there does not exist an attack, which is a formal verification question. the formal verification of neural networks has been addressed either by overapproximating the space of outputs given a space of attacks, as happens in abstract interpretation, or by searching for a variable assignment that witnesses an attack, as happens in smt solving. the first category includes methods that relax the neural networks into computations over interval arithmetic [20], treat them as hybrid automata [27], or abstract them directly by using zonotopes, polyhedra [10], or tailored abstract domains [23]. overapproximation-based methods are typically fast but incomplete: they prove robustness but do not produce attacks. on the other hand, methods based on local gradient descent have turned out to be effective in producing attacks in many cases [16], but sacrifice formal completeness. indeed, the search for adversarial attacks is np-complete even for the simplest (i.e., relu) networks [14], which motivates the rise of methods based on satisfiability modulo theories (smt) and mixed integer linear programming (milp). smt solvers have been shown not to scale beyond toy examples (20 hidden neurons) on monolithic encodings [21], but today's specialized techniques can handle real-life benchmarks such as neural networks for the mnist dataset. specialized tools include dlv [12], which subdivides the problem into smaller smt instances, and planet [8], which combines different sat and lp relaxations. reluplex takes a step further, augmenting lp solving with a custom calculus for relu networks [14]. at the other end of the spectrum, a recent milp formulation turned out to be effective using off-the-shelf solvers [25]; moreover, it formed the basis for sherlock [7], which couples local search and milp, and for a specialized branch-and-bound algorithm [4]. all the techniques mentioned above do not reason about the machine-precise semantics of the networks, neither over floating- nor over fixed-point arithmetic, but about a real-number relaxation. unfortunately, adversarial attacks computed over the reals are not necessarily attacks on execution architectures, in particular for quantized network implementations. we show, for the first time, that attacks and, more generally, robustness and vulnerability to attacks do not always transfer between real and quantized networks, and also do not always transfer monotonically with the number of bits across quantized networks. verifying the real-valued relaxation of a network may lead to scenarios where (i) specifications are fulfilled by the real-valued network but not by its quantized implementation (false negatives), (ii) specifications are violated by the real-valued network but fulfilled by its quantized representation (false positives), or (iii) counterexamples witness that the real-valued network violates the specification but do not witness a violation for the quantized network (invalid counterexamples/attacks). more generally, we show that all three phenomena can occur non-monotonically with the precision of the numerical representation.
in other words, it may occur that a quantized network fulfills a specification while both a higher- and a lower-bit quantization violate it, or that the former violates it while both the higher- and lower-bit quantizations fulfill it; moreover, specific counterexamples may not transfer monotonically across quantizations. the verification of real-numbered neural networks using the available methods is therefore inadequate for the analysis of their quantized implementations, and the analysis of quantized neural networks needs techniques that account for their bit-precise semantics. recently, a similar problem has been addressed for binarized neural networks through sat solving [18]; binarized networks represent the special case of 1-bit quantizations. for many-bit quantizations, a method based on gradient descent has been introduced recently [28]; while efficient (and sound), this method is incomplete and may produce false negatives. we introduce, for the first time, a complete method for the formal verification of quantized neural networks. our method accounts for the bit-precise semantics of quantized networks by leveraging the first-order theory of bit-vectors without quantifiers (qf bv) to exactly encode hardware operations such as 2's complementation, bit-shifts, and integer arithmetic with overflow. on the technical side, we present a novel encoding which balances the layout of the long sequences of hardware multiply-add operations occurring in quantized neural networks. as a result, we obtain an encoding into a first-order logic formula which, in contrast to a standard unbalanced linear encoding, makes the verification of quantized networks practical and amenable to modern bit-precise smt solving. we built a tool using boolector [19], evaluated the performance of our encoding, compared its effectiveness against real-numbered verification and gradient descent for quantized networks, and finally assessed the effect of quantization for different networks and verification questions. we measured the robustness to attacks of a neural classifier involving 890 neurons and trained on the mnist dataset (handwritten digits), for quantizations between 6 and 10 bits. first, we demonstrated that boolector, off-the-shelf and using our balanced smt encoding, can compute every attack within 16 hours, with a median time of 3h 41m, while it timed out on all instances beyond 6 bits using a standard linear encoding. second, we experimentally confirmed that both reluplex and gradient descent for quantized networks can produce false conclusions about quantized networks; in particular, spurious results occurred consistently more frequently as the number of bits in the quantization decreased. third, we discovered that, to achieve an acceptable level of robustness, it takes a higher-bit quantization than standard accuracy measures suggest. lastly, we applied our method beyond the property of robustness: we also evaluated the effect of quantization on the gender bias emerging from quantized predictors of students' performance in mathematics exams. more precisely, we computed the maximum predictable grade gap between any two students with identical features except for gender. the experiment showed that a substantial gap exists and is proportionally enlarged by quantization: the lower the number of bits, the larger the gap. we summarize our contribution in five points.
first, we show that the robustness of quantized neural networks is non-monotonic in the number of bits and is not transferable from the robustness of their real-numbered counterparts. second, we introduce the first complete method for the verification of quantized neural networks. third, we demonstrate that our encoding, in contrast to standard encodings, enabled the state-of-the-art smt solver boolector to verify quantized networks with hundreds of neurons. fourth, we also show that existing methods determine both robustness and vulnerability of quantized networks less accurately than our bit-precise approach, in particular for low-bit quantizations. fifth, we illustrate how quantization affects the robustness of neural networks, not only with respect to adversarial attacks, but also with respect to other verification questions, specifically fairness in machine learning. a feed-forward neural network consists of a finite set of neurons x_1, . . . , x_k partitioned into a sequence of layers: an input layer with n neurons, followed by one or many hidden layers, finally followed by an output layer with m neurons. every pair of neurons x_j and x_i in respectively subsequent layers is associated with a weight coefficient w_ij ∈ r; if the layer of x_j is not subsequent to that of x_i, then we assume w_ij = 0. every hidden or output neuron x_i is associated with a bias coefficient b_i ∈ r. the real-valued semantics of the neural network gives each neuron a real value: upon a valuation of the neurons in the input layer, every other neuron x_i assumes its value according to the update rule $x_i = \text{relu-}N\big(b_i + \sum_{j=1}^{k} w_{ij} x_j\big)$, where relu-n : r → r is the activation function. altogether, the neural network implements a function f : r^n → r^m whose result corresponds to the valuation of the neurons in the output layer. the activation function governs the firing logic of the neurons, layer by layer, by introducing non-linearity into the system. among the most popular activation functions are purely non-linear functions, such as the hyperbolic tangent and the sigmoid function, and piece-wise linear functions, better known as rectified linear units (relu) [17]. relu consists of the function that takes the positive part of its argument, i.e., relu(x) = max{x, 0}. we consider the variant of relu that imposes a cap value n, known as relu-n [15]; precisely, $\text{relu-}N(x) = \min\{\max\{x, 0\}, N\}$, which can alternatively be seen as a concatenation of two relu functions (see eq. 10). as a consequence, all neural networks we treat are full-fledged relu networks; their real-valued versions are amenable to state-of-the-art verification tools including reluplex, but such tools account for neither the exact floating-point nor the exact fixed-point execution model. quantizing consists of converting a neural network over real numbers, which is normally deployed on floating-point architectures, into a neural network over integers, whose semantics corresponds to a computation over fixed-point arithmetic [13]. specifically, fixed-point arithmetic can be carried out by integer-only architectures, possibly over small words, e.g., 8 bits. all numbers are represented in 2's complement over b-bit words, and f bits are reserved for the fractional part: we call the result a b-bit quantization in qf arithmetic. more concretely, the conversion follows from the rounding of the weight and bias coefficients, $\bar{w}_{ij} = \mathrm{rnd}(2^f w_{ij})$ and $\bar{b}_i = \mathrm{rnd}(2^f b_i)$, where rnd(·) stands for any rounding to an integer.
then, the fundamental relation between a quantized value $\bar{a}$ and its real counterpart a is $a \approx 2^{-f}\bar{a}$. consequently, the semantics of a quantized neural network corresponds to the update rule in eq. 1 after substituting x, w, and b with the respective approximants $2^{-f}\bar{x}$, $2^{-f}\bar{w}$, and $2^{-f}\bar{b}$. namely, the semantics amounts to $\bar{x}_i = \mathrm{clamp}_{[0,\,2^f N]}\big(\mathrm{int}\big(2^{-f}\sum_{j=1}^{k}\bar{w}_{ij}\bar{x}_j\big) + \bar{b}_i\big)$, where int(·) truncates the fractional part of its argument or, in other words, rounds towards zero. in summary, the update rule for the quantized semantics consists of four parts. the first part, i.e., the linear combination $\sum_{j=1}^{k}\bar{w}_{ij}\bar{x}_j$, propagates all neuron values from the previous layer, obtaining a value with possibly 2f fractional bits. the second scales the result by $2^{-f}$, truncating the fractional part by, in practice, applying an arithmetic shift to the right by f bits. finally, the third applies the bias $\bar{b}$ and the fourth clamps the result between 0 and $2^f N$. as a result, a quantized neural network realizes a function f : z^n → z^m, which exactly represents the concrete (integer-only) hardware execution. we assume all intermediate values, e.g., of the linear combination, to be fully representable as, coherently with the common execution platforms [13], we always allocate enough bits for under- and overflow not to happen. hence, any loss of precision with respect to the real-numbered network happens exclusively, at each layer, as a consequence of rounding the result of the linear combination to f fractional bits. notably, rounding causes the robustness to adversarial attacks of quantized networks with different quantization levels to be independent of one another, and independent of their real counterpart. a neural classifier is a neural network that maps an n-dimensional input to one out of m classes, each of which is identified by the output neuron with the largest value, i.e., for the output values z_1, . . . , z_m, the choice is given by $\mathrm{class}(z_1,\dots,z_m) = \arg\max_{1\le i\le m} z_i$. for example, a classifier for handwritten digits takes as input the pixels of an image and returns 10 outputs z_0, . . . , z_9, where the largest indicates the digit the image represents. an adversarial attack is a perturbation for a sample input (original + perturbation = attack) that, according to some notion of closeness, is indistinguishable from the original, but tricks the classifier into inferring an incorrect class. the attack in fig. 1 is indistinguishable from the original by the human eye, but induces our classifier to assign the largest value to z_3, rather than z_9, misclassifying the digit as a 3. for this example, the misclassification happens consistently, both on the real-numbered network and on the respective 8-bit quantized network in q4 arithmetic. unfortunately, attacks do not necessarily transfer between real and quantized networks, nor between quantized networks of different precision. more generally, attacks and, dually, robustness to attacks are non-monotonic in the number of bits. we give a prototypical example of the non-monotonicity of quantized networks in fig. 2. the network consists of one input, 4 hidden, and 2 output neurons, respectively from left to right. the weight and bias coefficients, which are annotated on the edges, are all fully representable in q1. for the neurons in the top row we show, respectively from top to bottom, the valuations obtained using a q3, q2, and q1 quantization of the network (following eq. 4); precisely, we show their fractional counterpart $\bar{x}/2^f$.
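to make the fixed-point semantics concrete, the following is a small python sketch of the quantized update rule described above (linear combination, truncation towards zero of the 2^-f scaling, bias, and clamping to [0, 2^f n]); it illustrates the semantics only, is not the authors' tool, and the helper names and toy numbers are assumptions.

```python
def quantize(value, f):
    """Round a real-valued coefficient to a Qf fixed-point integer."""
    return int(round(value * (1 << f)))

def quantized_layer(x_bar, w_bar, b_bar, f, cap_n):
    """One layer of the quantized semantics.

    x_bar: integer neuron values of the previous layer
    w_bar[i][j], b_bar[i]: integer weights and biases of the current layer
    f: number of fractional bits; cap_n: real-valued cap of ReLU-N
    """
    out = []
    for wi, bi in zip(w_bar, b_bar):
        acc = sum(wij * xj for wij, xj in zip(wi, x_bar))  # 2f fractional bits
        scaled = int(acc / (1 << f))     # int(2^-f * acc): truncation towards zero
        clamped = min(max(scaled + bi, 0), cap_n * (1 << f))
        out.append(clamped)
    return out

# toy usage: one layer with two neurons in Q4 arithmetic (f = 4), cap N = 7
f = 4
w = [[quantize(0.5, f), quantize(-0.25, f)], [quantize(1.5, f), quantize(0.75, f)]]
b = [quantize(0.125, f), quantize(-0.5, f)]
x = [quantize(1.0, f), quantize(2.0, f)]
print(quantized_layer(x, w, b, f, cap_n=7))
```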
we evaluate all quantizations and obtain that the valuations for the top output neuron are non-monotonic in the number of fractional bits; in fact, the q1 output dominates the q3 output, which dominates the q2 output. coincidentally, the valuations for the q3 quantization correspond to the valuations with real-number precision (i.e., they never undergo truncation), indicating that real and quantized networks are similarly incomparable. notably, all these phenomena occur both for quantized networks with rounding towards zero (as in the example) and with rounding to the nearest, which is naturally non-monotonic (e.g., 5/16 rounds to 1/2, 1/4, and 3/8 with, respectively, q1, q2, and q3). non-monotonicity of the output causes non-monotonicity of robustness, as we can place the decision boundary of the classifier so as to put q2 into a different class than q1 and q3. suppose the original sample is 3/2, its class is associated with the output neuron on the top, and attacks can only lie in the neighboring interval 3/2 ± 1. in this case, the q2 network admits an attack, because the bottom output neuron can take the value 5/2, which is larger than 2. on the other hand, the bottom output can never exceed 3/8 and 1/2, hence q1 and q3 are robust. dually, non-robustness is also non-monotonic: for the sample 9/2, whose class corresponds to the bottom neuron, and the interval 9/2 ± 2, q2 is robust while both q3 and q1 are vulnerable. notably, the specific attacks of q3 and q1 also do not always coincide, as is the case, for instance, for 7/2. robustness and non-robustness are non-monotonic in the number of bits for quantized networks. as a consequence, verifying a high-bit quantization, or a real-valued network, may lead to false conclusions about a target lower-bit quantization, in either direction. specifically, for the question of whether an attack exists, we may have both (i) false negatives, i.e., the verified network is robust but the target network admits an attack, and (ii) false positives, i.e., the verified network is vulnerable while the target network is robust. in addition, we may also have (iii) true positives with invalid attacks, i.e., both are vulnerable but the found attack does not transfer to the target network. for these reasons we introduce a verification method for quantized neural networks that accounts for their bit-precise semantics. bit-precise smt solving comprises various technologies for deciding the satisfiability of first-order logic formulae whose variables are interpreted as bit-vectors of fixed size. in particular, it produces satisfying assignments (if any exist) for formulae that include bitwise and arithmetic operators whose semantics corresponds to that of hardware architectures. for instance, we can encode bit-shifts, 2's complementation, multiplication and addition with overflow, and signed and unsigned comparisons. more precisely, this is the quantifier-free first-order theory of bit-vectors (i.e., qf bv), which we employ to produce a monolithic encoding of the verification problem for quantized neural networks. a verification problem for the neural networks f_1, . . . , f_k consists of checking the validity of a statement of the form $\varphi(x_1,\dots,x_k) \implies \psi(f_1(x_1),\dots,f_k(x_k))$, where ϕ is a predicate over the inputs and ψ a predicate over the outputs of all networks; in other words, it consists of checking an input-output relation, which generalizes various verification questions, including robustness to adversarial attacks and fairness in machine learning, which we treat in sec. 5.
for the purpose of smt solving, we encode the verification problem in eq. 6, which is a validity question, by its dual satisfiability question $\neg\psi(y_1,\dots,y_k) \wedge \bigwedge_{i=1}^{k} y_i = f_i(x_i) \wedge \varphi(x_1,\dots,x_k)$, whose satisfying assignments constitute counterexamples for the contract. the formula consists of three conjuncts: the rightmost constrains the input within the assumption, the leftmost forces the output to violate the guarantee, while the one in the middle relates inputs and outputs by the semantics of the neural networks. the semantics of the networks consists of the bit-level translation of the update rule in eq. 4 over all neurons, which we encode in the formula $\bigwedge_{i}\big(x_i = \text{relu-}2^f\!N(x_i')\ \wedge\ x_i' = \mathrm{ashr}(x_i'', f) + \bar{b}_i\ \wedge\ x_i'' = \sum_{j}\bar{w}_{ij} x_j\big)$. each conjunct in the formula employs three variables x, x', and x'' and is made of three respective parts. the first part accounts for the operation of clamping between 0 and $2^f N$, whose semantics is given by the formula relu-m(x) = ite(sign(x), 0, ite(x ≥ m, m, x)). then, the second part accounts for the operations of scaling and biasing; in particular, it encodes the operation of rounding by truncation scaling, i.e., $\mathrm{int}(2^{-f}x)$, as an arithmetic shift to the right. finally, the last part accounts for the propagation of values from the previous layer which, despite the obvious optimization of pruning away all monomials with null coefficients, often consists of long linear combinations, whose exact semantics amounts to a sequence of multiply-add operations over an accumulator; in particular, encoding it requires care in choosing the variable sizes and the association layout. the size of the bit-vector variables determines whether overflows can occur. in particular, since every monomial $\bar{w}_{ij}\bar{x}_j$ consists of the multiplication of two b-bit variables, its result requires 2b bits in the worst case; since summation increases the value linearly, its result requires a logarithmic number of extra bits in the number of summands (regardless of the layout). given that, we avoid overflow by using variables of 2b + log k bits, where k is the number of summands. the association layout is not unique and, more precisely, varies with the order of construction of the long summation. for instance, associating from left to right produces a linear layout, as in fig. 3a. the long linear combinations occurring in quantized neural networks are implemented as sequences of multiply-add operations over a single accumulator; this naturally induces a linear encoding. instead, for the purpose of formal verification, we propose a novel encoding which re-associates the linear combination by recursively splitting the sum into equal parts, producing a balanced layout as in fig. 3b. while the linear and balanced layouts are semantically equivalent, we have observed that, in practice, the second positively impacts the performance of the smt solver, as we discuss in sec. 5, where we also compare against other methods and investigate different verification questions. we set up an experimental evaluation benchmark based on the mnist dataset to answer the following three questions. first, how does our balanced encoding scheme impact the runtime of different smt solvers compared to a standard linear encoding? then, how often can robustness properties that are proven for the real-valued network be transferred to the quantized network, and vice versa? finally, how often do gradient-based attacking procedures miss attacks on quantized networks? the mnist dataset is a well-studied computer vision benchmark, which consists of 70,000 handwritten digits represented by 28-by-28 pixel images with a single 8-bit grayscale channel.
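the difference between the linear and the balanced association layout can be illustrated with a few lines of smt construction; the sketch below uses the z3 python api for convenience (the paper's tool builds on boolector), and the bit widths and helper names are illustrative assumptions rather than the paper's exact encoding.

```python
from z3 import BitVec, BitVecVal, SignExt, simplify

def balanced_sum(terms):
    """Re-associate a long sum as a balanced binary tree of additions."""
    if len(terms) == 1:
        return terms[0]
    mid = len(terms) // 2
    return balanced_sum(terms[:mid]) + balanced_sum(terms[mid:])

def linear_combination(weights, xs, in_bits, out_bits, balanced=True):
    """Encode sum_j w_j * x_j over bit-vectors wide enough to avoid overflow.

    weights: integers (already quantized); xs: z3 bit-vector variables.
    """
    ext = out_bits - in_bits
    terms = [SignExt(ext, x) * BitVecVal(w, out_bits)
             for w, x in zip(weights, xs) if w != 0]   # prune null coefficients
    if not terms:
        return BitVecVal(0, out_bits)
    if balanced:
        return balanced_sum(terms)
    acc = terms[0]                      # linear layout: left-to-right accumulation
    for t in terms[1:]:
        acc = acc + t
    return acc

# usage: three 8-bit inputs, accumulator widened to 2*8 + 2 = 18 bits
xs = [BitVec(f"x{j}", 8) for j in range(3)]
print(simplify(linear_combination([3, -5, 7], xs, in_bits=8, out_bits=18)))
```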
each sample belongs to exactly one category in {0, 1, . . . , 9}, which a machine learning model must predict from the raw pixel values. the mnist set is split into 60,000 training and 10,000 test samples. we trained a neural network classifier on mnist, following a post-training quantization scheme [13]. first, we trained, using tensorflow with floating-point precision, a network composed of 784 inputs, 2 hidden layers of sizes 64 and 32 with the relu-7 activation function, and 10 outputs, for a total of 890 neurons. the classifier yielded a standard accuracy, i.e., the ratio of samples that are correctly classified out of all samples in the test set, of 94.7% on the floating-point architecture. afterward, we quantized the network with various bit sizes, with the exception that the input layer was always quantized in 8 bits, i.e., the original precision of the samples. the quantized networks required at least q3 with 7 total bits to obtain an accuracy above 90%, and q5 with 10 total bits to reach 94%. for this reason, we focused our study on the quantizations from 6 to 10 total bits, i.e., q2 to q6 arithmetic, respectively. robust accuracy or, more simply, robustness measures the ratio of robust samples: for a distance ε > 0, a sample a is robust when, for all its perturbations y within that distance, the classifier class ∘ f chooses the original class c = class ∘ f(a). in other words, a is robust if class ∘ f(y) = c for all y within distance ε of a where, in particular, the right-hand side can be encoded as ⋀_{j=1}^{m} z_j ≤ z_c, for z = f(y). robustness is a validity question as in eq. 6, and any witness for the dual satisfiability question constitutes an adversarial attack. we checked the robustness of our selected networks over the first 300 test samples from the dataset, with ε = 1 on the first 200 and ε = 2 on the next 100; in particular, we tested our encoding using the smt solvers boolector [19], z3 [5], and cvc4 [3], off-the-shelf. our experiments serve two purposes. the first is evaluating the scalability and precision of our approach. as for scalability, we study how the encoding layout, i.e., linear or balanced, and the number of bits affect the runtime of the smt solver. as for precision, we measure the gap between our method and both a formal verifier for real-numbered networks, i.e., reluplex [14], and the ifgsm algorithm [28], with respect to the accuracy of identifying robust and vulnerable samples. the second purpose of our experiments is evaluating the effect of quantization on the robustness to attacks of our mnist classifier and, with an additional experiment, measuring the effect of quantization on the gender fairness of a student-grade predictor, also demonstrating the expressiveness of our method beyond adversarial attacks. as we only compare verification outcomes, any complete verifier for real-numbered networks would lead to the same results as those obtained with reluplex. note that these tools verify the real-numbered abstraction of the network using some form of linear real arithmetic reasoning; consequently, rounding errors introduced by the floating-point implementation of both the network and the verifier are not taken into account. we evaluated whether our balanced encoding strategy, compared to a standard linear encoding, can improve the scalability of contemporary smt solvers for quantifier-free bit-vectors (qf_bv) in checking specifications of quantized neural networks.
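as a rough illustration of the quantized semantics described above, the following numpy sketch performs a fixed-point forward pass with truncation and a clamped relu-n activation; the exact placement of the bias addition and the choice of integer/fractional split are assumptions on our part, not details taken from the quantization scheme of [13].

```python
# A minimal sketch (ours) of a fixed-point forward pass with per-layer
# truncation and a clamped ReLU-N activation; rounding is toward zero.
import numpy as np

def to_fixed(x, f):
    # real -> integer representation with f fractional bits (truncation)
    return np.trunc(x * (2 ** f)).astype(np.int64)

def relu_n_fixed(acc, f, n=7):
    m = n << f                          # clamp ceiling N expressed in fixed point
    return np.clip(acc, 0, m)

def quantized_layer(x_fix, w_real, b_real, f, n=7):
    w_fix = to_fixed(w_real, f)
    b_fix = to_fixed(b_real, f)
    acc = x_fix @ w_fix                 # products carry 2f fractional bits
    acc = (acc >> f) + b_fix            # arithmetic shift right = truncation scaling
    return relu_n_fixed(acc, f, n)

rng = np.random.default_rng(0)
x = to_fixed(rng.random(784), f=4)
h = quantized_layer(x, rng.standard_normal((784, 64)) * 0.05,
                    np.zeros(64), f=4)
```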
we ran all our experiments on an intel xeon w-2175 cpu with 64gb of memory, a 128gb swap file, and a time budget of 16 hours per problem instance. we encoded each instance using the two variants, the standard linear layout and our balanced layout, and scheduled 14 solver instances in parallel, i.e., the number of physical processor cores available on our machine.

table 1: median runtimes for bit-exact robustness checks. the term oot refers to timeouts, and oom refers to out-of-memory errors. due to the poor performance of z3, cvc4, and yices2 on our smallest 6-bit network, we abstained from running experiments involving more than 6 bits, i.e., entries marked by a dash (-).

smt-solver      encoding             6-bit    7-bit    8-bit    9-bit    10-bit
boolector [19]  linear (standard)    3h 25m   oot      oot      oot      oot
boolector [19]  balanced (ours)      18m      1h 29m   3h 41m   5h 34m   8h 58m

while z3, cvc4 and yices2 timed out or ran out of memory on the 6-bit network, boolector could check the instances of our smallest network within the given time budget, independently of the employed encoding scheme. our results align with the smt-solver performances reported for the smt-comp 2019 competition in the qf_bv division [11]. consequently, we focus our discussion on the results obtained with boolector. with the linear layout, boolector timed out on all instances except those of the smallest network (6 bits), while with the balanced layout it checked all instances, with an overall median runtime of 3h 41m that, as shown in tab. 1, roughly doubles with every added bit, as also confirmed by the histogram in fig. 4. our results demonstrate that the balanced association layout improves the performance of the smt solver, enabling it to scale to networks beyond 6 bits; conversely, the standard linear encoding turned out to be ineffective on all tested smt solvers. besides, our method tackled networks with 890 neurons which, while small compared to state-of-the-art image classification models, already pose challenging benchmarks for the formal verification task. in the real-numbered world, for instance, off-the-shelf solvers could initially tackle up to 20 neurons [20], and modern techniques, while faster, are often evaluated on networks below 1000 neurons [14, 4]. additionally, we pushed our method to its limits, refining our mnist network to a four-layer deep convolutional network (2 convolutional + 2 fully-connected layers) with a total of 2238 neurons, which achieved a test accuracy of 98.56%. while for the 6-bit quantization we proved robustness for 99% of the tested samples within a median runtime of 3h 39m, for 7 bits and above all instances timed out. notably, reluplex also failed on the real-numbered version, reporting numerical instability. looking at existing methods for verification, one has two options for verifying quantized neural networks: verifying the real-valued network and hoping the functional property is preserved when quantizing the network, or relying on incomplete methods and hoping no counterexample is missed. the question that emerges is how accurate these two approaches are for verifying the robustness of a quantized network. to answer this question, we used reluplex [14] to prove the robustness of the real-valued network. additionally, we compared against the iterative fast gradient sign method (ifgsm), which has recently been proposed to generate ∞-bounded adversarial attacks for quantized networks [28]; notably, ifgsm is incomplete in the sense that it may miss attacks. we then compared these two verification outcomes to the ground truth obtained by our approach.
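an evaluation harness of this kind can be approximated by a short script like the one below; it assumes a solver binary named boolector that accepts an smt-lib2 file as its argument and prints sat/unsat on standard output, which is an assumption about the command-line interface rather than a documented detail of the setup, and the instance file names are placeholders.

```python
# Sketch: run one solver process per physical core, 16h budget per instance.
import subprocess
from concurrent.futures import ThreadPoolExecutor

TIME_BUDGET = 16 * 3600          # 16 hours per problem instance
SOLVER = "boolector"             # assumed to read the .smt2 file given as argument

def run_instance(path):
    try:
        out = subprocess.run([SOLVER, path], capture_output=True,
                             text=True, timeout=TIME_BUDGET)
        first = (out.stdout.strip().splitlines() or ["unknown"])[0]
        return path, first       # typically "sat" (attack found) or "unsat"
    except subprocess.TimeoutExpired:
        return path, "oot"

instances = [f"robustness_{i}.smt2" for i in range(300)]  # hypothetical files
with ThreadPoolExecutor(max_workers=14) as pool:          # 14 physical cores
    results = dict(pool.map(run_instance, instances))
```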
in our study, we employ the following notation. we use the term "false negative" (i) to describe cases in which the quantized network can be attacked while no attack exists that fools the real-numbered network. conversely, the term "false positive" (ii) describes cases in which a real-numbered attack exists while the quantized network is robust. furthermore, we use the term "invalid attack" (iii) for attacks produced for the real-valued network that fool the real-valued network but not the quantized network. regarding the real-numbered encoding, reluplex accepts only pure relu networks. for this reason, we translate our relu-n networks into functionally equivalent relu networks, replacing each relu-n unit by a composition of standard relu units via the identity relu-n(x) = n − relu(n − relu(x)). out of the 300 samples, at least one method timed out on 56 samples, leaving us with 244 samples whose results were computed over all networks. tab. 2 depicts how frequently the robustness property could be transferred from the real-valued network to the quantized networks. not surprisingly, we observed the trend that, when increasing the precision of the network, the error between the quantized model and the real-valued model decreases. however, even for the 10-bit model, in 0.8% of the tested samples verifying the real-valued model leads to a wrong conclusion about the robustness of the quantized network. moreover, our results show the existence of samples where the 10-bit network is robust while the real-valued one is attackable, and vice versa. the invalid attacks illustrate that the higher the precision of the quantization, the more targeted the attacks need to be: while 94% of the attacks generated for the real-valued network represented valid attacks on the 7-bit model, this percentage decreased to 80% for the 10-bit network.

table 2: transferability of vulnerability from the verification outcome of the real-valued network to the verification outcome of the quantized model. while vulnerability is transferable between the real-valued and the higher-precision networks (9 and 10 bits) in most of the tested cases, the discrepancy increases significantly when compressing the networks with fewer bits, see columns (i) and (ii).

next, we compared how well incomplete methods are suited to reasoning about the robustness of quantized neural networks. we employed ifgsm to attack the 244 test samples for which we obtained the ground-truth robustness and measured how often ifgsm is correct in assessing the robustness of the network. for the sake of completeness, we performed the same analysis for the real-valued network. our results in tab. 3 show the trend that with higher precision, e.g., 10 bits or reals, incomplete methods provide a stable estimate of the robustness of the network, i.e., ifgsm was able to find attacks for all non-robust samples. however, for lower precision levels, ifgsm missed a substantial number of attacks, i.e., for the 7-bit network, ifgsm could not find a valid attack for 10% of the non-robust samples. in tab. 3 we also show how standard accuracy and robust accuracy degrade on our mnist classifier when increasing the compression level. the data indicate a constant discrepancy between standard accuracy and robustness; for real-numbered networks, a similar fact was already known in the literature [26]: we empirically confirm that observation for our quantized networks, whose discrepancy fluctuated between 3 and 4% across all precision levels.
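for reference, a generic sketch of the iterative fast gradient sign method is given below; it follows the usual formulation (repeated signed-gradient steps projected back onto the ∞-ball), implemented here with pytorch, and the step size and iteration count are illustrative choices rather than the settings used in [28].

```python
# Generic IFGSM sketch: maximize the loss with signed gradient steps while
# staying inside the L-infinity ball of radius eps around the original input.
import torch
import torch.nn.functional as F

def ifgsm_attack(model, x0, label, eps, steps=20, alpha=None):
    alpha = alpha if alpha is not None else eps / steps
    x = x0.clone()
    for _ in range(steps):
        x.requires_grad_(True)
        loss = F.cross_entropy(model(x), label)
        grad, = torch.autograd.grad(loss, x)
        with torch.no_grad():
            x = x + alpha * grad.sign()                 # ascend the loss
            x = torch.clamp(x, x0 - eps, x0 + eps)      # stay in the L-inf ball
            x = torch.clamp(x, 0.0, 1.0)                # stay in the input domain
    pred = model(x).argmax(dim=-1)
    return x, (pred != label).any().item()              # attack found?
```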
besides, while an acceptable standard accuracy (larger than 90%) was achieved at 7 bits, an equally acceptable robustness was only achieved at 9 bits. one relationship not shown in tab. 3 is that these 4% of non-robust samples are not the same across quantization levels. for instance, we observed samples that are robust for the 7-bit network but attackable when quantizing with 9 and 10 bits; conversely, there are attacks on the 7-bit network whose samples are robust in the 8-bit network. concerns have been raised that the decisions of an ml system could discriminate against certain groups due to a bias in the training data [2]. a vital issue in quantifying fairness is that neural networks are black boxes, which makes it hard to explain how each input contributes to a particular decision. we trained a network on a publicly available dataset consisting of 1000 students' personal information and academic test scores [1]. the personal features include gender, parental level of education, lunch plans, and whether the student took a preparation course for the test, all of which are discrete variables. we train a predictor for the students' math scores, which is a discrete variable between 0 and 100. notably, the dataset contains a potential source of gender bias: the mean math score among females is 63.63, while it is 68.73 among males. the network we trained is composed of 2 hidden layers with 64 and 32 units, respectively. we use a 7-bit quantization-aware training scheme, achieving a 4.14% mean absolute error, i.e., the difference between predicted and actual math scores on the test set. the network is fair if the gender of a person influences the predicted math score by at most the bias β. in other words, checking fairness amounts to verifying that ⋀_{i ≠ gender} s_i = t_i =⇒ |f(s) − f(t)| ≤ β (11) is valid over the variables s and t, which respectively model two students whose gender differs but whose other features are all identical; we call such a pair twin students. when we encode the dual formula, we encode two copies of the semantics of the same network: to one copy we give one student s and take the respective grade g, to the other we give its twin t and take grade h; precisely, we check for unsatisfiability the negation of the formula in eq. 11. then, we compute a tight upper bound for the bias, that is, the maximum possible change in predicted score for any two twins. to compute the tightest bias, we progressively increase β until our encoded formula becomes unsatisfiable. we measure the mean test error and gender bias of the 6- to 10-bit quantizations of the network, and we report the results for each quantization level: the mean test error stayed between 4.1 and 4.6% among all quantizations, showing that the change in precision did not affect the quality of the network in a way that was perceivable by standard measures. however, our formal analysis confirmed a gender bias in the network, producing twins with a 15- to 21-point difference in predicted math score. surprisingly, the bias increased monotonically as the precision level of the quantization was lowered, indicating to us that quantization plays a role in determining the bias. we introduced the first complete method for the verification of quantized neural networks which, by smt solving over bit-vectors, accounts for their bit-precise semantics. we demonstrated, both theoretically and experimentally, that bit-precise reasoning is necessary to accurately establish the robustness of a quantized network to adversarial attacks.
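the search for the tightest bias bound can be sketched as a simple loop; the function twin_bias_exceeds below is a hypothetical wrapper around the smt query (returning true iff the dual formula is satisfiable for the given β), not an interface provided by any solver.

```python
# Sketch of the tight-bias search described above.
def tightest_bias(twin_bias_exceeds, start=0, step=1, limit=100):
    beta = start
    while beta <= limit and twin_bias_exceeds(beta):
        beta += step            # a counterexample exists: the bound is not yet tight
    return beta                 # first beta for which the query is unsatisfiable
```

in use, twin_bias_exceeds(beta) would assert two copies of the network semantics, the twin constraint on all non-gender features, and |g − h| > beta, and then invoke the bit-vector solver.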
we showed that robustness and non-robustness are non-monotonic in the number of bits of the numerical representation and that, consequently, the analysis of high-bit or real-numbered networks may lead to false conclusions about their lower-bit quantizations. experimentally, we confirmed that real-valued solvers produce many spurious results, especially on low-bit quantizations, and that gradient descent may also miss attacks. additionally, we showed that quantization affects not only robustness but also other properties of neural networks, such as fairness. we also demonstrated that, using our balanced encoding, off-the-shelf smt solving can analyze networks with hundreds of neurons which, despite hitting the limits of current solvers, establishes an encouraging baseline for future research.

[1] students performance in exams
[2] fairness in machine learning
[3] cvc4. in: international conference on computer aided verification
[4] a unified view of piecewise linear neural network verification
[5] z3: an efficient smt solver
[6] yices 2.2
[7] output range analysis for deep feedforward neural networks
[8] formal verification of piece-wise linear feed-forward neural networks
[9] robust physical-world attacks on deep learning models
[10] ai2: safety and robustness certification of neural networks with abstract interpretation
[11] smt-comp 2019
[12] safety verification of deep neural networks. in: cav (1)
[13] quantization and training of neural networks for efficient integer-arithmetic-only inference
[14] reluplex: an efficient smt solver for verifying deep neural networks
[15] convolutional deep belief networks on cifar-10. unpublished manuscript
[16] deepfool: a simple and accurate method to fool deep neural networks
[17] rectified linear units improve restricted boltzmann machines
[18] verifying properties of binarized deep neural networks. in: aaai
[19] boolector 2.0
[20] an abstraction-refinement approach to verification of artificial neural networks
[21] challenging smt solvers to verify neural networks
[22] adversarial attacks against automatic speech recognition systems via psychoacoustic hiding
[23] an abstract domain for certifying neural networks
[24] intriguing properties of neural networks
[25] evaluating robustness of neural networks with mixed integer programming
[26] robustness may be at odds with accuracy
[27] output reachable set estimation and verification for multilayer neural networks
[28] to compress or not to compress: understanding the interactions between adversarial attacks and neural network compression

an early version of this paper was put into the easychair repository as easychair preprint no. 1000. this research was supported in part by the austrian science fund (fwf) under grants s11402-n23 (rise/shine) and z211-n23 (wittgenstein award), and in part by the aerospace technology institute (ati), the department for business, energy & industrial strategy (beis), and innovate uk under the hiclass project (113213).
key: cord-203872-r3vb1m5p authors: baten, raiyan abdul; ghoshal, gourab; hoque, mohammed ehsan title: availability of demographic cues can negatively impact creativity in dynamic social networks date: 2020-07-12 journal: nan doi: nan sha: doc_id: 203872 cord_uid: r3vb1m5p as the world braces itself for a pandemic-induced surge in automation and a consequent (accelerated) shift in the nature of jobs, it is essential now more than ever to understand how people's creative performances are impacted by their interactions with peers in a social network. however, when it comes to creative ideation, it is unclear how the demographic cues of one's peers can influence the network dynamics and the associated performance outcomes of people. in this paper, we ask: (1) given the task of creative idea generation, how do social network connectivities adapt to people's demographic cues? (2) how are creative outcomes influenced by such demography-informed network dynamics? we find that link formations in creativity-centric networks are primarily guided by the creative performances of one's peers. however, in the presence of demographic information, the odds of same-gender links persisting increase by 82.03%, after controlling for merit-based link persistence. in essence, homophily-guided link persistence takes place when demographic cues are available. we further find that the semantic similarities between socially stimulated idea-sets increase significantly in the presence of demographic cues (p<1e-4), which is counter-productive for the purposes of divergent creativity. this result can partly be explained by the observation that people's ideas tend to be more homogeneous within demographic groups than between demographic groups (p<1e-7). therefore, choosing to maintain connections based on demographic similarity can negatively impact one's creative inspiration sources by taking away potential diversity bonuses. our results can inform intelligent intervention possibilities towards maximizing a social system's creative outcomes.

fig. 1: (a) the bipartite network structure used in the study. the ideas of the alters were pre-recorded and later shown to the egos in both study conditions. each ego was connected to 2 alters. (b) the study protocol for each of the 5 rounds. in turn 1, the participants generated alternate use ideas on their own for a given prompt object. in turn 2, egos in the control condition were shown only the ideas of their alters, while both the ideas and the demographic information of the alters were shown to the egos in the treatment condition. the egos could add further inspired ideas to their lists. finally, the egos rated the ideas of all the alters in the trial, and had the option to update their sets of 2 alters at the end of each round.

when the people who generate creative ideas are modeled as a social network (e.g., an academic network of researchers, or a network of marketers who seek to generate creative campaign ideas), one can reasonably expect that the highly creative people will become increasingly central in the network over time due to selective link formations [13]. a paradoxical set of possibilities emerges if, in addition to the creative aptitudes of people in the network, one also considers how people's demographic identities might impact the network dynamics in a creativity-centric social system. on the one hand, there is the notion of a diversity bonus that can help in creative ideation [14].
people from different demographic identities (i.e., gender or race) come with different concerns, perspectives, and life experiences. this can impact the knowledge domains they have access to, influencing the ideas they generate [15, 16, 17] . following the previous arguments, it seems provocatively promising for humans to form links with people from different demographic identities, so as to cash in on the diversity bonuses on offer. on the other hand, literature on social network formation and growth has extensively documented homophily, a phenomenon where people tend to form and sustain links with peers similar to themselves ('birds of a feather flock together') [18] . it can then be argued that people might forego the potentials of diversity bonuses in favor of the more comfortable similarity-based network connections. unfortunately, it is unclear from previous work how social networks might adapt to demographic cues when faced with creative ideation tasks. this leads to our first research question. if indeed social network connectivities are biased by demographic cues, how might that impact creative outcomes? previous work suggests that if multiple people draw inspirations from the same/similar stimuli, then even their independently stimulated ideas can become semantically similar to each other [12] . if people form and maintain social links only with peers from particular demographic identities (i.e., homophily-guided network dynamics), then it can result in making their stimuli set uniform as the diversity bonuses will go missing. this process can, consequently, hurt the variety of ideas stimulated-leading to counter-productive outcomes in the context of divergent creativity [19] . the converse is naturally likely if instead heterophily guides the network dynamics. exploring the effects of demographic cues on creative outcomes formulates our second research query. the importance of understanding these dynamics can be better appreciated in the context of the future of work. advances in ai and automation are increasingly shifting the nature of jobs from physical labor to ones that require various social-cognitive avenues of soft skills, especially creativity [20, 21, 22, 23, 24, 25] . the covid-19 pandemic is likely to accelerate this shift in jobs [26] . future workplaces will increasingly demand people to be creatively productive together with their peers as they tackle complex problems [27] . in doing so, people will inevitably be exposed to the demographic cues of their peers [28] . understanding how such cues can bias the dynamics of creativity-centric social systems is a prerequisite to any informed intervention that can potentially amplify the creative outcomes of people. yet, studying such a creativity-centric social system poses challenges of its own. while there exists a body of literature examining the effects of diversity on creative outcomes at individual and group levels [15, 16, 14, 29, 30] , such studies typically ignore the dynamic nature of human social networks. human networks continuously change as new ties are created all the time and existing ones fade. groups (i.e., fully connected networks) or static network settings fail to incorporate the effects of such dynamic link formation and dissolution, as the subjects either do not have the agency to choose their own social stimulation sources, or even if they do, it is not possible to unambiguously track such ties. as a result, effects from social network phenomena such as homophily become lost from the picture. 
on the other hand, the body of network science literature offers the tools to study such complex systems robustly, and have previously examined the effects of dynamic social networks on cooperation [31] , collective intelligence [32] , and even public speaking [33] . however, this body of work falls short when it comes to the interplays among diversity, creativity and homophily in dynamic social networks, which we address in this paper. it is challenging to identify a dataset in the wild that (1) allows for traceable links between ideas and their stimuli, (2) gives agency to the participants for choosing their own inspiration sources, (3) captures temporal evolution information of the dynamic network, and (4) offers a way of presenting the peers' demographic information to the participants without introducing unwanted confounding effects. we thus resort to a laboratory setting, where we draw from the relevant bodies of literature and improvise the study design to explore the desired research questions. using a modern socialmedia-like web interface for the interactions, we are able to address the first three challenges, while the fourth challenge is overcome by the use of avatars. we employ advanced tools for analyzing temporal networks, such as the separable temporal exponential random graph model, to tease out the network formation and persistence patterns therein. we also take advantage of the recently popularized natural language processing tools for computationally comparing semantic qualities of the creative ideas in our dataset. we consider two core research questions in this paper: (1) when people are tasked with creative idea generation, how do network connectivities adapt to the availability of demographic cues? (2) how are creative outcomes influenced by such demography-informed network dynamics? we run a web-based randomized control experiment with 5 rounds of creative ideation tasks (see materials and methods). the participants were recruited from amazon mechanical turk (see si appendix). to ensure that everyone had uniform stimuli for creative ideation, we adopt a bipartite study design [34] that involved two kinds of roles for the participants: as alters (n = 12) and as egos (n = 180). the alters' ideas were pre-recorded to be used as stimuli for the egos. the egos were randomly placed into either of two conditions: (1) control (n = 90, diversity cues not shown) or (2) treatment (n = 90, diversity cues shown). we ran two trials of the study. each trial consisted of 6 alters, whose ideas were shown as stimuli to the egos of both the control and treatment conditions in that trial (see materials and methods). to get started, each ego was randomly assigned to 'follow' 2 alters (out of 6 in the trial) by the researchers. in each round, the egos first generated ideas on their own (turn-1). in the control condition, the egos were then shown the ideas of the 2 alters they were following. however, in the treatment condition, the egos were additionally shown the demographic information (gender and race) of their followee alters (see materials and methods, and si appendix). this was the only difference between the two conditions. if the egos got inspired with new ideas, they could add those to their lists (turn-2). then, the egos were shown the ideas of all 6 alters, which they rated on novelty. finally, the egos were given the opportunity to optionally follow/unfollow alters to have an updated list of 2 followee alters each. 
in turn-2 of the following round, they were shown the ideas of their newly chosen alters. figure 1 visualizes the protocol. further quality control measures are elaborated in the materials and methods section.

same-gender links are highly stable in the presence of demographic cues

we employ separable temporal exponential random graph models [35, 36] to capture the link formation and persistence dynamics in our dataset. two separate models are fit for each of the control and treatment networks: (1) formation models and (2) persistence models.

fig. 2: to capture the formation and persistence dynamics of the links, two separate models are fit for the temporal networks of each study condition. the formation model tracks links that do not exist in time t, but exist in time t + 1, e.g., the green link between ego e_1 and alter a_4. the persistence model considers links that exist in both time instances, e.g., all grey links. since the egos had to follow a constant number of two alters, the red link e_1-a_2 needed to dissolve to allow the green link to form. however, if a link does not persist, it must dissolve; thus, the dissolving effects are captured in the persistence model and we need not fit a separate model for dissolution.

the alters were merely passive actors in the study, and the network dynamics were solely determined by the egos' choices of their followee alters. therefore, as exogenous features, we choose three attributes that the treatment egos were most likely to consider in making their connectivity decisions: (a) the round-wise creative performances of the alters (measured by non-redundant idea counts; see materials and methods), (b) gender-based homophily and (c) race-based homophily. in addition, we employ one endogenous network feature, the edge count, to control for network density. figure 2 summarizes the intuition, while the models and features are elaborated in the materials and methods section. in the control condition, we find that the link formations are significantly guided by the non-redundant idea counts of the alters (β = 0.324, z-value = 5.47, p < 10^-4). the positive β naturally suggests that better-performing alters are more likely to be followed by the egos. the gender and race features do not show any significant effect once the performance-based link formations are accounted for (p > 0.05 for both). this is intuitive, as the egos did not have any information about the alters' gender and race, and could only see the ideas of the alters. we observe the same trend in the link persistence model as well. the stability of the links can be significantly captured with the non-redundant idea counts of the alters (β = 0.421, z-value = 6.96, p < 10^-4), which, once again, shows an intuitively positive effect. in other words, if you are a high-performing alter, you will enjoy substantial likelihoods of gaining and retaining followers. as expected, the gender and race features do not show any significant effect (p > 0.05 for both). when it comes to the treatment condition, the link formation model yet again shows only the non-redundant idea counts of the alters to be a significant predictor (β = 0.197, z-value = 3.93, p < 10^-4), and not the demographic features (p > 0.05 for both). however, things get interesting in the link persistence model.
we find that, in the presence of demographic cues, the persistence of the links depends significantly on both the non-redundant idea counts (β = 0.355, z-value = 6.08, p < 10^-4) and gender-based homophily (β = 0.599, z-value = 3.74, p < 10^-3). in other words, if a link exists between participants of the same gender, its odds of persisting increase by 82.03%, after controlling for merit-based persistence. no significant effect is observed for the race feature (p > 0.05). in summary, when it comes to link persistence, the availability of demographic cues in the treatment condition is associated with a significant stability of same-gender links, unlike what is seen in the control condition.

inter-ego semantic similarity increases when the alters' demographic cues are known

fig. 3: cosine similarities between the idea-sets of pairs of egos are shown across three sub-groups: ego-pairs who share 0, 1 and 2 common alters between them. with the increase in the number of common alters (i.e., increase in stimuli similarity), the inter-ego semantic similarity typically increases. importantly, the semantic similarities are significantly higher in the treatment condition compared to the control condition across all three sub-groups. whiskers denote 95% c.i. ***p < 0.0001, corrected for multiple comparisons.

typical settings in convergent thinking or collective intelligence research explore how people, under various study conditions, can get close to known correct answers in estimation tasks [37, 32, 38, 39, 40].
post-hoc analysis on the art-fitted model reveals that the semantic similarity between ego-pairs increases as their number of common alters increases (0 vs 1 common alter: t(38376) = −6.61, p < 10 −4 ; 1 vs 2 common alter(s): t(38376) = −9.0, p < 10 −4 ). further pairwise comparisons using 2-tailed tests show that in the control condition, the idea-sets of ego-pairs who have both alters in common are significantly more similar to each other than idea-sets of ego-pairs who share one or no common alter (2 vs. 1 common alter: t(38376) = 8.57, p < 10 −4 ; 2 vs. 0 common alter: t(38376) = 9.26, p < 10 −4 ). however, there is no significant difference between the idea-sets of ego-pairs with one and no common alter (p > 0.05). this is in agreement with previously reported results [12] . in the treatment condition, the inter-ego similarities increase significantly as the number of common alters increase from 0 to 1 and also from 1 to 2 (0 vs. 1 common alter: t(38376) = −8.91, p < 10 −4 ; 1 vs. 2 common alters: t(38376) = −5.17, p < 10 −4 ). these trends intuitively follow the arguments of inter-follower similarities that can stem from having common stimulation sources. notably, we observe that the inter-ego semantic similarities are significantly higher in the treatment condition compared to their control counterparts, as revealed by post-hoc analysis on the study condition factor in the art-fitted model (t(38376) = 11.72, p < 10 −4 ). further pairwise comparisons using 2-tailed tests reveal that this result holds for all of the three common-alter-based subgroups (treatment vs. control; 2 common alters: t(38376) = 5.28, p < 10 −4 ; 1 common alter: t(38376) = 13.82, p < 10 −4 ; 0 common alter: t(38376) = 8.32, p < 10 −4 ). in other words, in the presence of demographic cues, the egos in the treatment condition not only maintained significant stability in same-gender links, but also demonstrated a significantly higher inter-ego semantic similarity compared to the egos in the control condition. all of the p -values reported here have been corrected for multiple comparisons using holm's sequential bonferroni procedure. figure 3 summarizes the results. how can we explain the increased inter-ego semantic similarity in the presence of demographic cues? we attempt to test the intuition that pairs of ideas generated by alters of the same demographic group tend to be more similar to each other than those generated by alters of different demographic groups. if that is indeed the case, it will logically follow that choosing alters from only a particular demographic group, as would result from homophily-guided network dynamics, can make a follower's stimuli idea-set uniform and similar. this can deprive the follower of any possible diversity bonuses. it can also partly explain the increase in inter-ego similarities that can stem from having similar stimuli. to that end, we first consider the sets of ideas that were uniquely submitted by the alters of male and non-male gender identities, but not both. we create vector representations for each of the distinct ideas in the two sets (see materials and methods). we then consider pairs of ideas from alters of the same gender, and compute their cosine similarities. however, we only consider idea-pairs from the same round, and if an idea-pair comes uniquely from a single person, we ignore that pair. similarly, we compute the cosine similarities between idea-pairs from alters of different genders. 
we find that idea-pairs within gender are indeed significantly more similar to each other than idea-pairs between gender (2-tailed test, t(4633) = 11.66, p < 10 −30 ). the same story is observed along the race dimension as well. in other words, we find that idea-pairs within race are significantly more similar to each other than idea-pairs between race (2-tailed test, t(4870) = 5.73, p < 10 −7 ). figure 4 summarizes the results. to substantiate the generality of these findings, we run the same statistical analysis on the entire dataset of our study and confirm similar significant trends of gender and race-based homogeneity in ideas (see si appendix). thus, we confirm the intuition that drawing stimuli ideas from the same demographic groups can indeed make one's inspiration sources uniform and similar. exploring how people navigate through each other's demographic differences in the society and how that affects their personal, social and professional lives is a research avenue with far-reaching practical implications. especially since the recent killing of george floyd, conversations on navigating such demographic differences constructively have seen a sharp spike. our insights join a growing body of literature that investigate how demography-driven behavior can influence human performance. we turn our focus on creativity, a soft-skill that is enjoying an accelerated demand as the world experiences a rapid shift towards automation due to the covid-19 pandemic. we find that in a creativity-centric social network, people's odds of maintaining same-gender links increase by 82.03%, after controlling for merit-based link persistence. such behavior is not observed if the demographic cues are not available to people in the first place. moreover, we find people's ideas to be more homogeneous within demographic groups than between. thus, homophily-guided link dynamics can reduce the diversity in people's creative stimuli set. this reinforces that intuition that the diversity bonuses might get compromised if one systematically maintains connections based on demographic identity. we indeed find that in the presence of demographic cues, the inter-ego semantic similarity increases significantly compared to the control condition where no such cues were shown to the participants. diversity effects, creative stimulation and social homophily-all of these are highly complex bodies of knowledge even when studied independently, driven by their own multidimensional mechanisms. the interplays among them naturally introduce further complexities into the equations. one cannot expect to understand all the subtle nuances therein with a one-size-fits-all solution. for example, expecting diversity to magically make every team superior will not be of much help. rather, given a particular goal, one needs to carefully contemplate whether and how the cognitive, functional or identity diversities might add a bonus to the team's performance [14] . creativity can take numerous forms in terms of expression, and the underlying cognitive mechanisms are elusive in their own right [45] . homophily is a robust phenomenon in social networks, yet recent work shows that as diversity increases, people can paradoxically perceive social groups as more similar [46] . naturally, our work does not encompass every possible combination of scenarios that can emerge in such complex systems. 
rather, we show one set of empirical evidences in support of our arguments linking the three interdisciplinary components, towards filling an important void in literature. we find evidence that demographic cues can indeed bias creativity-centric social network dynamics, which in turn can systematically influence the creative outcomes therein. as our creativity playground, we chose a text-based task in the alternate uses test. alongside the widely documented construct and predictive validities [47, 48] , this test has the added benefit that it allows us to employ the modern natural language processing tools to quantify and contrast creative outcomes robustly. the use of a bipartite network structure helped us keep the egos' stimuli sets uniform and thus track the dynamic link formation and persistence patterns in a clean manner. the use of avatars and social-media-like interaction interfaces further allowed us to overcome the natural challenges in meeting the complex experimental setup requisites. these insights can help make informed interventions in social systems where people's creative outcomes are sought after. for instance, consider the modern social media outlets, where people often follow the highly creative peers in their domains in hopes for getting inspirations for novel ideas. their choices of who to follow can naturally be biased by the demographic similarity with the peers as well. using our insights, algorithmic interventions can be made to help people diversify their creative stimulation sources. such measures can work to guard against the issues of inter-follower semantic similarities that we uncover, towards optimizing the network-wide creative outcomes. our work is not without limitations. the steep costs associated with collecting the data prohibited us from obtaining an even larger dataset. also, the study lasted for 5 rounds, which can be prohibitively short for capturing the full temporal effects in a creativity-centric social system. longitudinal studies with closer-to-life creative challenges and larger time-spans might generate elaborate insights on our research questions, which remain part of our future work. in this experiment, we are interested in divergent creativity, which deals with a person's ability to come up with or explore many possible solutions to a given problem [49] . we use a customized version of guilford's alternate uses test [50] , the canonical approach for quantifying divergent creative performances 1 . in each of the 5 rounds, the participants were instructed to consider an everyday object (e.g., a brick), whose common use was stated (e.g., a brick is used for building). the participants needed to come up with alternative uses for the object: uses that are different than the given use, yet are appropriate and feasible. we choose the first 5 common objects from the form b of guilford's test as the ideation objects in the 5 rounds. there were two trials in the study. in the first trial, the ideas of 6 alters were used as stimuli for 72 egos each in the control and treatment conditions. in the second trial, the other 6 alters acted as the stimulation sources for 18 egos each in the two conditions. all the alters and egos were assigned their roles and conditions randomly. each of turn-1 and turn-2 allowed the egos 3 minutes to submit their ideas. in turn-2, the egos in the control condition were shown only the pseudo usernames and the lists of ideas of their followee alters. 
the egos in the treatment condition were additionally shown the gender (male and non-male) and race (white and non-white) information using text and avatars (see si appendix). the avatars were used to ensure uniform visual depiction for all of the alters of the same demographic group, so as not to bias the egos by any facial, personality or other visual cues. the egos were instructed not to resubmit any of the alters' exact ideas, and that only non-redundant ideas would contribute to their performance. they were also told that there would be a short test at the end of the study, where they would need to recall the ideas shown to them. this was to ensure that the participants paid attention to the stimuli ideas, which has been shown to positively impact ideation performances [11, 51, 52, 53]. after turn-2, the egos rated all the ideas of the 6 alters in their trial on a 5-point likert scale (1: not novel, 5: highly novel) [54, 55]. as the egos optionally rewired their network connections to have an updated list of which alters to follow, they were required to submit the rationale behind their choices of updating/not updating links in each round. this was in place to make the egos accountable for their choices, which has been shown to raise epistemic motivation and improve systematic information processing [55, 56]. the participants were paid $10 upon the completion of the tasks, as well as a bonus of $5 if they were among the top 5 performers in groups of 18. against the pool of ideas submitted by one's peers, the number of non-redundant ideas that a participant comes up with is a widely accepted marker of his/her creativity [57, 58]; the intuition is that, to be creative, an idea has to be statistically rare. first, we filtered out inappropriate submissions that did not meet the requirements of being feasible and different from the given use. then, all the ideas submitted in a given round by all the participants were organized so that the same ideas are binned or collected together. we followed the coding rules described by bouchard and hare [59] and the rules specified in the scoring key of guilford's alternate uses test, form b, for binning the ideas. once all the ideas were binned, we computed the non-redundant idea counts by looking at the statistical rarity of the ideas submitted by the participants. namely, an idea was determined to be non-redundant if it was given by at most a threshold number of participants in a given pool of ideas. for the alters, the threshold was set to 1, and the pools were set to be the round-wise idea-sets of the 6 alters in the given trial (a minimal computational sketch of this count is given after the following paragraph). in the classic framework of the exponential random graph model (ergm), the observed network (i.e., the data collected by the researcher) is regarded as one realization out of a set of possible networks originating from an unknown stochastic process we wish to understand. the range of possible networks, and their probability of occurrence under the model, is represented by a probability distribution on the set of all possible graphs with the same number of nodes as the observed network. against these possible networks, we can then ask whether the observed network shows strong tendencies for structural characteristics that cannot be explained by random chance alone [60]. the basic expression for the classic (static) ergm can be written as

p(Y = y | X = x) = exp(β^T g(y, x)) / κ(β, x),    (1)

here, Y is the random variable for the state of the network (adjacency matrix), with a particular realization y.
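the non-redundant idea count described above can be computed with a few lines of code once the ideas are binned; the sketch below is ours, with hypothetical bin ids, and simply counts, per participant, the bins that occur at most a threshold number of times in the pool.

```python
# Sketch of the non-redundant idea count: an idea (bin id) is non-redundant if
# at most `threshold` participants in the pool produced an idea from that bin.
from collections import Counter

def non_redundant_counts(binned_ideas, threshold=1):
    pool = Counter(b for bins in binned_ideas.values() for b in set(bins))
    return {person: sum(1 for b in set(bins) if pool[b] <= threshold)
            for person, bins in binned_ideas.items()}

alters = {"a1": {3, 7, 9}, "a2": {3, 12}, "a3": {5}}   # hypothetical bin ids
print(non_redundant_counts(alters))                     # {'a1': 2, 'a2': 1, 'a3': 1}
```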
X denotes the vector of exogenous attribute variables, while x is the vector of observed attributes. β ∈ R^p is a p × 1 vector of parameters. g(y, x) is a p-dimensional vector of model statistics for the corresponding network y and attribute vector x. κ is a normalizing quantity which ensures that eq. 1 is a proper probability distribution. unfortunately, evaluating κ exactly is non-trivial. therefore, we need to resort to numerical methods to approximate the coefficients β̂. namely, we use markov chain monte carlo methods to simulate draws of y, and from those draws we estimate the coefficients using maximum likelihood estimation (the mcmc-mle method). such estimation methods make it convenient to transform eq. 1 into the following equivalent conditional log-odds form:

log [ p(Y_ij = 1 | Y^c_ij = y^c_ij, X = x) / p(Y_ij = 0 | Y^c_ij = y^c_ij, X = x) ] = β^T ∆_ij(y, x),    (2)

here, y^c_ij denotes all the observations of ties in y except y_ij. ∆_ij(y, x) is the change statistic, which denotes the change in the value of the network statistic g(y, x) when y_ij is toggled from 0 to 1. this emphasizes the log-odds of an individual tie conditional on all other ties. for our temporal network data, we employed an extension of the static ergm that deals with dynamic networks in discrete time: the separable temporal ergm (stergm) [35]. in contrast to static ergms, here we fitted two models: one for the underlying relational formation, and another for the relational persistence. in going from a network y_t at time t to a network y_{t+1} at time t + 1, the formation and persistence of ties are assumed to occur independently of each other within each time step (hence 'separable'), and are captured by the two models respectively. the governing equations for the formation and persistence models, analogous to eq. 2, are then written respectively as:

log [ p(Y_ij,t+1 = 1 | Y^c_ij, X = x, Y_ij,t = 0) / p(Y_ij,t+1 = 0 | Y^c_ij, X = x, Y_ij,t = 0) ] = β_f^T ∆_ij,f(y, x),    (3)

log [ p(Y_ij,t+1 = 1 | Y^c_ij, X = x, Y_ij,t = 1) / p(Y_ij,t+1 = 0 | Y^c_ij, X = x, Y_ij,t = 1) ] = β_p^T ∆_ij,p(y, x),    (4)

here, time indices have been added to the equations, as well as new conditionals. in the formation model in eq. 3, the expression is conditional on the tie not existing at the previous time step, whereas in the persistence model in eq. 4, it is conditional on the tie existing. figure 2 summarizes these intuitions. there are separate coefficient vectors β_f and β_p for the formation and persistence models respectively, as well as separate change statistics ∆_ij,f(y, x) and ∆_ij,p(y, x) for the two models. note that in the literature it is common to refer to the persistence model as the 'dissolution' model instead. however, given how eq. 4 is set up, and given that positive coefficients in this model indicate link persistence rather than dissolution, we take the liberty of referring to it as the persistence model. since our network is bipartite, we considered links to form between 'actors' i (egos) and 'events' j (alters). to that end, we employed one endogenous and three exogenous features. namely, we used the number of edges as the endogenous feature, which controls for network density: g_1(y, x) = Σ_ij y_ij = n_e. as exogenous features, we included:
1. the alters' creative performances (i.e., non-redundant idea counts, x^(score)): g_2(y, x) = Σ_ij y_ij x_j^(score)
2. gender-based homophily between the egos and alters: g_3(y, x) = Σ_ij y_ij I{x_i^(gender) = x_j^(gender)}
3. race-based homophily between the egos and alters: g_4(y, x) = Σ_ij y_ij I{x_i^(race) = x_j^(race)}
where I{·} denotes the indicator function. these four features constitute the network statistic g(y, x), which is then used in computing the change statistics in eqns. 3 and 4.
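to make the interpretation of the fitted coefficients concrete (their exponentials are odds multipliers, as noted in the next paragraph), the short calculation below recovers the 82.03% figure reported in the results from the same-gender persistence coefficient β = 0.599.

```python
# The fitted coefficient is a conditional log-odds; exponentiating it gives the
# multiplicative change in the odds of a same-gender tie persisting.
import math

beta_same_gender = 0.599
odds_ratio = math.exp(beta_same_gender)
print(f"odds multiplier: {odds_ratio:.4f}")                 # ~1.8203
print(f"percent increase: {(odds_ratio - 1) * 100:.2f}%")   # ~82.03%
```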
note that the fitted coefficients β_f and β_p are conditional log-odds coefficients, so their exponentials can intuitively be interpreted as the factors by which the odds of the formation and persistence of the network ties change, respectively. for our implementation, we used the tergm package available within the statnet suite in r [36]. to semantically compare the idea-sets of the egos, we first removed stop words and punctuation marks to convert the idea-sets into bag-of-words documents. we represented each document by taking the word2vec embeddings of all of the words in the document and computing the centroid of those embedded vectors. the centroid of a set of vectors is defined as the vector that has the minimum sum of squared distances to the other vectors in the set. this centroid is then used as the final document-vector representation of the given idea-set [43]. word2vec is a popular word-embedding algorithm, which employs skip-gram with negative sampling to train 300-dimensional embeddings of words [42]. given two idea-sets, we computed their document vectors u and v, and estimated the similarity between the two vectors by taking their cosine similarity, cos(u, v) = (u · v) / (‖u‖ ‖v‖). the same idea can be phrased differently by different people. therefore, we made use of the manual binnings of ideas described in the quantifying creativity subsection, where all the different phrasings of the same idea were collected under a common bin id. to compare the sets of ideas generated by various demographic groups, we first collected the bin ids of ideas that were submitted uniquely by the various demographic groups (i.e., male only, non-male only, white only, non-white only). under each bin id, all the different phrasings of the idea were collected in a bag-of-words document, with all stop words and punctuation marks removed. as before, we took the word2vec embeddings of the words in this document and computed their centroid to be the final vector representation of the idea. cosine similarity was used to compute the similarities between pairs of idea-vectors.
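a minimal sketch of this document-vector pipeline is given below; it assumes gensim ≥ 4 and a pretrained 300-dimensional word2vec model on disk, and the file name and example word lists are placeholders rather than artifacts of the study.

```python
# Sketch: mean word2vec embedding as a document vector, plus cosine similarity.
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("word2vec-300d.bin", binary=True)

def doc_vector(words):
    vecs = [kv[w] for w in words if w in kv]      # skip out-of-vocabulary words
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

ideas_a = ["doorstop", "paperweight", "garden", "border"]   # stop words removed
ideas_b = ["paperweight", "bookend", "weight"]
print(cosine(doc_vector(ideas_a), doc_vector(ideas_b)))
```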
[1] humans have evolved specialized skills of social cognition: the cultural intelligence hypothesis
[2] the secret of our success: how culture is driving human evolution, domesticating our species, and making us smarter
[3] the big man mechanism: how prestige fosters cooperation and creates prosocial leaders
[4] individual motivations and network effects: a multilevel analysis of the structure of online social relationships
[5] network happiness: how online social interactions relate to our well being
[6] the cultural niche: why social learning is essential for human adaptation
[7] toward collaborative ideation at scale: leveraging ideas from others to generate more creative and diverse ideas
[8] comparing different sensemaking approaches for large-scale ideation
[9] cognitive stimulation and interference in idea generating groups
[10] toward more creative and innovative group idea generation: a cognitive-social-motivational perspective of brainstorming
[11] how the group affects the mind: a cognitive model of idea generation in groups
[12] creativity in dynamic networks: how divergent thinking is impacted by one's choice of peers
[13] the social side of creativity: a static and dynamic social network perspective
[14] the diversity bonus: how great teams pay off in the knowledge economy
[15] opinion: gender diversity leads to better science
[16] the diversity-innovation paradox in science
[17] getting specific about demographic diversity variable and team performance relationships: a meta-analysis
[18] birds of a feather: homophily in social networks
[19] creativity: theories and themes: research, development, and practice
[20] toward understanding the impact of artificial intelligence on labor
[21] track how technology is transforming work
[22] what can machine learning do? workforce implications
[23] a future that works: automation, employment, and productivity
[24] unpacking the polarization of workplace skills
[25] upskilling together: how peer-interaction influences speaking-skills development online
[26] coronavirus may mean automation is coming sooner than we thought
[27] large teams develop and small teams disrupt science and technology
[28] the effects of racial diversity congruence between upper management and lower management on firm productivity
[29] cognitive team diversity and individual team member creativity: a cross-level interaction
[30] the paradox of diversity management, creativity and innovation. creativity and innovation management
[31] dynamic social networks promote cooperation in experiments with humans
[32] adaptive social networks promote the wisdom of crowds. proceedings of the national academy of sciences
[33] buildup of speaking skills in an online learning community: a network-analytic exploration
[35] a separable model for dynamic networks
[36] tergm: fit, simulate and diagnose models for network evolution based on exponential-family random graph models
[37] vox populi (the wisdom of crowds)
[38] aggregated knowledge from a small number of debates outperforms the wisdom of large crowds
[39] network dynamics of social influence in the wisdom of crowds
[40] improving collective estimations using resistance to social influence
[41] explaining creativity: the science of human innovation
[42] distributed representations of words and phrases and their compositionality
[43] speech and language processing: an introduction to speech recognition, computational linguistics and natural language processing
[44] the aligned rank transform for nonparametric factorial analyses using only anova procedures
[45] your creative brain: seven steps to maximize imagination, productivity, and innovation in your life
[46] as diversity increases, people paradoxically perceive social groups as more similar
[47] the relationships among two cloze measurement procedures and divergent thinking abilities
[48] prediction of academic achievement with divergent and convergent thinking and personality variables
[49] theories of creativity. the cambridge handbook of creativity
[50] alternate uses: manual of instructions and interpretation
[51] cognitive and social comparison processes in brainstorming
[52] groups, teams, and creativity: the creative potential of idea-generating groups
[53] modeling cognitive interactions during group brainstorming
[54] divergent and convergent group creativity in an asynchronous online environment
[55] motivated information processing, social tuning, and group creativity
[56] motivated information processing and group decision-making: effects of process accountability on information processing and decision quality
[57] give your ideas some legs: the positive effect of walking on creative thinking
[58] shining (blue) light on creative ability
[59] size, performance, and potential in brainstorming groups
[60] an introduction to exponential random graph (p*) models for social networks

key: cord-020885-f667icyt authors: sharma, ujjwal; rudinac, stevan; worring, marcel; demmers, joris; van dolen, willemijn title: semantic path-based learning for review volume prediction date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_54 sha: doc_id: 20885 cord_uid: f667icyt graphs offer a natural abstraction for modeling complex real-world systems where entities are represented as nodes and edges encode relations between them. in such networks, entities may share common or similar attributes and may be connected by paths through multiple attribute modalities. in this work, we present an approach that uses semantically meaningful, bimodal random walks on real-world heterogeneous networks to extract correlations between nodes and bring together nodes with shared or similar attributes. an attention-based mechanism is used to combine multiple attribute-specific representations in a late fusion setup. we focus on a real-world network formed by restaurants and their shared attributes and evaluate performance on predicting the number of reviews a restaurant receives, a strong proxy for popularity. our results demonstrate the rich expressiveness of such representations in predicting review volume and the ability of an attention-based model to selectively combine individual representations for maximum predictive power on the chosen downstream task.
multimodal graphs have been extensively used in modeling real-world networks where entities interact and communicate with each other through multiple information pathways or modalities [1, 23, 31] . each modality encodes a distinct view of the relation between nodes. for example, within a social network, users can be connected by their shared preference for a similar product or by their presence in the same geographic locale. each of these semantic contexts links the same user set with a distinct edge set. such networks have been extensively used for applications like semantic proximity search in existing interaction networks [7] , augmenting semantic relations between entities [36] , learning interactions in an unsupervised fashion [3] and augmenting traditional matrix factorization-based collaborative filtering models for recommendation [27] . each modality within a multimodal network encodes a different semantic relation and exhibits a distinct view of the network. while such views contain relations between nodes based on interactions within a single modality, observed outcomes in the real-world are often a complex combination of these interactions. therefore, it is essential to compose these complementary interactions meaningfully to build a better representation of the real world. in this work, we examine a multimodal approach that attempts to model the review-generation process as the end-product of complex interactions within a restaurant network. restaurants share a host of attributes with each other, each of which may be treated as a modality. for example, they may share the same neighborhood, the same operating hours, similar kind of cuisine, or the same 'look and feel'. furthermore, each of these attributes only uncovers a specific type of relation. for example, a view that only uses the location-modality will contain venues only connected by their colocation in a common geographical unit and will prioritize physical proximity over any other attribute. broadly, each of these views is characterized by a semantic context and encodes modality-specific relations between restaurants. these views, although informative, are complementary and only record associations within the same modality. while each of these views encodes a part of the interactions within the network, performance on a downstream task relies on a suitable combination of views pertinent to the task [5] . in this work, we use metapaths as a semantic interface to specify which relations within a network may be relevant or meaningful and worth investigating. we generate bimodal low-dimensional embeddings for each of these metapaths. furthermore, we conjecture that their relevance on a downstream task varies with the nature of the task and that this task-specific modality relevance should be learned from data. in this work, -we propose a novel method that incorporates restaurants and their attributes into a multimodal graph and extracts multiple, bimodal low dimensional representations for restaurants based on available paths through shared visual, textual, geographical and categorical features. -we use an attention-based fusion mechanism for selectively combining representations extracted from multiple modalities. -we evaluate and contrast the performance of modality-specific representations and joint representations for predicting review volume. 
the principle challenge in working with multimodal data revolves around the task of extracting and assimilating information from multiple modalities to learn informative joint representations. in this section, we discuss prior work that leverages graph-based structures for extracting information from multiple modalities, focussing on the auto-captioning task that introduced such methods. we then examine prior work on network embeddings that aim to learn discriminative representations for nodes in a graph. graph-based learning techniques provide an elegant means for incorporating semantic similarities between multimedia documents. as such, they have been used for inference in large multimodal collections where a single modality may not carry sufficient information [2] . initial work in this domain was structured around the task of captioning unseen images using correlations learned over multiple modalities (tag-propagation or auto-tagging). pan et al. use a graph-based model to discover correlations between image features and text for automatic image-captioning [21] . urban et al. use an image-context graph consisting of captions, image features and images to retrieve relevant images for a textual query [32] . stathopoulos et al. [28] build upon [32] to learn a similarity measure over words based on their co-occurrence on the web and use these similarities to introduce links between similar caption words. rudinac et al. augment the image-context graph with users as an additional modality and deploy it for generating visual-summaries of geographical regions [25] . since we are interested in discovering multimodal similarities between restaurants, we use a graph layout similar to the one proposed by pan et al. [21] for the image auto-captioning task but replace images with restaurants as central nodes. other nodes containing textual features, visual features and users are retained. we also add categorical information like cuisines as a separate modality, allowing them to serve as semantic anchors within the representation. graph representation learning aims to learn mappings that embed graph nodes in a low-dimensional compressed representation. the objective is to learn embeddings where geometric relationships in the compressed embedding space reflect structural relationships in the graph. traditional approaches generate these embeddings by finding the leading eigenvectors from the affinity matrix for representing nodes [16, 24] . with the advent of deep learning, neural networks have become increasingly popular for learning such representations, jointly, from multiple modalities in an end-to-end pipeline [4, 11, 14, 30, 34] . existing random walk-based embedding methods are extensions of the random walks with restarts (rwr) paradigm. traditional rwr-based techniques compute an affinity between two nodes in a graph by ascertaining the steadystate transition probability between them. they have been extensively used for the aforementioned auto-captioning tasks [21, 25, 28, 32] , tourism recommendation [15] and web search as an integral part of the pagerank algorithm [20] . deep learning-based approaches build upon the traditional paradigm by optimizing the co-occurrence statistics of nodes sampled from these walks. deepwalk [22] uses nodes sampled from short truncated random walks as phrases to optimize a skip-gram objective similar to word2vec [17] . 
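to make the random-walk-plus-skip-gram idea behind deepwalk concrete, the sketch below samples short truncated random walks from a toy homogeneous graph and feeds them, as sentences of node ids, to a word2vec skip-gram model. this is a minimal illustrative sketch rather than the implementation used in the papers discussed here; the toy graph, the walk counts and the use of the gensim library are assumptions made for illustration.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(graph, start, length):
    """Truncated unbiased random walk starting from `start`."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

def deepwalk_embeddings(graph, walks_per_node=10, walk_length=40, dim=48, window=5):
    # Treat each walk as a "sentence" of node ids and train a skip-gram model on them.
    walks = [
        [str(n) for n in random_walk(graph, node, walk_length)]
        for node in graph.nodes()
        for _ in range(walks_per_node)
    ]
    model = Word2Vec(sentences=walks, vector_size=dim, window=window,
                     sg=1, min_count=0, workers=2, epochs=10)
    return {node: model.wv[str(node)] for node in graph.nodes()}

if __name__ == "__main__":
    g = nx.karate_club_graph()                      # toy homogeneous graph
    emb = deepwalk_embeddings(g)
    print(len(emb), len(next(iter(emb.values()))))  # 34 nodes, 48-dim vectors
```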
similarly, node2vec augments this learning paradigm with second-order random walks parameterized by exploration parameters p and q which control between the importance of homophily and structural equivalence in the learnt representations [8] . for a homogeneous network, random walk based methods like deepwalk and node2vec assume that while the probabilities of transitioning from one node to another can be different, every transition still occurs between nodes of the same type. for heterogeneous graphs, this assumption may be fallacious as all transitions do not occur between nodes of the same type and consequently, do not carry the same semantic context. indeed, our initial experiments with node2vec model suggest that it is not designed to handle highly multimodal graphs. clements et al. [5] demonstrated that in the context of content recommendation, the importance of modalities is strongly task-dependent and treating all edges in heterogeneous graphs as equivalent can discard this information. metapath2vec [6] remedies this by introducing unbiased walks over the network schema specified by a metapath [29] , allowing the network to learn the semantics specified by the metapath rather than those imposed purely by the topology of the graph. metapath-based approaches have been extended to a variety of other problems. hu et al. use an exhaustive list of semantically-meaningful metapaths for extracting top-n recommendations with a neural co-attention network [10] . shi et al. use metapath-specific representations in a traditional matrix factorization-based collaborative filtering mechanism [27] . in this work, we perform random walks on sub-networks of a restaurant-attribute network containing restaurants and attribute modalities. these attribute modalities may contain images, text or categorical features. for each of these sub-networks, we perform random walks and use a variant of the heterogeneous skip-gram objective introduced in [6] to generate low-dimensional bimodal embeddings. bimodal embeddings have several interesting properties. training relations between two modalities provide us with a degree of modularity where modalities can be included or held-out from the prediction model without affecting others. it also makes training inexpensive as the number of nodes when only considering two modalities is far lower than in the entire graph. in this section, we begin by providing a formal introduction to graph terminology that is frequently referenced in this paper. we then move on to detail our proposed method illustrated in fig. 1 . formally, a heterogeneous graph is denoted by g = (v, e, φ, σ) where v and e denote the node and edge sets respectively. for every node and edge, there exists mapping functions φ(v) → a and σ(e) → r where a and r are sets of node types and edge types respectively such that |a + r| > 2. for a heterogeneous graph g = (v, e, φ, σ), a network schema is a metagraph m g = (a, r) where a is the set of node types in v and r is the set of edge types in e. a network schema enumerates the possible node types and edge types that can occur within a network. a metapath m(a 1 , a n ) is a path on the network schema m g consisting of a sequence of ordered edge transitions: we use tripadvisor to collect information for restaurants in amsterdam. each venue characteristic is then embedded as a separate node within a multimodal graph. 
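the bimodal, metapath-guided walks described above can be sketched as follows: the walk is forced to alternate node types according to a metapath such as venue - cuisine - venue, so that every sampled sequence stays within the two chosen modalities. the toy graph, the node names and the 'type' attribute convention are illustrative assumptions, not the authors' code.

```python
import random
import networkx as nx

def metapath_walk(graph, start, metapath, length):
    """
    Walk that alternates node types according to `metapath`,
    e.g. ('venue', 'cuisine') yields venue-cuisine-venue-... sequences.
    Node types are read from the 'type' attribute of each node.
    """
    walk = [start]
    i = 0
    while len(walk) < length:
        next_type = metapath[(i + 1) % len(metapath)]
        candidates = [n for n in graph.neighbors(walk[-1])
                      if graph.nodes[n]['type'] == next_type]
        if not candidates:
            break
        walk.append(random.choice(candidates))
        i += 1
    return walk

# toy bimodal graph: venues connected to the cuisines they serve
g = nx.Graph()
g.add_nodes_from(['v1', 'v2', 'v3'], type='venue')
g.add_nodes_from(['indonesian', 'fast-food'], type='cuisine')
g.add_edges_from([('v1', 'indonesian'), ('v2', 'indonesian'), ('v3', 'fast-food')])

print(metapath_walk(g, 'v1', ('venue', 'cuisine'), length=7))
# e.g. ['v1', 'indonesian', 'v2', 'indonesian', 'v1', ...]
```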
in the figure above r nodes denote restaurants, i nodes denote images for a restaurant, d nodes are review documents, a nodes are categorical attributes for restaurants and l nodes are locations. bimodal random walks are used to extract pairwise correlations between nodes in separate modalities which are embedded using a heterogeneous skip-gram objective. finally, an attention-based fusion model is used to combine multiple embeddings together to regress the review volume for restaurants. let g = (v, e) be the heterogeneous graph with a set of nodes v and edges e. we assume the graph to be undirected as linkages between venues and their attributes are inherently symmetric. below, we describe the node types used to construct the graph (cf. figs. 1 and 2 and use the penultimate layer output as a compressed low-dimensional representation for the image. since the number of available images for each venue may vary dramatically depending on its popularity, adding a node for every image can lead to an unreasonably large graph. to mitigate this issue, we cluster image features for each restaurant using the k-means algorithm and use the cluster centers as representative image features for a restaurant, similar to zahálka et al. [35] . we chose k = 5 as a reasonable trade-off between the granularity of our representations and tractability of generating embeddings for this modality. the way patrons write about a restaurant and the usage of specialized terms can contain important information about a restaurant that may be missing from its categorical attributes. for example, usage of the indian cottage cheese 'paneer' can be found in similar cuisine types like nepali, surinamese, etc. and user reviews talking about dishes containing 'paneer' can be leveraged to infer that indian and nepali cuisines share some degree of similarity. to model such effects, we collect reviews for every restaurant. since individual reviews may not provide a comprehensive unbiased picture of the restaurant, we chose not to treat them individually, but to consider them as a single document. we then use a distributed bag-ofwords model from [13] to generate low-dimensional representations of these documents for each restaurant. since the reviews of a restaurant can widely vary based on its popularity, we only consider the 10 most recent reviews for each restaurant to prevent biases from document length getting into the model. 6. users: since tripadvisor does not record check-ins, we can only leverage explicit feedback from users who chose to leave a review. we add a node for each of the users who visited at least two restaurants in amsterdam and left a review. similar to [25, 28, 32] , we introduce two kinds of edges in our graph: 1. attribute edges: these are heterogeneous edges that connect a restaurant node to the nodes of its categorical attributes, image features, review features and users. in our graph, we instantiate them as undirected, unweighted edges. 2. similarity edges: these are homogeneous edges between the feature nodes within a single modality. for image features, we use a radial basis function as a non-linear transformation of the euclidean distances between image feature vectors. for document vectors, we use cosine similarity to find restaurants with similar reviews. adding a weighted similarity edge between every node in the same modality would yield an extremely dense adjacency matrix. to avoid this, we only add similarity links between a node and its k nearest neighbors in each modality. 
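a minimal sketch of the similarity edges described above is given below: for image feature nodes an rbf transform of euclidean distances is used, for review-document vectors cosine similarity, and each node is linked only to its k nearest neighbours within its own modality. the choice of scikit-learn, the value of k and the rbf bandwidth gamma are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity

def knn_similarity_edges(features, node_ids, k=3, metric='rbf', gamma=0.5):
    """
    Return weighted edges linking each node to its k nearest neighbours
    within a single modality, as described for the similarity edges.
    `features` is an (n, d) array, `node_ids` the matching node labels.
    """
    if metric == 'rbf':
        # RBF transform of euclidean distances (used for image feature nodes)
        sim = np.exp(-gamma * euclidean_distances(features) ** 2)
    else:
        # cosine similarity (used for review-document vectors)
        sim = cosine_similarity(features)
    np.fill_diagonal(sim, -np.inf)          # exclude self-links
    edges = []
    for i, row in enumerate(sim):
        for j in np.argsort(row)[-k:]:      # k most similar neighbours
            edges.append((node_ids[i], node_ids[j], float(row[j])))
    return edges

rng = np.random.default_rng(0)
img_centers = rng.normal(size=(6, 8))       # e.g. k-means cluster centres per venue
print(knn_similarity_edges(img_centers, [f'i{n}' for n in range(6)], k=2))
```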
by choosing the nearest k neighbors, we make our similarity threshold adaptive, allowing it to adjust to varying scales of distance in multiple modalities. metapaths can provide a modular and simple interface for injecting semantics into the network. since metapaths, in our case, are essentially paths over the modality set, they can be used to encode inter-modality correlations. in this work, we generate embeddings with two specific properties: 1. all metapaths are binary and only include transitions over two modalities. since venues/restaurants are always a part of the metapath, we only include one other modality. 2. during optimization, we only track the short-range context by choosing a small window size. window size is the maximum distance between the input node and a predicted node in a walk. in our model, walks over the metapath only capture short-range semantic contexts and the choice of a larger window can be detrimental to generalization. for example, consider a random walk over the restaurant - cuisine - restaurant metapath (in the sampled node sequence shown in the original figure, restaurants are in red while cuisines are in blue). optimizing over a large context window can lead to mcdonald's (fast-food cuisine) and kediri (indonesian cuisine) being placed close in the embedding space. this is erroneous and does not capture the intended semantics, which should bring restaurants closer only if they share the exact attribute. we use the metapaths in table 1 to perform unbiased random walks on the graph detailed in sect. 3.2. each of these metapaths enforces similarity based on certain semantics. we train separate embeddings using the heterogeneous skip-gram objective similar to [6]. for every metapath, we maximize the probability of observing the heterogeneous context $n_a(v)$ given the node $v$:

$\arg\max_{\theta} \sum_{v \in v_m} \sum_{a \in a_m} \sum_{c_a \in n_a(v)} \log p(c_a \mid v; \theta)$   (3)

in eq. (3), $a_m$ is the node type-set and $v_m$ is the node-set for metapath $m$. the original metapath2vec model [6] uses multiple metapaths [29] to learn separate embeddings, some of which perform better than the others. on the dblp bibliographic graph that consists of authors (a), papers (p) and venues (v), the performance of their recommended metapath 'a-p-v-p-a' was empirically better than the alternative metapath 'a-p-a' on the node classification task. (fig. 3 caption: attention-weighted modality fusion: metapath-specific embeddings are fed into a common attention mechanism that generates an attention vector. each modality is then reweighted with the attention vector and concatenated. this joint representation is then fed into a ridge regressor to predict the volume of ratings for each restaurant.) at this point, it is important to recall that in our model, each metapath extracts a separate view of the same graph. these views may contain complementary information and it may be disadvantageous to only retain the best performing view. for an optimal representation, these complementary views should be fused. in this work, we employ an embedding-level attention mechanism similar to the attention mechanism introduced in [33] that selectively combines embeddings based on their performance on a downstream task. assuming s to be the set of metapath-specific embeddings for metapaths m_1, m_2, ..., m_n, following the approach outlined in fig.
3 , we can denote it as: we then use a two-layer neural network to learn an embedding-specific attention a mn for metapath m n : further, we perform a softmax transformation of the attention network outputs to an embedding-specific weight finally, we concatenate the attention-weighted metapath-specific embeddings to generate a fused embedding we evaluate the performance of the embedding fusion model on the task of predicting the volume (total count) of reviews received by a restaurant. we conjecture that the volume of reviews is an unbiased proxy for the general popularity and footfall for a restaurant and is more reliable than indicators like ranking or ratings which may be biased by tripadvisor's promotion algorithms. we use the review volume collected from tripadvisor as the target variable and model this task as a regression problem. data collection. we use publicly-available data from tripadvisor for our experiments. to build the graph detailed in sect. 3.2, we collect data for 3,538 restaurants in amsterdam, the netherlands that are listed on tripadvisor. we additionally collect 168,483 user-contributed restaurant reviews made by 105,480 unique users, of which only 27,318 users visit more than 2 restaurants in the city. we only retain these 27,318 users in our graph and drop others. we also collect 215,544 user-contributed images for these restaurants. we construct the restaurant network by embedding venues and their attributes listed in table 1 as nodes. bimodal embeddings. we train separate bimodal embeddings by optimizing the heterogeneous skip-gram objective from eq. (3) using stochastic gradient descent and train embeddings for all metapaths enumerated in table 1 . we use restaurant nodes as root nodes for the unbiased random walks and perform 80 walks per root node, each with a walk length of 80. each embedding has a dimensionality of 48, uses a window-size of 5 and is trained for 200 epochs. embedding fusion models. we chose two fusion models in our experiments to analyze the efficacy of our embeddings: 1. simple concatenation model: we use a model that performs a simple concatenation of the individual metapath-specific embeddings detailed in sect. 3.4 to exhibit the baseline performance on the tasks detailed in sect. 4. simple concatenation is a well-established additive fusion technique in multimodal deep learning [18, 19] . each of the models uses a ridge regression algorithm to estimate the predictive power of each metapath-specific embedding on the volume regression task. this regressor is jointly trained with the attention model in the attention-weighted model. all models are optimized using stochastic gradient descent with the adam optimizer [12] with a learning rate of 0.1. in table 2 , we report the results from our experiments on the review-volume prediction task. we observe that metapaths with nodes containing categorical attributes perform significantly better than vector-based features. in particular, categorical attributes like cuisines, facilities, and price have a significantly higher coefficient of determination (r 2 ) as compared to visual feature nodes. it is interesting to observe here that nodes like locations, images, and textual reviews are far more numerous than categorical nodes and part of their decreased performance may be explained by the fact that our method of short walks may not be sufficiently expressive when the number of feature nodes is large. 
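a compact sketch of the attention-weighted fusion described above is shown below: a two-layer network scores each metapath-specific embedding, the scores are softmax-normalised, the embeddings are reweighted and concatenated, and a linear head with an l2 penalty (a ridge-style regressor) predicts review volume, optimised with adam at the reported learning rate of 0.1. the hidden size, the weight-decay value and the pytorch framing are assumptions; the authors' exact architecture may differ.

```python
import torch
import torch.nn as nn

class AttentionFusionRegressor(nn.Module):
    """Embedding-level attention over metapath-specific embeddings,
    followed by a linear (ridge-style) regressor on the fused vector."""
    def __init__(self, n_metapaths, dim, hidden=32):
        super().__init__()
        # two-layer network producing one attention score per metapath embedding
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.regressor = nn.Linear(n_metapaths * dim, 1)

    def forward(self, embeddings):                        # embeddings: (batch, n_metapaths, dim)
        scores = self.attn(embeddings).squeeze(-1)        # (batch, n_metapaths)
        weights = torch.softmax(scores, dim=-1)           # softmax over metapaths
        weighted = embeddings * weights.unsqueeze(-1)     # re-weight each embedding
        fused = weighted.flatten(start_dim=1)             # concatenate the weighted embeddings
        return self.regressor(fused).squeeze(-1)

# toy usage: 5 metapath-specific 48-d embeddings per restaurant
model = AttentionFusionRegressor(n_metapaths=5, dim=48)
opt = torch.optim.Adam(model.parameters(), lr=0.1, weight_decay=1e-3)  # weight decay ~ ridge penalty
x = torch.randn(16, 5, 48)
y = torch.rand(16) * 100                                  # review volumes
for _ in range(50):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
```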
in addition, as mentioned in related work, we performed these experiments with the node2vec model, but since it is not designed for heterogeneous multimodal graphs, it yielded performance scores far below the weakest single modality. a review of the fusion models indicates that taking all the metapaths together can improve performance significantly. the baseline simple concatenation fusion model, commonly used in literature, is considerably better than the best-performing metapath (venues -facilities -venues). the attention basedmodel builds significantly over the baseline performance and while it employs a similar concatenation scheme as the baseline concatenation model, the introduction of the attention module allows it to handle noisy and unreliable modalities. the significant increase in the predictive ability of the attention-based model can be attributed to the fact that while all modalities encode information, some of them may be less informative or reliable than others, and therefore contribute less to the performance of the model. our proposed fusion approach is, therefore, capable of handling weak or noisy modalities appropriately. in this work, we propose an alternative, modular framework for learning from multimodal graphs. we use metapaths as a means to specify semantic relations between nodes and each of our bimodal embeddings captures similarities between restaurant nodes on a single attribute. our attention-based model combines separately learned bimodal embeddings using a late-fusion setup for predicting the review volume of the restaurants. while each of the modalities can predict the volume of reviews to a certain extent, a more comprehensive picture is only built by combining complementary information from multiple modalities. we demonstrate the benefits of our fusion approach on the review volume prediction task and demonstrate that a fusion of complementary views provides the best way to learn from such networks. in future work, we will investigate how the technique generalises to other tasks and domains. 
mantis: system support for multimodal networks of in-situ sensors hyperlearn: a distributed approach for representation learning in datasets with many modalities interaction networks for learning about objects, relations and physics heterogeneous network embedding via deep architectures the task-dependent effect of tags and ratings on social media access metapath2vec: scalable representation learning for heterogeneous networks m-hin: complex embeddings for heterogeneous information networks via metagraphs node2vec: scalable feature learning for networks deep residual learning for image recognition leveraging meta-path based context for top-n recommendation with a neural co-attention model multimodal network embedding via attention based multi-view variational autoencoder adam: a method for stochastic gradient descent distributed representations of sentences and documents deep collaborative embedding for social image understanding how random walks can help tourism image labeling on a network: using social-network metadata for image classification distributed representations of words and phrases and their compositionality multimodal deep learning multi-source deep learning for human pose estimation the pagerank citation ranking: bringing order to the web gcap: graph-based automatic image captioning deepwalk: online learning of social representations the visual display of regulatory information and networks nonlinear dimensionality reduction by locally linear embedding generating visual summaries of geographic areas using community-contributed images imagenet large scale visual recognition challenge heterogeneous information network embedding for recommendation semantic relationships in multi-modal graphs for automatic image annotation pathsim: meta path-based top-k similarity search in heterogeneous information networks line: large-scale information network embedding study on optimal frequency design problem for multimodal network using probit-based user equilibrium assignment adaptive image retrieval using a graph model for semantic feature integration heterogeneous graph attention network network representation learning with rich text information interactive multimodal learning for venue recommendation metagraph2vec: complex semantic path augmented heterogeneous network embedding key: cord-234918-puunbcio authors: shalu, hrithwik; harikrishnan, p; das, akash; mandal, megdut; sali, harshavardhan m; kadiwala, juned title: a data-efficient deep learning based smartphone application for detection of pulmonary diseases using chest x-rays date: 2020-08-19 journal: nan doi: nan sha: doc_id: 234918 cord_uid: puunbcio this paper introduces a paradigm of smartphone application based disease diagnostics that may completely revolutionise the way healthcare services are being provided. although primarily aimed to assist the problems in rendering the healthcare services during the coronavirus pandemic, the model can also be extended to identify the exact disease that the patient is caught with from a broad spectrum of pulmonary diseases. the app inputs chest x-ray images captured from the mobile camera which is then relayed to the ai architecture in a cloud platform, and diagnoses the disease with state of the art accuracy. doctors with a smartphone can leverage the application to save the considerable time that standard covid-19 tests take for preliminary diagnosis. 
the scarcity of training data and class imbalance issues were effectively tackled in our approach by the use of data augmentation generative adversarial network (dagan) and model architecture based as a convolutional siamese network with attention mechanism. the backend model was tested for robustness us-ing publicly available datasets under two different classification scenarios(binary/multiclass) with minimal and noisy data. the model achieved pinnacle testing accuracy of 99.30% and 98.40% on the two respective scenarios, making it completely reliable for its users. on top of that a semi-live training scenario was introduced, which helps improve the app performance over time as data accumulates. overall, the problems of generalisability of complex models and data inefficiency is tackled through the model architecture. the app based setting with semi live training helps in ease of access to reliable healthcare in the society, as well as help ineffective research of rare diseases in a minimal data setting. the increasing adoption of electronic technologies is widely recognized as a critical strategy for making health care more cost-effective. smartphone-based m-health applications have the potential to change many of the modern-day techniques of how healthcare services are delivered by enabling remote diagnosis [1] , but it is yet to realize its fullest potential. there has been a paradigm shift in the research on medical sciences, and technologies like point-of-care diagnosis and analysis have developed with more custom-designed smartphone applications coming into prominence. due to the high rate of infection, with the total number of confirmed cases exceeding twenty million since its recent outbreak, covid-19 was chosen as the initial disease target for us to study. with studies confirming that chest x-rays are irreplaceable in a preliminary screening of covid-19, we started with chest x-rays as the tool to detect the presence of coronavirus in the patients. [2] chest x-ray is the primary imaging technique that plays a pivotal role in disease diagnosis using medical imaging for any pulmonary disease. classic machine learning models have been previously used for the auto-classification of digital chest images [3] [4] . reclaiming the advances of those fields to the benefit of clinical decision making and computeraided systems using deep learning is becoming increasingly nontrivial as new data emerge [5] [6] [7] , with convolutional neural networks (cnns) spearheading the medical imaging domain [8] . a key factor for the success of cnns is its ability to learn distinct features automatically from domain-specific images, and the concept has been reinforced by transfer learning [9] . however, the process of learning distinct features by standard supervised learning using convolutional neural networks can be computationally non-efficient and data expensive. the above methods become incapacitated when combined with a shortage of data. our approach represents a substantial conceptual advance over all other published methods by overcoming the problem of data scarcity using a one-shot learning approach with the implementation of a siamese neural network. contrasting to its counterparts, our method has the added advantage of being more generalizable and handles extreme class imbalance with ease. we leverage open chest x-ray datasets of covid-19 and various other diseases that were publicly available (refer datasets section) [10] . 
once a siamese network has been tuned, it can capitalize on powerful discriminative features to generalize the predictive power of the network not just to new data, but to entirely new classes from unknown distributions [11] [12] . using a convolutional architecture, we can achieve reliable results that exceed those of other deep learning models with near state-of-the-art performance on one-shot classification tasks. the world is being crippled by covid-19, an acute resolved disease whose onset might result in death due to massive alveolar damage and progressive respiratory failure [13] . a robust and accurate automatic diagnosis of covid-19 is vital for countries to prompt timely referral of the patient to quarantine, rapid intubation of severe cases in specialized hospitals, and ultimately curb the spread. the definitive test for sars-cov-2 is the real-time reverse transcriptase-polymerase chain reaction (rt-pcr) test. however, with sensitivity reported as low as 60-70% [14] and as high as 95-97% [15] , a meta-analysis concluded the pooled sensitivity of rt-pcr to be 89% [16] . these numbers point out false negatives to be a real clinical problem, and several negative tests might be required in a single case to be confident about excluding the disease [17] . a resource-constrained environment demands imaging for medical triage to be restricted to suspected covid-19 patients who present moderate-severe clinical features and a high pretest probability of disease, and medical imaging done in an early phase might be feature deficient [18] [19] . although the cause of covid-19 was quickly identified to be the sars-cov-2 virus, scientists are still working around the clock to fully understand the biology of the mutating virus and how it infects human cells [20] . all these calls for a robust pre-diagnosis method, which hopes to provide higher generalization, work efficiently with insufficient feature data, and tackles the problem of data scarcity. this is where our proposed method of data augmentation generative adversarial network (dagan) exploited by a convolutional siamese neural network with attention mechanism comes into the picture, exhibiting a state of the art accuracy and sensitivity. generative adversarial networks (gans) are deep learning based generative models which take root from a game theoretic scenario where two networks compete against each other as an adversary. the constituent network models -a generative network and a discriminative network play a zero-sum game. gan architecture paved way for sophisticated domain-specific data augmentation by treating an unsupervised problem as a supervised one, thus automatically training the generative model. data augmentation procedure is crucial in the training procedure of a deep learning model as it has proven to be an effective solution in tackling the problem of overfitting at numerous occasions. as the data could be made more generalized, by providing the same with suitable augmentation strategies. in the case of images, augmentation plays a crucial role. as to correctly identify and recognise specific features in the same, a diverse set of considerably different sets of images are required. image augmentation techniques are henceforth found in diversely different ways, ranging from simple transforms (rotation) to adversarial data multiplicative methods such the one we would be using for our purposes, called the data augmentation generative adversarial networks (da-gan). 
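before the dagan-specific components are described in the next passage, the sketch below illustrates the generic zero-sum training step between a generator and a discriminator mentioned above: the discriminator is trained to separate real from generated samples, while the generator is trained to fool it. the toy multilayer-perceptron architectures, image size and hyperparameters are illustrative assumptions and not the networks used in this study.

```python
import torch
import torch.nn as nn

latent_dim, img_dim = 32, 28 * 28
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(real_images):
    batch = real_images.size(0)
    z = torch.randn(batch, latent_dim)
    fake = G(z)

    # discriminator step: learn to tell real from generated samples
    opt_d.zero_grad()
    d_loss = bce(D(real_images), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    d_loss.backward()
    opt_d.step()

    # generator step: fool the discriminator (the adversarial, zero-sum part)
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

print(gan_step(torch.randn(8, img_dim)))  # one training step on dummy "images"
```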
the purpose and uniqueness of the dagan when compared to other types of gans is the ability to generate distinctive augmented images for any given image sample while preserving the distinctive class features intact. a general network architecture of the same is provided in figure 2. generator: the generator component of the dagan contains an encoder which provides a unique latent space representation for a given image and a decoder which generates an image given a latent representation. any given image is first passed through the encoder to attain the corresponding latent representation, to which a scaled noise (usually sampled from a gaussian distribution) is added to obtain a modified latent vector. the same is then passed through the decoder to obtain the corresponding augmented image. discriminator: the discriminator component of the dagan is similar to other gans, where the basic purpose is to perform a binary classification to tell apart the generated and real images. the discriminator takes as input a fake distribution (generated images) and a real distribution (images belonging to the same class). forming an ideal dataset for a typical multi-class classification task using standard supervised learning methods is quite difficult. in addition to class imbalance issues, data for certain tasks such as medical image analysis can rarely be collected to meet ideal standards. one-shot learning methods help tackle these issues effectively. in the deep learning literature, siamese neural networks are typically used to perform one-shot learning. the siamese neural network is a pair of neural networks trying to learn discriminative features from a pair of data points from two different classes. in our case the siamese network consists of two twin convolutional neural networks which accept distinct inputs but are joined together by an energy function. the latent vector is the overall output from either of the twin neural networks; it is a unique and meaningful representation of the individual image passed. in one-shot learning the overall training objective is to obtain a vector-valued function (neural network) which provides meaningful latent representation vectors for each image passed. as with any machine learning task, one-shot learning too has a loss function whose value conveys how close the network is to attaining optimal parameter values. in the case of siamese networks, the loss is a similarity measure between the latent vector outputs, enforced by a binary class label (like or unlike). the energy function takes as input the latent vectors formed by the cnns at their last dense layer (for each input pair passed) and outputs an energy value. the overall goal during the training process (optimization) can now be conveyed in terms of the energy function: the energy of a like pair is minimised and the energy between unlike pairs is maximised. the typical energy functions used could be anything from a simple euclidean norm to a fairly advanced function such as the contrastive energy function. a typical example of the contrastive energy is explained in brief. the contrastive energy function takes in two latent vectors as input and in general performs the following computation:

$e(x_1, x_2, y) = (1 - y)\, l_s\big(d_w(x_1, x_2)\big) + y\, l_d\big(d_w(x_1, x_2)\big)$

where $d_w$ is the parameterized distance function between the latent vectors of the pair (for instance the euclidean norm $\lVert f_w(x_1) - f_w(x_2) \rVert_2$ produced by the twin network $f_w$), $y$ is the binary class label, and $l_s$ and $l_d$ are functions chosen as per the task.
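the twin-network-plus-energy idea above can be sketched as follows: a shared-weight convolutional encoder maps each image of a pair to a latent vector, and a contrastive energy pulls like pairs (label 0) together while pushing unlike pairs (label 1) apart up to a margin. the architecture, the margin value and the input size are illustrative assumptions rather than the model reported in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwinCNN(nn.Module):
    """Shared-weight CNN that maps an image to a latent vector."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4))
        self.fc = nn.Linear(32 * 4 * 4, out_dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

def contrastive_energy(z1, z2, y, margin=2.0):
    """y = 0 for a like pair, 1 for an unlike pair (as in the pairing scheme above).
    Like pairs are pulled together, unlike pairs pushed apart up to `margin`."""
    d_w = F.pairwise_distance(z1, z2)                     # euclidean distance between latents
    l_s = d_w.pow(2)                                      # similar-pair term
    l_d = torch.clamp(margin - d_w, min=0).pow(2)         # dissimilar-pair term
    return ((1 - y) * l_s + y * l_d).mean()

net = TwinCNN()
x1, x2 = torch.randn(8, 1, 64, 64), torch.randn(8, 1, 64, 64)   # dummy x-ray pairs
y = torch.randint(0, 2, (8,)).float()
loss = contrastive_energy(net(x1), net(x2), y)
loss.backward()
```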
data no matter how clean will have irrelevant features, many of the predictive or analytic tasks does not rely on all of the features present in raw data. one of the factors that sets us humans apart from computers is our instinct of contextual relevance while performing any of our day to day activities. our brains are adept at such tasks which makes us able to perform complex tasks quite easily. attention is a deep learning technique designed to mimic this very property of our brain. attention, as the name suggests is a methodology by which a neural network learns to selectively focus on relevant features and ignoring the rest. attention was first introduced in the branch of natural language processing (nlp) [21] , where it enabled contextual understanding for sequence to sequence models (ex: machine translation) which led to better performance of the same. attention mechanism in nlp solved the problem of vanishing gradients for recurrent neural networks and at the same time brought in feature relevance understanding which boosts performance. the revolutionary impacts of deep learning paved way for creation of more efficient network architectures such as the transformer (bert) [22] , which are widely applied these days. moreover attention has been applied to other fields related to deep learning such as the ones focusing on signal and visual processing. images are a very abstract from of data, they contain numerous amounts of patterns (features) which could be analysed using latest computational tools to gain understanding on them. for many machine learning tasks such as regression or classification, identifying features of contextual relavance would improve the model performance and simplify the task. the same is the case for machine learning applied to images. for most images, the regions could be broadly classified as background and objects, where objects are of prime focus and background doesn't contribute to inference. because of the same, knowing where to look and what regions to focus on while making an inference from images helps boost the performance of the model. convolutional neural networks(cnn) are one of the best feature extraction tools for images in today's deep learning literature, attention applied to convolutional features will help pick out relevant features of interest from the large pool of features extracted by a cnn. the outbreak of the covid-19 [23] pandemic and the increasing count of the number of deaths have captured the attention of most researchers across the world. several works have been published which aim to either study this virus or in a way aim to curb the spread. owing to the supremacy of computer vision and deep learning in the field of medical imaging, most of the researchers are using these tools as means to diagnose covid-19. chest x-ray (cxr) and computed tomography (ct) are the imaging techniques that play an important role in the detection of covid-19 [24] , [25] . as inferred from literature convolutional neural network (cnn) remains the preferred choice of researchers for tackling covid-19 from digitized images and several reviews have been carried out to highlight it's recent contributions to covid-19 detection [26] [28] . for example in [29] a cnn based on inception network was applied to detect covid-19 disease within computed tomography (ct). 
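as a concrete illustration of attention applied to convolutional features, the sketch below computes a softmax weight for every spatial location of a cnn feature map and returns an attention-weighted descriptor, so that informative regions contribute more than background. this is a generic spatial-attention sketch under assumed shapes, not the specific attention mechanism used in this study.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Learns a weight for every spatial location of a CNN feature map and
    returns an attention-weighted descriptor, so the model can focus on
    relevant regions (e.g. lung fields) and ignore background."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)   # one score per location

    def forward(self, feat):                                  # feat: (batch, C, H, W)
        b, c, h, w = feat.shape
        attn = torch.softmax(self.score(feat).view(b, -1), dim=-1)   # weights over H*W locations
        attn = attn.view(b, 1, h, w)
        pooled = (feat * attn).sum(dim=(2, 3))                       # attention-weighted descriptor
        return pooled, attn

feat = torch.randn(2, 32, 16, 16)              # feature map from a CNN backbone
pooled, attn = SpatialAttention(32)(feat)
print(pooled.shape, attn.shape)                # torch.Size([2, 32]) torch.Size([2, 1, 16, 16])
```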
they achieved a total accuracy of 89.5% with specificity of 0.88 and sensitivity of 0.87 on their internal validation and a total accuracy of 79.3% with specificity of 0.83 and sensitivity of 0.67 on the external testing dataset. in [30] a modified version of the resnet-50 pre-trained network was used to classify ct images into three classes: healthy, covid-19 and bacterial pneumonia. their model results showed that the architecture could accurately identify the covid-19 patients from others with an auc of 0.99 and sensitivity of 0.93. also their model could discriminate covid-19 infected patients and bacteria pneumonia-infected patients with an auc of 0.95, recall (sensitivity) of 0.96. in [31] a cnn architecture called covid-net based on transfer learning was applied to classify the chest x-ray (cxr) images into four classes of normal, bacterial infection, non-covid and covid-19 viral infection. the architecture attained a best accuracy of 93.3% on their test dataset. in [32] the authors proposed a deep learning model with 4 convolutional layers and 2 dense layers in addition to classical image augmentation and achieved 93.73% testing accuracy. in [33] the authors presented a transfer learning method with a deep residual network for pediatric pneumonia diagnosis. the authors proposed a deep learning model with 49 convolutional layers and 2 dense layers and achieved 96.70% testing accuracy. in [9] the authors proposed a modified cnn based on class decomposition, termed as decompose transfer compose model to improve the performance of pre-trained models on the detection of covid-19 cases from chest x-ray images. imagenet pre-trained resnet model was used for transfer-learning and they also used data augmentation and histogram modification technique to enhance contrast of each image. their proposed detrac-resnet 18 model achieved an accuracy of 95.12%. in [34] the authors proposed a pneumonia chest x-ray detection based on generative adversarial networks (gan) with a fine-tuned deep transfer learning for a limited dataset. the authors chose alexnet, googlenet, squeeznet, and resnet18 are selected as deep transfer learning models. the distinctive observation drawn for this paper was the use of gan for generating similar examples of the dataset besides tackling the problem of overfitting. their work used 10% percent of the original dataset while generating the other 90% using gan. in [35] the authors presented a method to generate synthetic chest x-ray (cxr) images by developing an auxiliary classier generative adversarial network (acgan) based model. utilizing three publicly available datasets of ieee covid chest x-ray dataset [10] , covid-19 radiography database [36] and covid-19 chest x-ray dataset [37] the authors demonstrated that synthetic images produced by the acgan based model could improve the performance of cnn(vgg-16 in their case) for covid-19 detection. the classification results showed an accuracy of 85% with the cnn alone, and with the addition of the synthetically generated images via acgan the accuracy increased to 95% . thus having understood the advantages that gan offers on training models with relatively smaller datasets, in our research we implemented the dagan combined with the attention based siamese neural networks for getting the optimum results out of a relatively smaller dataset used for training our model [10] . for our experiments the application was build using android studio. 
mvvm (model view view-model) architecture has been used in the app, which helps in proper state management following the ui material design guidelines while building. for storage of the app data(like the local x-rays samples) and authentication, firebase is used. the deep learning model was trained using publicly available datasets to test for robustness of the same. some of the major issues with such datasets were lack of data, inherent noise features and class imbalance. in the proposed methodology, all three of these issues were tackled effectively. the smartphone application would pave way for improvement of the existing model and provide ease of access to state of the art disease diagnosis for common pulmonary diseases to everyone.a semi-live training scenario was build on the cloud which enables the model to imporve over time gradually without intervention. the android application acts as an accessible platform which assists doctors or patients in uploading the x-rays samples to be inferred by the deep learning model, and obtain corresponding diagnosis results, as seen in figure 5 . the application is as a cloud-user interface which enables wider accessibility and help the model imporve by the cloud build semi-live training scenario. the algorithm is deployed in a fas ( function as a service ), which gets triggered when a user uploads a sample. there are primarily 2 categories of users : doctors and patients. users under the doctor category would be verified and could act as a potential source for labeled training data. under the patient category the inference mechanism gets triggered which enables the backend model to provide the user with a diagnosis result for the uploaded sample. the backend deep learning architecture mainly consists of two deep learning models-the dagan for robust and effective data augmentation, followed by the convolutional siamese network with attention mechanism. the siamese network is proven to be data efficient through our experiments. both of these networks are pretrained on publicly available datasets. to obtain the pretrained dagan model , suitably processed x-ray images were provided with corresponding class labels. for the pretrained siamese network, visually variant augmented samples with in-class features preserved were generated using the dagan model. then these generated samples were paired up for all possible combinations. each of the pairs were assigned a binary label based on the classes on which the two images in a pair belonged to -0 if both images are from the same class of pulmonary diseases and 1 otherwise. the resulting dataset was then used to train the siamese network. a set of well labelled and noise free images are selected to be the standard dataset for comparison. during inference procedure one of the twin among the siamese network generates a latent vector for the uploaded image by a forward pass. the second twin generates a latent vector for an image in the standard dataset. the obtained latent vectors are then compared using an energy function. the energy values of all classes in the standard dataset are obtained using a similar procedure, and the class with the lowest average value is selected. the class thus selected becomes the diagnosis for the particular uploaded image. the diagnosis made is conveyed back to the user through an online database. 
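the pairing scheme and the energy-based inference described above can be sketched as follows: training pairs are labelled 0 when both images belong to the same disease class and 1 otherwise, and at inference the uploaded image is assigned the class of the standard dataset with the lowest average energy to its latent vector. the `embed` and `energy` callables below are hypothetical stand-ins for the trained twin network and its energy function.

```python
from itertools import combinations
import numpy as np

def make_pairs(samples):
    """samples: list of (image, class_label). Returns (img_a, img_b, 0/1) pairs:
    0 if both images share a pulmonary-disease class, 1 otherwise."""
    return [(a_img, b_img, 0 if a_cls == b_cls else 1)
            for (a_img, a_cls), (b_img, b_cls) in combinations(samples, 2)]

def diagnose(query_img, standard_set, embed, energy):
    """standard_set: {class_name: [images]}. `embed` maps an image to its latent
    vector (one twin of the siamese network); `energy` compares two latents.
    The class with the lowest average energy to the query is returned."""
    q = embed(query_img)
    avg_energy = {cls: np.mean([energy(q, embed(img)) for img in imgs])
                  for cls, imgs in standard_set.items()}
    return min(avg_energy, key=avg_energy.get)

# toy stand-ins for the trained network and energy function
embed = lambda img: np.asarray(img, dtype=float)
energy = lambda a, b: float(np.linalg.norm(a - b))
standard = {'normal': [[0.0, 0.1]], 'pneumonia': [[1.0, 1.2]], 'covid-19': [[2.0, 2.1]]}
print(diagnose([1.1, 1.0], standard, embed, energy))   # -> 'pneumonia'
```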
x-ray images used by the backend model could show large variance due to a variety of reasons, which includes lighting condition while the picture is taken, the x-ray machine specifications or camera quality of the user's smartphone etc. since the challenges such as this due to data variation should be accounted in a real world scenario, a semi-live training scenario was introduced, which enables the model parameters of the pretrained model to further adapt to new or variant data. the scenario is triggered when sufficient amounts of data is obtained. both the datasets used in this study are publicly available. apart from the selection of the standard dataset, no specific dataset cleansing was done. training process was done on the data including even those images with inherent noise features present, the same helps in confirming the robustness of the proposed model. dataset-1 was published in 2018, the x-rays images obtained were part of clinical care conducted year to year from guangzhou medical center from 5,863 different patients. dataset-2 was published as an effort to give out relevant data for widespread studies that were conducted to tackle the covid-19 pandemic situation. test set size of datasets 1 and 2 were selected as 20% of images from each class. the training set was further enlarged using the dagan model to ensure a generalized training for the proposed model. a good amount of images were selected as the testing set for the proposed model, so as to robustly test and evaluate the proposed method. as per the split of 20%, data from each class of the two datasets were randomly selected to be included in the test set. for dataset(1) used for the binary classification task the test set consisted of 1170 images out of the 5860 images, as for dataset (2) used for the multiclass classification task the test set consisted of 180 images out of the 905 images. no generation of images were done on the testing set, as it is considered important to conduct model evaluation on real world data samples.since the testing set in both experiments are large, the confidence interval for the testing accuracy of the proposed model was calculated by assuming a gaussian distribution for the dataset proportion. testing accuracy [39] 2018 for our purposes, in order obtain a robust and deployable model, we combine both datasets and train a multiclass classification model which is robustly evaluated for performance. the testing set is selected to be 20% of images from each class, at random. the model thus obtained, achieved a testing accuracy of 97.8%. the validation set is selected to be 20% of images in each class, from the training set. figure 11 illustrates the large class separation found. the illustration (figures [12] [13] [14] ) shows how effective is the latent space representation so formed by training the model, in representing the lower dimensional projection of images. in the wake of the global pandemic preventive and therapeutic solutions are in the limelight as doctors and healthcare professionals work to tackle the threat, with diagnostic methods having extensive capabilities being the need of the hour. the covid-19 outbreak has caused an adverse effect in all walks of day to day life worldwide. fact remains that the spread of such a disease could have been prevented during the early stages, with the help of accurate methods of diagnosis. 
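the confidence interval mentioned above, computed by assuming a gaussian distribution for the test-set proportion, corresponds to the standard normal-approximation interval for an accuracy treated as a proportion; a minimal sketch is given below using the reported accuracies and test-set sizes. the exact interval values reported in the paper may differ slightly from this approximation.

```python
import math

def accuracy_confidence_interval(accuracy, n_test, z=1.96):
    """95% normal-approximation interval for a test-set accuracy treated as a
    proportion over n_test independent samples."""
    se = math.sqrt(accuracy * (1 - accuracy) / n_test)
    return accuracy - z * se, accuracy + z * se

# e.g. the binary task: 99.30% accuracy on the 1170-image test set
print(accuracy_confidence_interval(0.9930, 1170))   # roughly (0.988, 0.998)
# and the multiclass task: 98.40% on the 180-image test set
print(accuracy_confidence_interval(0.9840, 180))    # roughly (0.966, 1.002), clipped at 1.0 in practice
```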
medical images such as x-rays and ct scans are of great use when it comes to disease diagnosis, particularly chest x-rays being pivotal in diagnosis of many common and dangerous respiratory diseases. radiologists can infer many crucial facts from a chest x-ray which can be put to use in diagnosing several diseases. today's ai methods that mimic disease diagnosis as done by radiologists could outperform any human radiologist, owing to the higher pattern recognition capabilities and the lack of the human element of error or inefficiency in turn paving way for extensive research in this area. a common and efficient method to employ would be to use a convolutional neural network (cnn) based classifier , which could accurately recognise patterns from images to make necessary predictions. the limitation of the same being the requirement of huge amounts of data to obtain a classifier model with enough generalizability and accuracy. metrics improvement of an existing model is a hard task since retraining process for a large deep learning model would be expensive in terms of time and computation, vulnerable to scalability issues in retraining. hence we adopted feature comparison based methods which are superior to feature recognising methods in these respects, exploiting a deep generative network for data augmentation. the model exhibited profound comparison metrics having very distinguishable dissimilarity indices. similar classes showed remarkably low indices ranging from 0.05 to 0.6 , while different classes had higher values lying between 1.98 and 2.67. these performance indices of dissimilarity and the the large gap between these classes consolidates the fact that our model is able to clearly demarcate and classify diseases with state of the art efficiency. the limitations of this study include the inevitable noise factor on the dataset used, to tackle the same a cloud based live training method has been employed which uses properly annotated and identified data from medical practitioners worldwide. the underlying method could be employed to detect several other diseases if necessary, modification required for the current model minimal as compared to any deep learning based backend systems. doctors and radiologists can leverage the ability of our application to make a reliable remote diagnosis, thereby saving considerable time which can be devoted to medication or prescriptive measures. due to the high generalisability and data efficiency of the method , the application could prove itself to be a great tool in not only in accurately diagnosing diseases of interest, but to also conduct crucial studies on emerging or rare respiratory conditions. state of telehealth ct imaging and differential diagnosis of covid-19 artificial neural network-based classification system for lung nodules on computed tomography scans lung cancer classification using neural networks for ct images learning transformations for automated classification of manifestation of tuberculosis using convolutional neural network lung pattern classification for interstitial lung diseases using a deep convolutional neural network computer aided lung cancer diagnosis with deep learning algorithms deep learning mohamed gaber : classification of covid-19 in chest x-ray images using detrac covid-19 image data collection one-shot learning of object categories. 
pattern analysis and machine intelligence one shot learning of simple visual concepts pathological fndings of covid19 associated with acute respiratory distress syndrome essentials for radiologists on covid-19: an update-radiology scientific expert panel radiology department preparedness for covid-19: radiology scientific expert panel sensitivity of chest ct for covid-19: comparison to rt-pcr variation in false-negative rate of reverse transcriptase polymerase chain reaction-based sars-cov-2 tests by time since exposure imaging profile of the covid-19 infection: radiologic findings and literature review covid-19: towards understanding of pathogenesis neural machine translation by jointly learning to align and translate attention is all you need the coronavirus disease 2019 (covid-19) imaging prole of the covid-19 infection: radiologic findings and literature review clinical features of patients infected with 2019 novel coronavirus in wuhan, china the role of imaging in the detection and management of covid-19: a review artificial intelligence distinguishes covid-19 from community acquired pneumonia on chest ct review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for covid-19 a deep learning algorithm using ct images to screen for corona virus disease deep learning enables accurate diagnosis of novel coronavirus covid-net: a tailored deep convolutional neural network design for detection of covid-19 cases from chest radiography images an efficient deep learning approach to pneumonia classification in healthcare a transfer learning method with deep residual network for pediatric pneumonia diagnosis detection of coronavirus (covid-19) associated pneumonia based on generative adversarial networks and a fine-tuned deep transfer learning model using chest x-ray dataset covidgan: data augmentation using auxiliary classifier gan for improved covid-19 detection covid19 radiography database covid-19 chest x-ray dataset initiative large dataset of labeled optical coherence tomography (oct) and chest x-ray images identifying medical diagnoses and treatable diseases by image-based deep learning an efficient deep learning approach to pneumonia classification in healthcare classification of images of childhood pneumonia using convolutional neural networks a transfer learning method with deep residual network for pediatric pneumonia diagnosis predict pneumonia with chest x-ray images based on convolutional deep neural learning networks covidaid: covid-19 detection using chest x-ray new machine learning method for image-based diagnosis of covid-19 aidcov: an interpretable artificial intelligence model for detection of covid-19 from chest radiography images the custom python code and android app used in this study are available from the corresponding author upon reasonable request and is to be used only for educational and research purposes. key: cord-288024-1mw0k5yu authors: wang, wei; liang, qiaozhuan; mahto, raj v.; deng, wei; zhang, stephen x. title: entrepreneurial entry: the role of social media date: 2020-09-29 journal: technol forecast soc change doi: 10.1016/j.techfore.2020.120337 sha: doc_id: 288024 cord_uid: 1mw0k5yu despite the exponential growth of social media use, whether and how social media use may affect entrepreneurial entry remains a key research gap. in this study we examine whether individuals’ social media use influences their entrepreneurial entry. 
drawing on social network theory, we argue that social media use allows individuals to obtain valuable social capital, as indicated by their offline social network, which increases their entrepreneurial entry. we further posit the relationship between social media use and entrepreneurial entry depends on individuals’ trust propensity based on the nature of social media as weak ties. our model was supported by a nationally representative survey of 18,873 adults in china over two years. as the first paper on the role of social media on entrepreneurial entry, we hope our research highlights and puts forward research intersecting social media and entrepreneurship. social media, defined as online social networking platforms for individuals to connect and communicate with others (e.g., facebook), has attracted billions of users. an emerging body of literature suggests that social media enables entrepreneurs to obtain knowledge about customers or opportunities, mobilize resources to progress their ventures, and manage customer relationships after venture launch (cheng & shiu, 2019; de zubielqui & jones, 2020; drummond et al., 2018) . further, social media allows entrepreneurs to efficiently manage their online relationships and reinforce their offline relationships (smith et al., 2017; thomas et al., 2020; wang et al., 2019) . despite much research on the impact of social media on the launch and post-launch stages of the entrepreneurial process (bird & schjoedt, 2009; gruber, 2002; ratinho et al., 2015) , there is little research on the impact of social media on the pre-launch stage, the first of the three stages of the entrepreneurial process (gruber, 2002) . despite the popularity of social media, it remains unclear whether and how social media affects individuals at the prelaunch stage of the entrepreneurial process, given social media consists of weak ties and substantial noise from false, inaccurate or even fake information, which may or may not benefit its users. in this study, we aim to contribute to the literature by investigating whether individuals' social media use affects their entrepreneurial entry based on social network theory. we argue that a higher social media use will allow an individual to develop a larger online social network and accumulate a greater amount of social capital, which facilitates entrepreneurial entry. a larger social network may facilitate individuals' information and knowledge seeking activities (grossman et al., 2012; miller et al., 2006) , which have a significant impact on their ability to generate and implement entrepreneurial ideas in the pre-launch stage (bhimani et al., 2019; cheng & shiu, 2019; orlandi et al., 2020) . social media, unlike offline face-to-face social networks, allows a user to develop a large social network beyond their geographical area without incurring significant effort and monetary cost (pang, 2018; smith et al., 2017) . the large social network arising from social media further enables social media users to build larger offline networks beyond their geographical proximity. hence, we argue that individuals' social media use has a positive impact on their offline network, which facilitates their entrepreneurial entry. however, social media is dominated by weak ties, and individuals with low trust propensity may not trust other online users easily so they are cautious about online information and knowledge. 
thus, we propose that trust propensity, an individual's tendency to believe in others (choi, 2019; gefen et al., 2003), moderates the relationship between social media use and entrepreneurial entry. fig. 1 displays the proposed model. we assessed the proposed model on a publicly available dataset of china family panel studies (cfps), which consists of a sample of nationally representative adults. our findings reveal that social media use has a positive impact on entrepreneurial entry with individuals' offline network serving as a partial mediator. further, the findings confirm that individuals' trust propensity moderates the relationship between their social media use and entrepreneurial entry, with the relationship becoming weaker for individuals with high trust propensity. our study makes several important contributions to the literature. first, we contribute to the emerging entrepreneurship literature on an individual's transition to entrepreneurship by identifying factors contributing to the actual transition (mahto & mcdowell, 2018). the identification of social media use addresses mahto and mcdowell's (2018) call for more research on novel antecedents of individuals' actual transition to entrepreneurship. to the best of our knowledge, this is the first study on the role of social media on individuals' entrepreneurial entry using social network theory. the research on social media in the entrepreneurship area has focused on post-launch phases of entrepreneurship (cheng & shiu, 2019; drummond et al., 2018; mumi et al., 2019), while research on individuals at the pre-launch stage of the entrepreneurial process is lacking. second, our study specified a mechanism for the impact of individuals' social media use on entrepreneurial entry via their offline network and used instrumental variables to help infer the causality. yu et al. (2018, p. 2313) noted that "specifying mediation models is essential to the advancement and maturation of particular research domains. as noted, mathieu et al. (2008: 203) write, 'developing an understanding of the underlying mechanisms or mediators (i.e., m), through which x predicts y, or x → m → y relationships, is what moves organizational research beyond dust-bowl empiricism and toward a true science.'" third, we contribute to the limited stream of research in the entrepreneurship literature on the networking of individuals in the pre-launch phase, which has focused on networking offline (dimitratos et al., 2014; johannisson, 2009; klyver & foley, 2012). instead, we offer a clearer picture of networking for entrepreneurship by connecting the literature on online social media use (fischer & reuber, 2011; smith et al., 2017) with offline social networks and entrepreneurial entry.
the paper is organized as follows. the next section, section 2, provides an overview of the social capital theory and associated literature used to construct arguments for hypothesis development. section 3, data and methods, reports the context, method, and the variables. section 4 reports the results of the statistical analysis, instrumental variable analysis to address endogeneity concerns, and an assessment of robustness checks. section 5 discusses the study findings, outlines key study limitations, and provides guidance for future research, and section 6 concludes.
social capital theory (rutten & boekema, 2007 ) is a popular theoretical framework among management scholars. more recently, the theory has been increasingly used by entrepreneurship scholars to explain behaviors at the levels of both the individual (e.g., entrepreneurs) and firm (e.g., new ventures) (dimitratos et al., 2014; klyver & foley, 2012; mcadam et al., 2019) . according to the theory, the network of an individual has a significant influence on an individual's behavior (e.g., seeking a specific job) and outcomes (e.g., getting the desired job). in the theory, the network represents important capital, referred to as social capital, that produces outcomes valued by individuals (mariotti & delbridge, 2012) . social capital allows an individual to obtain benefits by virtue of their membership in the social network. the underlying assumption of social capital is, "it's not what you know, it's who you know" (woolcock and narayan (2000) , p. 255). for example, people with higher social capital are more likely to find a job (granovetter, 1995) or progress in their career (gabby & zuckerman, 1998) . for firms, social capital offers the ability to overcome the liability of newness or resource scarcity (mariotti & delbridge, 2012) . in entrepreneurship literature, scholars have used social capital to explain resource mobilization and pursuit of an opportunity by both entrepreneurs and small firms (dubini & aldrich, 1991; stuart & sorenson, 2007) . at the individual level, entrepreneurs embedded in a network are more likely to overcome challenges of resource scarcity and act promptly to launch a venture to capitalize on an opportunity (klyver & hindle, 2006) . for example, high social competence entrepreneurs establish strategic networks to obtain information, resources and more strategic business contacts (baron & markman, 2003) . mahto, ahluwalia and walsh (2018) supported the role of social capital by arguing that entrepreneurs with high social capital are more likely to succeed in obtaining venture capital funding. further, entrepreneurship scholars have argued that social networks influence entrepreneurs' decisions and the probability of executing a plan (davidsson & honig, 2003; jack & anderson, 2002; ratinho et al., 2015) . in women entrepreneurs, the presence of a robust social network is a key determinant of success (mcadam et al., 2019) . research suggests that the extent of a social network determines which resources entrepreneurs can obtain (jenssen & koenig, 2002; witt, 2004) . in the entrepreneurial context, scholars have also examined the influence of social networking at the firm level. for example, new and small firms often use a strong social network to overcome the liability of newness or smallness to pursue growth opportunities (galkina & chetty, 2015; mariotti & delbridge, 2012) . entrepreneurial ventures with limited resources often rely on their networks to obtain information and knowledge about consumers, competitors and networks in a foreign market (lu & beamish, 2001; wright & dana, 2003; yeung, 2002) . in the internationalization context, it is almost impossible for entrepreneurial firms to enter a foreign market without a robust social network (galkina & chetty, 2015) . it is well documented that new firms commonly use strategic networking for resources and capabilities (e.g., research and development) unavailable within the firm. 
the research on social networks in the entrepreneurship area is robust, but is focused almost exclusively on traditional offline social wang, et al. technological forecasting & social change 161 (2020) 120337 networks with limited attention to the dominant online social media. as offline social networks and online social networks differ significantly in terms of strength of ties (i.e., weak ties vs. strong ties) between network associates (filiposka et al., 2017; rosen et al., 2010; subrahmanyam et al., 2008) , empirical findings from traditional offline social networks may not be applicable to online social networks because offline social networks are dominated by strong ties while online social media are dominated by weak ties (filiposka et al., 2017) , and strong ties are based on a high degree of trust and reciprocity while weak ties have low trust and reciprocity. this significantly limits our understanding of entrepreneurial phenomena in the context of online social media. further, the research on social networks has also paid limited attention to the pre-launch phase of the entrepreneurial process, focusing mostly on entrepreneurs and established entrepreneurial ventures. finally, as offline social networks, which have strong ties, are the main context of the literature, the role of individual trust propensity remains unexplored as well. this offers a unique opportunity to investigate the role of social media and individuals' trust propensity in the pre-launch phase of the entrepreneurial process. the widespread adoption of the internet has led to an exponential growth in social media around the world. we refer to social media as "online services that support social interactions among users through greatly accessible and scalable web-or mobile-based publishing techniques" (cheng & shiu, 2019, p. 38) . social media, using advanced information and communication technologies, offers its users the ability to connect, communicate, and engage with others on the platform (bhimani et al., 2019; kavota et al., 2020; orlandi et al., 2020) . some of the most popular social media companies in the world are facebook, twitter, qq, and wechat. the large number of users coupled with other benefits of social media platforms, such as marketing, engagement, and customer relationship management, have attracted firms and organizations to these platforms. for example, firms have used social media to build an effective business relationship with their customers (steinhoff et al., 2019) , create brand loyalty (helme-guizon & magnoni, 2019), and engage in knowledge acquisition activities (muninger et al., 2019) . firms have also started adopting social media to enhance their internal operations by strengthening communication and collaboration in teams (raghuram et al., 2019) . thus, social media and its impact on firms and their environment has intrigued business and management scholars driving growth of the literature. recently, entrepreneurship scholars have begun exploring the impact of social media on entrepreneurial phenomena. limited research on social media in entrepreneurship suggests that social media allows entrepreneurial firms to enhance exposure (mumi et al., 2019) , mobilize resources (drummond et al., 2018) , and improve innovation performance (de zubielqui & jones, 2020) . this limited research, while enlightening, is devoted almost entirely to the post-launch stage of the entrepreneurial process, where a start-up is already in existence. 
the impact of social media on other stages of the entrepreneurial process, especially the launch stage (i.e., entrepreneurial entry), remains unexplored and is worthy of further scholarly exploration. for example, even though we know that social media can offer new effectual pathways for individuals by augmenting their social network, whether social media influences entrepreneurial entry or offline social networks remains unexplored. thus, our goal in this study is to address the gap in our understanding of the impact of social media on entrepreneurial entry. a social network refers to a network of friends and acquaintances tied with formal and informal connections (barnett et al., 2019) , that can exist both online and offline. social media is useful for creating, expanding and managing networks. research suggests social media can be used to initiate weak ties (e.g., to start a new connection) and manage strong ties (i.e., to reinforce an existing connection) (smith et al., 2017) . similar to social interactions in a physical setting, people can interact with others and build connections in the virtual world of social media, which eliminates the need for a physical presence in the geographical proximity of the connection target. the lack of requirement for geographical proximity with the in-built relationship management tools in social media allows a user to connect with a significantly larger number of other users regardless of their physical location. the strength of relationships among connected users in social media is reflected by the level of interaction among them; users in a strong connection have a higher level of interaction and vice versa. however, given the probability of a much larger number of connections in social media, dominance of weak ties is accepted. when connected users, either online or offline, in a network reinforce their connection by enhancing their level of interaction in both mediums (i.e., offline and online), they strengthen ties. for example, when two connected users in social media engage in offline activities, they may enhance their offline social tie through the joint experience . research also informs that social media use helps reinforce or maintain the strength of relationships among offline friends (thomas et al., 2020) . social media allows people to communicate with their offline friends instantly and conveniently without the need to be in geographical proximity (barnett et al., 2019) . the opportunity to have a higher level of interaction at any time regardless of physical location offers social media users the ability to manage and enlarge their offline social network. further, social media can also be used to initiate offline ties directly. in the digital age, users can connect their friends and acquaintances to other friends and acquaintances on social media. social media platforms also recommend connections to users based on their user profile, preferences, and online activities to generate higher user engagement. for example, in china, when a user intends to connect with a person known to their friends or connections, they can ask their friends for a wechat name card recommendation. once connected online, users can extend their connection to their offline networks as well. as a result, higher social media use may enhance a user's offline social network. thus, we hypothesize: h1. social media use of a user is positively associated with their offline social network. 
entrepreneurship, a context-dependent social process, is the exploitation of a market opportunity through a combination of available resources by entrepreneurs (shane & venkataraman, 2000) . the multistage process consists of: (a) the pre-launch stage, involving opportunity identification and evaluation, (b) the launch stage, involving business planning, resource acquisition, and entrepreneurial entry, and (c) the post-launch stage, involving venture development and growth (gruber, 2002) . our focus in this study is on entrepreneurial entry, which is the bridge between the pre-launch and launch stages of the entrepreneurial process, representing the transition from an individual to an entrepreneur (mahto & mcdowell, 2018; yeganegi et al., 2019) . entrepreneurial entry requires a viable entrepreneurial idea (i.e., opportunity) and resources (ratinho et al., 2015; ucbasaran et al., 2008) . individuals' social networks are important for researching and assessing entrepreneurial ideas (fiet et al., 2013) and accumulating valuable resources for entrepreneurial entry (grossman et al., 2012) . research suggests that networks play a crucial role in the success of entrepreneurs and their ventures (galkina & chetty, 2015; holm et al., 1996) . social networks allow individuals to access information and resources (chell & baines, 2000) . a larger social network allows entrepreneurs and smes to overcome resource scarcity for performance enhancement and expansion, especially international expansion (dimitratos et al., 2014; johannisson, 2009 ). although enlightening, the prior research on social networks in entrepreneurship has focused only on the traditional offline networks. in the digital age, social media has emerged as the key networking tool and enhanced individuals' ability to significantly enlarge their network and draw a higher social capital. these platforms allow entrepreneurs to efficiently manage both their online and offline networks and relationships . social media has significantly expanded the ability of individuals to network by removing geographical, cultural and professional boundaries. it allows people, separated by physical distance, to overcome the distance barrier to network and manage relations effectively (alarcóndel-amo et al., 2018; borst et al., 2018) . this is especially beneficial for an individual searching for entrepreneurial ideas that may be based on practices, trends, or business models emerging in the geographical locations of their network associates. as an example, jack ma of alibaba did not have to travel to the us to stumble upon the idea of an online commerce platform. social media allowed him to observe and obtain that information through network associates. while social media enlarges the social network of an individual with associates located beyond their geographical location, critics of the platform argue that such networks are mostly made up of weak ties lacking the strong ties of an offline network. however, individuals can still obtain useful and valuable information from abundant weak ties in such social networks (granovetter, 1973) . when accessing the network, the individuals have access to knowledge and information from various domains to inform their entrepreneurial ideas. further, the efficiency of social media allows for more effective and easy communications with distant individuals (alarcón-del-amo et al., 2018) . 
the improved communication with distant network associates allows individuals to strengthen their ties and obtain richer and reliable information. individuals may also obtain valuable access to new resources or new associates, who may support the formation of their new entrepreneurial venture. the distant network associates could also offer individuals additional resources in the form of entrepreneurial connections to new partners, buyers, suppliers, or talent, which all improve the chance of launching new ventures. it is well known that people, especially venture capitalists and investors, tend to minimize their risk by investing in known entrepreneurs rather than unknown entrepreneurs . thus, we believe social media use is beneficial for entrepreneurial entry. h2. social media use is positively associated with entrepreneurial entry. social media significantly enhances individuals' capability to expand their networks by removing cultural, geographical, and professional boundaries, to manage and strengthen offline social relationships. according to prior research, offline networks can provide the spatially proximate information and resources relevant to entrepreneurial entry (levinthal & march, 1993; miller et al., 2006) . social media enhances the efficiency and reduces the transaction cost of communication with offline network associates, allowing individuals to use them for information, knowledge and resource search. a recombination of information and knowledge is key to generating and then evaluating entrepreneurial ideas for entrepreneurial entry. in an offline social network, an individual has a stronger relationship with network associates because of their face-to-face interactions and collective experience in geographical proximity. further, geographical proximity in an offline social network facilitates relationships in real life by augmenting face-to-face interactions via virtual means (kim et al., 2019) . the additional channel of communication via virtual social media allows individuals to obtain timely and richer information, which may help them benefit from the collective wisdom and capability of their higher social capital (orlikowski, 2002) to develop entrepreneurial opportunities. the richer information and better access to knowledge and resources all benefit their entrepreneurial entry. thus, with higher social media use, individuals will have an expanded offline social network, which provides them the resources needed for successful entrepreneurial entry. therefore, we propose: h3. the offline social network mediates the relationship between social media use and entrepreneurial entry. trust propensity refers to an individual's tendency to trust others (choi, 2019; gefen et al., 2003) . trust propensity is a stable personality trait formed early in life through socialization and life experience (baer et al., 2018; warren et al., 2014) . like other ingrained personality traits, it affects an individual's behaviour, especially trust, in many situations (baer et al., 2018; friend et al., 2018) . for example, a customer with a high trust propensity is more likely to trust a salesperson without doubting their integrity (friend et al., 2018) . while trust propensity enables trust, it may leave individuals vulnerable due to reduced monitoring and reduced flow of new ideas (molina-morales et al., 2011) . furthermore, an individual with a high trust propensity may be inclined to obtain information from others indiscriminately and be locked into relationships. 
this may influence the individual's information processing capability. in the literature, trust propensity has attracted the attention of scholars seeking to explain not only the offline behavior of individuals, but also online behavior in social media platforms and virtual communities (lu et al., 2010; warren et al., 2014) . in social media, network associates are mostly connected through weak ties representing lack of trust and reciprocity. the existence of significant weak ties in social media makes the role of individual trust propensity critical. we believe trust propensity in social media moderates the impact of individuals' social media use on entrepreneurial entry by influencing their ability to network with strangers and known associates. further, prior findings in the literature suggest that trust influences entrepreneurial information searching and processing (keszey, 2018; molina-morales et al., 2011; wang et al., 2017) . this supports the possibility of trust propensity as the moderator of the link between social media use and entrepreneurial entry. in social media, the trust propensity of an individual influences their interaction and behavior (lu et al., 2010) . accordingly, an individual with a high trust propensity is more inclined to trust others. however, the trust in the relationship may not be mutual as the transacting party may lack the same trust propensity. as a result, the individual may fail to generate identical trust from the other individual thereby limiting the benefits of the relationship. with the aid of social media, an individual has the ability to access a large network of weak ties with remote individuals. this may allow the individual to obtain and validate information crucial to formalizing and finalizing an entrepreneurial idea. however, the advantage of higher social capital from access to a large network on social media may be eroded when individuals have a high trust propensity due to multiple factors. first, the network associates of individuals on social media vary significantly in terms of their trust propensity. the variations in the trust propensity of associates may result in them providing information via social media that may not always be reliable. in particular, network associates with low trust propensity may be reluctant to share valuable information. individuals with high trust propensity will treat a network associate and the information they provide with trust and without suspicion (peralta & saldanha, 2014; wang et al., 2017) . as a result, social media users may be exposed to both true and false information from associates. thus, such individuals are more likely to experience greater obstacles in distinguishing reliable information from unreliable noise, thereby incurring significantly higher information and resource search costs. the higher cost may hinder formation and finalization of an entrepreneurial idea and may hamper entrepreneurial entry. alternatively, individuals with low trust propensity are more likely to be more cautious (choi, 2019) . such individuals, due to their cautious attitude, are less likely to experience noise in their information and resource search, and thus may find it easier to distinguish reliable information from w. wang, et al. technological forecasting & social change 161 (2020) 120337 unreliable information. 
as a result, the cost (i.e., monetary, labor, and time) of obtaining information and resources for such individuals is lower, which may significantly enhance the probability of entrepreneurial entry. second, in social interactions and transactions trust may trigger a lock-in effect (molina-morales et al., 2011) . the lock-in effect refers to a scenario where high trust propensity individuals interact only with a few trusted associates on social media. the lock-in effect prevents the individuals from benefiting from a higher social capital on social media. thus, a lock-in effect may significantly limit individuals' information and resource search to a limited number of associates, which may significantly impair development and formation of their entrepreneurial idea, and ultimately entrepreneurial entry. however, individuals with low trust propensity are less likely to suffer from the lock-in effect thereby increasing their probability of entrepreneurial entry. thus, we hypothesize: h4. trust propensity moderates the relationship between social media use and entrepreneurial entry. we tested our proposed model on a sample of adults in china, a country with the world's largest population and the second highest total gross domestic product. china provides a rich setting for examining the link between social media and entrepreneurial entry for multiple reasons. first, china has experienced exponential growth in entrepreneurship and private enterprise development unleashed by economic transition (he et al., 2019) . the resulting entrepreneurial intensity provides a suitable context for investigating entrepreneurial phenomena including entrepreneurial entry. second, in china the adoption and use of social media is widespread with the world's largest number of users of internet (li et al., 2020) . the major american-based social media platforms, such as facebook, twitter, and instagram, were inaccessible in china at the time of the study (makri & schlegelmilch, 2017) , and people in china use other social media, such as wechat, qq, and sina weibo, which mirror or are similar to the american social media platforms (li et al., 2020) . our data is from the surveys of china family panel studies (cfps). cfps is a nationally representative longitudinal survey conducted every two years since 2010 by the institute of social science survey at peking university (xie & hu, 2014) . the cfps covers 95% of the chinese population in 25 provinces, providing extensive individual-and familylevel economic and social life information. the data from cfps has been validated and used for research in entrepreneurship (barnett et al., 2019) and other fields (hou et al., 2020; sun et al., 2020) . the survey, first conducted in 2010, had three follow-up waves in 2012, 2014, and 2016. our study used data from the 2014 and 2016 waves, which started including variables on internet activities. the 2014 survey contains 37,147 observations from 13,946 families. we matched the samples in 2014 and 2016 through a unique identifier of the respondents. as our study focuses on the transition of an individual to an entrepreneur, we excluded respondents who had entrepreneurial entry, and our final study sample had 18,873 observations. entrepreneurial entry. the cpfs survey followed existing literature to operationalize entrepreneurial entry, an individual's entry into entrepreneurship, by whether (s)he started a business or became selfemployed (barnett et al., 2019; eesley & wang, 2017) . 
accordingly, in the study, entrepreneurial entry refers to whether the respondents became entrepreneurs within the two years between the 2014 and 2016 surveys. specifically, the cpfs surveys had a multiple choice question on employment information, where participants chose their current employment status among: (a) agricultural work for your family, (b) agricultural work for other families, (c) employed, (d) individual/private business/other self-employment, and (e) non-agricultural casual workers. we used option d to operationalize entrepreneurial entry, following barnett et al. (2019) . if the respondent did not choose option d in year 2014 but chose option d in year 2016, (s)he transitioned to self-employment in those two years, and we dummy coded this individual 1 on entrepreneurial entry. social media use. a primary use of social media on the internet is socializing (bhimani et al., 2019; hu et al., 2018) . social media is the main online platform where people connect to each other and share information (bahri et al., 2018) . the 2014 cpfs survey measured social media use by asking, "in general, how frequently do you use the internet to socialize?". the respondents selected an option from the following: (1) everyday, (2) 3-4 times per week, (3) 1-2 times per week, (4) 2-3 times per month, (5) once per month, (6) once per a few months, and (7) never. as the scale was inverted, we reverse recoded it as 8 minus the selected option to obtain the measure of social media use. offline social network. offline social network refers to an individual's network of associates in the real world. scholars have used a variety of measures to assess the social network of an individual, including the cost of maintaining the relationship (du et al., 2015; lei et al., 2015) . in china, the context of our study, a social network is composed primarily of family, friends, and close acquaintances (barnett et al., 2019) . an important means of maintaining such relationships is through exchanging gifts during important festivals, wedding and funeral ceremonies, and other occasions. thus, scholars have used gift expenses and receipts in the previous year to assess social networks in china (barnett et al., 2019; lei et al., 2015) . we focused only on expenses incurred on gifts as the cost of maintaining an offline social network. hence, we operationalized offline social networks by the question on "expenditure on gifts for social relations in the past 12 months" from the 2014 cpfs survey. given that the expenditure is an amount, we transformed it using its natural log (ln (expenditure + 1)) (lei et al., 2015) . trust propensity. following the guidance of previous studies (chen et al., 2015; volland, 2017) , the cpfs survey assessed trust propensity by a single item scale that asked the extent to which a respondent trusts others. the respondents indicated their preference on a 0-10 scale. the data for trust propensity is from the 2014 survey. controls. in statistical analysis, we controlled for respondent demographics such as gender, age, and education. as age can correlate to people's resource availability, experience, and willingness to assume risk in a nonlinear fashion, we followed prior research to include the squared term of age as a control variable (belda & cabrer-borrás, 2018) . given the possibility of personal and family income influencing an individual's ability to finance a start-up (cetindamar et al., 2012; edelman & yli-renko, 2010) , we included it as a control variable in the analysis. 
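the construction of these measures can be sketched as follows. this is a minimal, illustrative python/pandas sketch rather than the authors' code: the cfps column names used here (person_id, employment_status, net_social_freq, gift_expense, trust, age) are hypothetical placeholders for the survey's actual variable codes, while the recodings follow the operationalizations described above.

```python
import numpy as np
import pandas as pd

def build_variables(cfps_2014: pd.DataFrame, cfps_2016: pd.DataFrame) -> pd.DataFrame:
    # match the 2014 and 2016 waves on the respondent's unique identifier
    wave14 = cfps_2014.set_index("person_id")
    wave16 = cfps_2016.set_index("person_id")[["employment_status"]].rename(
        columns={"employment_status": "employment_status_2016"})
    panel = wave14.join(wave16, how="inner")

    # keep only respondents who were not yet self-employed in 2014
    panel = panel[panel["employment_status"] != "self_employed"].copy()

    # entrepreneurial entry: 1 if the respondent was self-employed by 2016
    panel["entry"] = (panel["employment_status_2016"] == "self_employed").astype(int)

    # social media use: reverse-code the 1-7 online-socializing frequency item as 8 minus the answer
    panel["social_media_use"] = 8 - panel["net_social_freq"]

    # offline social network: natural log of gift expenditure, ln(expenditure + 1)
    panel["offline_network"] = np.log(panel["gift_expense"] + 1)

    # trust propensity: the 0-10 trust item is used as measured
    panel["trust_propensity"] = panel["trust"]

    # controls: age squared is added alongside gender, age, education, and income
    panel["age_sq"] = panel["age"] ** 2
    return panel
```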
all control variables are from the 2014 survey. we report descriptive statistics along with correlations among the study variables in table 1. table 1 shows there is significant correlation among study variables, with most of the correlation coefficients below 0.40. the negative correlation between age and social media use, at -0.58, is the only exception. given the reported correlations among study variables, we rule out the possibility of multicollinearity in the sample. we further confirmed our inference by calculating variance inflation factors (vif), which were well below the threshold level of 10, with the highest vif being 1.94.
we used stata and spss to test our hypotheses. in the regression models, we used ordinary least squares regression to predict offline social network and logit regression to predict entrepreneurial entry. we report the results of hypothesis testing in table 2. in the table, model 1 shows the impact of social media use on offline social network. the regression coefficient suggests that social media use has a positive and significant (β=0.039, p<0.01) influence on the offline social network, consistent with hypothesis h1. thus, it provides support for h1. in table 2, models 2 and 3 provide support for hypotheses h2 and h3. the results of model 2 show the main effect of social media use on entrepreneurial entry is significant (β=0.050, p<0.05), thus providing support for h2. in model 3, when we add offline social network, the coefficient of social media use decreases (β=0.047, p<0.05) and the coefficient of offline social network becomes significant (β=0.084, p<0.05). meanwhile, the chi-squared statistics suggest that the model improved significantly (δχ2=6.04, p<0.05). the results offer preliminary support for hypothesis h3 (baron & kenny, 1986). we further confirm h3 by using the bootstrapping method due to its inherent advantages (hayes, 2013; kenny & judd, 2014; preacher & hayes, 2008) over the technique of baron and kenny (1986). we apply bootstrapping with model 4 in spss process (hayes, 2013). with 5000 bootstrapping samples, the results show that social media use has an indirect effect on entrepreneurial entry (β=0.0033, 95% confidence interval: 0.0008-0.0065) while the direct effect is also significant (β=0.0465, 95% confidence interval: 0.0066-0.0864). thus, the results support hypothesis h3. the moderating effect of trust propensity is also reported in model 5 of table 2. in the table, the interaction of social media use and trust propensity is significant and negative (β=-0.017, p<0.05), along with a significant change from model 4 to model 5 (δχ2=4.66, p<0.05). this provides support for hypothesis h4. in fig. 2, we depict the moderating effects, where social media use of high trust propensity individuals has a weaker impact on entrepreneurial entry. additionally, model 6 displays the results for all study variables, suggesting the model is robust. we performed additional robustness checks by using alternative measurements for social media use and trust propensity.
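as a rough illustration of the estimation sequence reported above (models 1, 2, 3, and 5), the following sketch fits the same specifications with statsmodels instead of stata/spss; it assumes the panel data frame and the hypothetical variable names from the earlier sketch and is not the authors' original code.

```python
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

CONTROLS = "gender + age + age_sq + education + personal_income + family_income"

def fit_models(panel):
    # model 1 (h1): ols of the offline social network on social media use
    m1 = smf.ols(f"offline_network ~ social_media_use + {CONTROLS}", data=panel).fit()

    # model 2 (h2): logit of entrepreneurial entry on social media use
    m2 = smf.logit(f"entry ~ social_media_use + {CONTROLS}", data=panel).fit()

    # model 3 (h3, preliminary): adding the mediator; a weakened social media
    # coefficient plus a significant mediator is consistent with partial mediation
    m3 = smf.logit(f"entry ~ social_media_use + offline_network + {CONTROLS}",
                   data=panel).fit()

    # model 5 (h4): interaction of social media use and trust propensity
    m5 = smf.logit(
        f"entry ~ social_media_use * trust_propensity + offline_network + {CONTROLS}",
        data=panel).fit()

    # variance inflation factors for the predictors of model 1 (column 0 is the intercept)
    exog = m1.model.exog
    vifs = [variance_inflation_factor(exog, i) for i in range(1, exog.shape[1])]
    return m1, m2, m3, m5, max(vifs)
```

the indirect effect reported above can be approximated in the same setting by drawing bootstrap samples of the rows (e.g., 5000 draws), re-estimating models 1 and 3 on each draw, and taking percentile confidence intervals of the product of the two path coefficients. the robustness checks described next substitute the alternative measures into these same specifications.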
first, as social media is a communication channel on the internet, we used an item measuring the degree of importance of the internet as a communication channel (scored on a 1-5 scale from "very unimportant" to "very important") as an alternative measure of social media use. the results of the analysis with the alternative measure are in table 3 and are largely consistent with our original analysis except for the moderating effect of trust propensity. second, because a high trust propensity individual is more likely to trust others, and vice versa for a low trust propensity individual, we used an alternative dichotomous measure of trust propensity capturing whether people consider others mostly trustworthy or prefer greater caution when getting along with others (coded 1 for "most people are trustworthy" and 0 for "the greater caution, the better"). the results of the analysis with the alternative measure of trust propensity are reported in table 4 and offer support for the moderating effect of trust propensity.
we assessed endogeneity issues using the two-stage least squares instrumental variables (2sls-iv) approach. there is a possibility that social media use may not be fully exogenous and could be under the influence of certain unobservable characteristics that also influence offline social network. following prior literature (semadeni et al., 2014), we treated social media use as an endogenous variable and reassessed our results on offline social network. in our model, we identified two instruments to investigate potential endogeneity issues. to investigate endogeneity, we used two instrumental variables (iv): (1) online work and (2) online entertainment. we operationalized the two ivs through the frequency of using the internet to work and the frequency of using the internet to entertain, respectively. first, as people can work or entertain on social media, we suggest that these two ivs are correlated with social media use and satisfy the correlation condition for the endogenous variable. second, the ivs should not be directly correlated with the error terms of the estimations on offline social network, because working and entertaining online are not direct social activities; users go online to work or to be entertained rather than to socialize. hence, online work and entertainment should not directly impact offline social network in a strong manner. empirically, in the first stage result in model 1, the results of the instruments on the potentially endogenous variable are, by and large, significant, suggesting the relevance of the instruments. also, the results of the cragg-donald f-statistics show that the instruments are strong (f=9342.66). moreover, the results of the overidentification estimations suggest that the instruments are exogenous (sargan statistic p=0.55) (semadeni et al., 2014). thus, the results statistically suggest that both ivs satisfy the conditions to qualify as ivs. last but not least, both the durbin (p<0.01) and wu-hausman (p<0.01) tests confirm the endogeneity. the results of the iv estimation, reported in table 5, are similar to the previous results. the outcomes of the two-stage estimations are consistent with the regression outcomes in the previous analysis.
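the two-stage logic of this iv analysis can be illustrated with the sketch below; the instrument columns (online_work_freq, online_entertain_freq) are hypothetical placeholders, and because a manual two-step does not correct the second-stage standard errors, a dedicated iv estimator (for example, the iv2sls estimator in the linearmodels package, which also reports weak-instrument and overidentification diagnostics) would be preferred for inference.

```python
import statsmodels.formula.api as smf

CONTROLS = "gender + age + age_sq + education + personal_income + family_income"

def manual_2sls(panel):
    # first stage: regress the endogenous regressor (social media use) on the
    # two instruments and the controls; the first-stage f-statistic gauges relevance
    first = smf.ols(
        f"social_media_use ~ online_work_freq + online_entertain_freq + {CONTROLS}",
        data=panel).fit()
    panel = panel.assign(smu_hat=first.fittedvalues)

    # second stage: replace social media use with its first-stage fitted values
    second = smf.ols(f"offline_network ~ smu_hat + {CONTROLS}", data=panel).fit()
    return first, second
```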
these outcomes of the iv estimation empirically confirm that social media use positively affects offline social network, even after considering the endogeneity issues.
despite social media being dominated by weak ties and the substantial noise of false, inaccurate or even fake information, our findings reveal that individuals with higher social media use tend to undertake entrepreneurial entry. this is consistent with the positive benefits of higher social capital or a larger social network (galkina & chetty, 2015; johanson & vahlne, 2009). our results suggest that higher social media use indicates a higher probability of a larger social media (online) network, which provides higher social capital that benefits entrepreneurial entry. our finding that the positive influence of the offline social network on entrepreneurial entry is also due to the network effect extends the research on the offline social networks of entrepreneurs (chell & baines, 2000; dubini & aldrich, 1991; klyver & foley, 2012). the literature suggests that social networks influence entrepreneurs' decision making and actions, and entrepreneurs require a strong social network to succeed in the entrepreneurial process (jenssen & koenig, 2002; witt, 2004). our findings, using instrumental variable analysis, suggest that higher social media use enhances individuals' offline social networks. this finding is consistent with past evidence that users often used social networking sites to connect with family and friends (subrahmanyam et al., 2008). unlike past studies that simply indicate an overlap between social media and offline network associates (mcmillan & morrison, 2006), our instrumental variable analysis helps to establish the impact of online networks on offline networks, suggesting social media enhances offline networks and subsequently entrepreneurial entry. specifying mediation models is essential to the advancement of research domains, and hence this study helps research on social media in entrepreneurship to further develop beyond its nascent stage (yu et al., 2018). finally, our finding that trust propensity moderates the influence of social media use on individuals' entrepreneurial entry suggests that social media, which is dominated by weak ties and substantial noise from false, inaccurate or even fake information, is in fact beneficial to entrepreneurial entry. such benefit may be smaller for people who are more trusting. specifically, our findings indicate that an individual's trust propensity plays a critical role in their use of social media and the outcome they experience.
our results have important implications for practice. first, as social media can help individuals build networks that help with business resources and information both locally and remotely, people can target social media to help refine and validate entrepreneurial ideas and secure much needed resources for entrepreneurial launch.
second, as individuals' trust propensity enhances or hinders the positive role of social media on entrepreneurial entry, potential entrepreneurs may specifically aim to apply more caution to their online contacts to obtain higher benefit from social media use for entrepreneurial entry. finally, given the role of social media in entrepreneurship, social media platforms may more specifically promote and facilitate networking of individuals to increase the level of entrepreneurial activity that can be enhanced via social media. our study has limitations and offers opportunity for further inquiry. first, theoretically, we used social network theory, and another theoretical framework may identify other possible mechanisms. for instance, an identification based theory may argue that social media use's influence on entrepreneurial entry could also be attributed to identity change in individuals due to network associates as theorized by mahto and mcdowell (2018) . however, given the lack of information about network associates on social media, identity change may be a remote probability. empirically, we operationalized offline social networks using gift expenses that serve as a proxy for the offline social network. the large nationally representative survey we used contained only expenditure on family relationships, yet individuals also need to expend similarly on gifts, eating out, etc. to maintain relationships with work acquaintances, partners, clients, former school mates, distant relatives, etc. hence, the expenditure on other relationships may mirror the expenditure on family relationships captured by this survey. we acknowledge these limitations and call for future research to search for alternative measures of social networks in other datasets. third, we caution readers in generalizing the findings of our study outside of china due to the study sample. china is different from other countries in terms of its cultural, legal, and social environment, which may affect respondent behavior on social media and entrepreneurial launch. thus, we suggest scholars empirically examine our model in other cultures. our study addresses the effect of social media on the entrepreneurship process, especially the pre-launch phase, by assessing the link between social media use and entrepreneurial entry. we use social capital theory to explain the link between social media use and entrepreneurial entry. we further argue that this relationship is contingent on individuals' trust propensity. thus, individuals with low trust propensity are more likely to benefit from social media use for entrepreneurial entry compared to individuals with high trust propensity. we also find that social media use strengthens individuals' offline social networks, which further aids their entrepreneurial entry. in conclusion, a key message is that social media can help individuals' transition to entrepreneurship. and practice, journal of applied psychology, journal of small business management, and family business review, etc. raj serves on editorial review boards of family business review and international entrepreneurship and management journal. he is also an associate editor of the journal of small business strategy and journal of small business management. wei deng is a phd candidate major in organization management at school of management, xi'an jiaotong university. his research interests include social entrepreneurship, entrepreneurial bricolage, and female entrepreneurship. 
his research has been published in journal of business research, asia pacific journal of management, and others. stephen x. zhang is an associate professor of entrepreneurship and innovation at the university of adelaide. he studies how entrepreneurs and top management teams behave under uncertainties, such as the impact of major uncertainties in the contemporary world (e.g. covid-19 and ai) on people's lives and work. such research has also given stephen opportunities to raise more than us$1.5 million of grants in several countries. prior to his academic career, stephen has worked in several industries and has founded startups. w. wang, et al. technological forecasting & social change 161 (2020) 120337 examining the impact of managerial involvement with social media on exporting firm performance. int it's not you, it's them: social influences on trust propensity and trust dynamics knowledge-based approaches for identity management in online social networks does the utilization of information communication technology promote entrepreneurship: evidence from rural china the moderator-mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations beyond social capital: the role of entrepreneurs' social competence in their financial success necessity and opportunity entrepreneurs: survival factors social media and innovation: a systematic literature review and future research directions entrepreneurial behavior: its nature, scope, recent research, and agenda for future research from friendfunding to crowdfunding: relevance of relationships, social media, and platform activities to crowdfunding performance what the numbers tell: the impact of human, family and financial capital on women and men's entry into entrepreneurship in turkey networking, entrepreneurship and microbusiness behaviour the joint moderating role of trust propensity and gender on consumers' online shopping behavior how to enhance smes customer involvement using social media: the role of social crm characters' persuasion effects in advergaming: role of brand trust, product involvement, and trust propensity the role of social and human capital among nascent entrepreneurs how and when social media affects innovation in start-ups. a moderated mediation model micro-multinational or not? international entrepreneurship, networking and learning effects the impact of social media on resource mobilisation in entrepreneurial firms do social capital building strategies influence the financing behavior of chinese private small and medium-sized enterprises? personal and extended networks are central to the entrepreneurial process the impact of environment and entrepreneurial perceptions on venture-creation efforts: bridging the discovery and creation views of entrepreneurship social influence in career choice: evidence from a randomized field experiment on entrepreneurial mentorship search and discovery by repeatedly successful entrepreneurs bridging online and offline social networks: multiplex analysis social interaction via new social media: (how) can interactions on twitter affect effectual thinking and behavior? propensity to trust salespeople: a contingent multilevel-multisource examination social capital and opportunity in corporate r & d: the contingent effect of contact density on mobility expectations effectuation and networking of internationalizing smes. 
manag trust and tam in online shopping: an integrated model the strength of weak ties getting a job: a study of contacts and careers resource search, interpersonal similarity, and network tie valuation in nascent entrepreneurs' emerging networks transformation as a challenge: new ventures on their way to viable entities introduction to mediation, moderation, and conditional process analysis: a regression-based approach entrepreneurship in china consumer brand engagement and its social side on brand-hosted social media: how do they contribute to brand loyalty? business networks and cooperation in international business relationships returns to military service in off-farm wage employment: evidence from rural china what role does self-efficacy play in developing cultural intelligence from social media usage? electron the effects of embeddedness on the entrepreneurial process the effect of social networks on resource access and business start-ups networking and entrepreneurship in place the uppsala internationalization process model revisited: from liability of foreignness to liability of outsidership social media and disaster management: case of the north and south kivu regions in the democratic republic of the congo power anomalies in testing mediation trust, perception, and managerial use of market information. int urbansocialradar: a place-aware social matching model for estimating serendipitous interaction willingness in korean cultural context networking and culture in entrepreneurship do social networks affect entrepreneurship? a test of the fundamental assumption using large sample, longitudinal data do social networks improve chinese adults' subjective well-being? the myopia of learning the impact of social media on the business performance of small firms in china the internationalization and performance of smes from virtual community members to c2c e-commerce buyers: trust in virtual communities and its effect on consumers' purchase intention entrepreneurial motivation: a non-entrepreneur's journey to become an entrepreneur the diminishing effect of vc reputation: is it hypercompetition? time orientation and engagement with social networking sites: a cross-cultural study in austria, china and uruguay overcoming network overload and redundancy in interorganizational networks: the roles of potential and latent ties mediational inferences in organizational research: then, now, and beyond stories from the field: women's networking as gender capital in entrepreneurial ecosystems coming of age with the internet: a qualitative exploration of how the internet has become an integral part of young people's lives adding interpersonal learning and tacit knowledge to march's exploration-exploitation model the dark side of trust: the benefits, costs and optimal levels of trust for innovation performance investigating social media as a firm's signaling strategy through an ipo the value of social media for innovation: a capability perspective organizational technological opportunism and social media: the deployment of social media analytics to sense and respond to technological discontinuities knowing in practice: enacting a collective capability in distributed organizing can microblogs motivate involvement in civic and political life? 
examining uses, gratifications and social outcomes among chinese youth knowledge-centered culture and knowledge sharing: the moderator role of trust propensity asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models virtual work: bridging research clusters structuring the technology entrepreneurship publication landscape: making sense out of chaos online and offline social networks: investigating culturally-specific behavior and satisfaction regional social capital: embeddedness, innovation networks and regional economic development the perils of endogeneity and instrumental variables in strategy research: understanding through simulations the promise of entrepreneurship as a field of research embracing digital networks: entrepreneurs' social capital online online relationship marketing strategic networks and entrepreneurial ventures online and offline social networks: use of social networking sites by emerging adults depressive costs: medical expenditures on depression and depressive symptoms among rural elderly in china student loneliness: the role of social media through life transitions opportunity identification and pursuit: does an entrepreneur's human capital matter? the role of risk and trust attitudes in explaining residential energy demand: evidence from the united kingdom wechat use intensity and social support: the moderating effect of motivators for wechat use trust and knowledge creation: the moderating effects of legal inadequacy social media effects on fostering online civic engagement and building citizen trust and trust in institutions entrepreneurs' networks and the success of start-ups social capital: implications for development theory, research and policy changing paradigm of international entrepreneurship strategy an introduction to the china family panel studies (cfps). chin individual-level ambidexterity and entrepreneurial entry entrepreneurship in international business: an institutional perspective consequences of downward envy: a model of selfesteem threat, abusive supervision, and supervisory leader self-improvement key: cord-306654-kal6ylkd authors: li, yuhong; chen, kedong; collignon, stephane; ivanov, dmitry title: ripple effect in the supply chain network: forward and backward disruption propagation, network health and firm vulnerability date: 2020-10-10 journal: eur j oper res doi: 10.1016/j.ejor.2020.09.053 sha: doc_id: 306654 cord_uid: kal6ylkd a local disruption can propagate to forward and downward through the material flow and eventually influence the entire supply chain network (scn). this phenomenon of ripple effect, immensely existing in practice, has received great interest in recent years. moreover, forward and backward disruption propagations became major stressors for scns during the covid-19 pandemic triggered by simultaneous and sequential supply and demand disruptions. however, current literature has paid less attention to the different impacts of the directions of disruption propagation. this study examines the disruption propagation through simulating simple interaction rules of firms inside the scn. specifically, an agent-based computational model is developed to delineate the supply chain disruption propagation behavior. then, we conduct multi-level quantitative analysis to explore the effects of forward and backward disruption propagation, moderated by network structure, network-level health and node-level vulnerability. 
our results demonstrate that it is practically important to differentiate between forward and backward disruption propagation, as they are distinctive in the associated mitigation strategies and in the effects on network and individual firm performance. forward disruption propagation generally can be mitigated by substitute and backup supply and has greater impact on firms serving the assembly role and on the supply/assembly networks, whereas backward disruption propagation is normally mitigated by flexible operation and distribution and has bigger impact on firms serving the distribution role and on distribution networks. we further analyze the investment strategies in a dual-focal supply network under disruption propagation. we provide propositions to facilitate decision-making and summarize important managerial implications.
highlights: we examine the disruption propagation in supply chains; we use agent-based modeling to delineate disruption propagation behavior; we explore the effects of forward and backward disruption propagation; we propose management strategies based on modeling results; we analyze investment strategies in a dual-focal supply network.
in today's tightly coupled supply chains, a disruption at either the supplier side or the customer side can easily wreak havoc across the entire supply chain network. during the covid-19 pandemic, the global supply chains face both supply shortages and demand shrinkage, which might lead to simultaneous or sequential forward and backward propagations of disruptions. for instance, the pandemic caused the operations suspension in china in february and march 2020, which further disrupted us and european manufacturers and retailers because of supply shortage (ivanov, 2020; thomas, 2020). additionally, the stay-at-home order during the covid-19 pandemic has caused demand disruption to the travel and tourism-related industries. then the disruption diffuses to airline companies, hotels, and restaurants and further negatively influences their associated supply companies (crs insight, 2020). the diffusion of an operational disruption beyond its origin and across the entire network is termed disruption propagation (basole & bellamy, 2014; bierkandt et al., 2014; garvey et al., 2015; scheibe & blackhurst, 2018), also known as the ripple effect (dolgui, ivanov, & sokolov, 2018; ivanov, sokolov, & dolgui, 2014). the propagating effects make the impacts of a local disruption unpredictable, hence hard to prepare for and manage. traditional supply chain risk management normally starts with risk identification and ends with different strategies to manage the identified risks (craighead, blackhurst, rungtusanatham, & handfield, 2007). this approach is effective in coping with existing or anticipated disruptions, but less effective in handling abrupt or unexpected ones. for the latter, it is important for firms to build resilience that allows the firms to best prepare for, quickly respond to, and recover from unexpected disruptions (chowdhury & quaddus, 2017; pettit, fiksel, & croxton, 2010). in practice, choosing the optimal level of resilience is a critical decision to make, as over-capacity incurs unnecessary costs while under-capacity exposes firms to risks (fiksel, polyviou, croxton, & pettit, 2015).
a comprehensive understanding of supply chain disruption propagation and how it affects both individual firms and the whole supply chain can support various levels of decision-making in terms of resilience investment. from the preceding examples, disruptions can propagate either from the supplier side (forward disruption propagation) or from the buyer side (backward disruption propagation). also, in practice, the mitigation strategies associated with disruptions from the supplier side and the demand side are distinctive. for example, kroger provides additional distribution channels, such as the service of ordering online and curbside pick-up, and contracts with instacart for grocery delivery service, to manage the demand disruption during the covid-19 pandemic. comparatively, to mitigate a supply disruption, firms normally increase stock levels or look for substitute supply or backup suppliers. therefore, to effectively mitigate disruption risks, it is critical to understand how different types of disruption propagation influence both the individual firm and the whole supply chain network. at the firm level, the impact of a local disruption varies with the firm's resilience and its position in the scn. this causes differences in firm vulnerability across the supply chain. understanding each firm's vulnerability level can guide proper firm-level resilience investment. at the network level, the whole supply chain network performance is the integrated performance of individual firms inside the supply network, which can be measured by network health: the number of healthy (i.e., undisrupted) firms at a specific time point (basole & bellamy, 2014). investigating how a local disruption affects network health allows supply chain managers to allocate resources optimally across the scn, effectively manage disruption propagation, and achieve better network performance. although supply chain research has shown an increasing interest in disruption propagation (basole & bellamy, 2014; dolgui, ivanov, & sokolov, 2018; marchese & paramasivam, 2013), the current literature is still limited in the following two aspects. first, the majority of current studies either focus on one specific direction of disruption propagation, namely forward propagation along the material flow (han & shin, 2016) or backward propagation in a reverse direction of the material flow, or treat them with no difference (basole & bellamy, 2014; li, zobel, seref, & chatfield, 2020). although a limited number of studies have considered both forward and backward disruption propagation (garvey, carnovale, & yeniyurt, 2015; ojha, ghadge, tiwari, & bititci, 2018) using the bayesian network approach in a simple supply chain structure context, the forward and backward propagations are still treated as separate perspectives. as these two types of disruption propagation mechanisms are distinctive in practice, we are motivated to consider both directions of the ripple effect and examine their marginal and joint effects on firm and scn performance from a complex network perspective. second, despite the mature literature on supply chain resilience, very few studies have examined the interplay of resilience investment between dual focal firms that share a common supply base subject to risk and disruption propagation. viewing firms as embedded in an scn, we address the research gaps by answering the following two research questions (rqs). rq1: how do forward and backward disruption propagation, interacting with the network structure, influence network-level health and node-level vulnerability?
rq2: how can a focal firm's resilience investment in its scn influence this focal firm itself and other focal firms in the network, given the existence of shared supply base? to address these questions, we first introduce a theoretical framework that illustrates the disruption propagation mechanism, which articulates the interplay between node-level influencing factors, network structure, disruption propagation, and associated investments. in this framework, we identify the origins and mechanisms of forward and backward disruption propagation. as forward and backward disruption propagation are associated with distinctive mitigation strategies, differentiating them can support effective decision-makings, especially in limited resources. we use the agent-based simulation to model the disruption propagation behavior and then conduct multi-level quantitative analyses based on the simulation data. our results show that the impacts of forward and backward disruption propagation, both on network-level health and node-level vulnerability, are distinctive and moderated by the network structure. thus, to effectively mitigate the disruption, practitioners should consider the resilience capacities related to different types of disruption propagation and network structure. additionally, we perform a game theoretical analysis to evaluate the influence of one focal firm's resilience investment on other focal firms. in contrast with the commonly used ego network, which includes only one focal firm, we investigate a supply network of dual focal firms that have commonly observed shared supply base (wang, li, & anupindi, 2015) . our results show that examining the broader industrial network beyond a firm's ego network and enhancing supply chain visibility can support better decision-makings to mitigate disruption risks. in this sense, we extend the traditional -triad‖ structure (choi & wu, 2009 ) that has one buyer and two suppliers to two buyers and one shared supply base. the game theoretical analysis indicates that one focal firm's investment decision should consider the benefit-cost ratios of both its own and the other focal firm. the remainder of the paper is organized as follows. section 2 presents a review of the literature on supply chain disruption propagation. the disruption propagation mechanism is described in detail in section 3. we design the experiment in section 4. we perform the empirical analysis and provide the propositions derived from the results in section 5. section 6 summarizes the managerial implications. we conclude the paper with a discussion on the contributions and limitation in section 7. supply chain disruption propagation, also known as the ripple effect, has drawn increasing academic interest recently, due to the significant global economic loss caused by various disruption events such as the 2011 thailand flood, the 2012 japan earthquake, and the 2020 covid-19 pandemic (ivanov, 2020) . disruption propagation / ripple effect refers to that an operational failure at one entity of the scn causes operational failures of other business entities (dolgui et al., 2018; nguyen & nof, 2019) . this concept is different from the bullwhip effect (dolgui, ivanov, & rozhkov, 2020; lee, padmanabhan, & whang, 1997) , as the bullwhip effect is triggered by small demand vulnerabilities but does not necessarily imply a severe operational failure (chatfield, hayya, & cook, 2013; x. wang & disney, 2016) . 
within the research scope of operational failure, there are studies on the ripple effect that mainly focus on downward disruption propagation from the supplier side (ivanov 2018; ivanov, sokolov, and dolgui 2014) , the snowball effects where impacts can transmit and get ampli-fied towards a larger number of firms in the supply chain (swierczek, 2016; świerczek, 2014) , the backward disruption propagation that means disruptional effects diffuse backwards opposite to the direction of the material flows , as well as the general disruption propagation both from supplier and demand sides (basole & bellamy, 2014; k. zhao, zuo, & blackhurst, 2019) . various approaches have been adopted in the current studies on disruption propagation. first of all, modeling and simulation methods are widely used in this field (ivanov, 2017) , including agent-based simulation from a complex network perspective (basole & bellamy, 2014; tang, jing, he, & stanley, 2016; , investigating risk propagation using bayesian network approaches (garvey & carnovale, 2020; garvey et al., 2015; hosseini, ivanov, & dolgui, 2019; ojha et al., 2018) , numerical models to simulate indirect effects in the global supply chain using the input-output model wenz et al., 2014) , the entropy approach to study the vulnerability of cluster scn during the cascading failures (zeng & xiao, 2014) , and other operations research methods (ivanov, pavlov, & sokolov, 2014; kinra, ivanov, das, & dolgui, 2019; liberatore, scaparra, & daskin, 2012; pavlov, ivanov, pavlov, & slinko, 2019; sinha, kumar, & prakash, 2020) . second, there are qualitative studies investigating disruption propagation from different aspects. ivanov (2018) and dolgui et al. (2018) addressed the ripple effect, analyzed major recent publications, and delineated research perspectives in the domain. scheibe & blackhurst (2017) provided theoretical insights into the risk propagation using the grounded theory case study approach. deng et al. (2019) , through a case study as well, explored risk propagation mechanisms and put forward the feasible countermeasures for perishable product supply chain to improve sustainability. at last, there are also several related empirical studies. goto, takayasu, & takayasu (2017) derived a stochastic function of risk propagation from comprehensive data of bankruptcy events in japan from 2006 to 2015. świerczek (2014) explored the relationship between supply chain integration and the snowball effect. zhang, chen, & fang (2018) surveyed 31 chinese firms involved in the auto-industry and explored the transmission of a supplier's disruption risk along the supply chain. the aforementioned studies have greatly enriched our understanding of the disruption propagation / ripple effect phenomenon. however, the literature falls short in two aspects. first, although a firm's operational failure can result from either its suppliers' or customers' disruption in practice, to the best of our knowledge, there are no recent studies considering the different impacts of forward and backward disruption propagation on the supply chain. a study considering both directions of disruption propagation is more comprehensive and realistic. understanding disruption propagation comprehensively is crucial to the identification of effective techniques for supply chain risk management. second, there are limited studies on the disruption propagation from a complex network perspective (basole & bellamy, 2014; . 
the majority of the current studies still focus on a simple supply chain structure, such as an ego network. research based on simple structures could not fully grasp the interaction between the network structure and disruption propagation mizgier, jüttner, & wagner, 2013; . for example, studies based on a single firm's ego network ignore the influence of disruptions originated outside of the ego network. motivated by the research gaps, we study the forward and backward disruption propagation and contribute to the literature in the following aspects. first, this study comprehensively delineates the disruption propagation mechanism, which differentiates between forward and backward disruption propagation and identifies the influential factors in detail. in the analysis, we find that forward and backward disruption propagation influences node vulnerability and network health in distinctive ways. thus, investment strategies to reduce both directions of propagation may differ significantly in practice. second, this study investigates disruption propagation from a complex network perspective. extending the traditional perspective of a firm's ego network, we examine a realistic industrial network with two focal firms. the industrial network more realistically reflects how network structure interacts with disruption propagation and how one focal firm should make resilience investment decisions subject to other focal firms' actions. based on the analysis of the industrial network, this study proposes to develop effective strategies for one focal firm to benefit from other focal firms' resilience investment. disruptions can diffuse along the material flow as well as in a reverse direction. such a complex behavior of disruption propagation in the supply chain network can originate from simple interaction rules among firms. to capture the disruption propagation, we characterize the basic interaction rules first in this section. in a dyadic buyer (j)-supplier (i) relationship, a disruption can diffuse either from a supplier to a buyer or from a buyer to a supplier. the forward disruption propagation refers to the disruption diffusion from supplier i to buyer j, along the material flow . the rate of forward disruption diffusion, , is defined as the probability of a disrupted buyer at the time point t+1 if the supplier is disrupted at time t. the forward disruption diffusion is a probability because one firm's disruption may not necessarily lead to another firm's disruption . for example, the same fire in philips' plant had almost no impact on nokia but caused huge loss to ericsson (norman & jansson, 2004) . this rate of forward disruption diffusion is affected by the following factors. -the nature of the supplier's disruption. this includes the type, severity, and length of the disruption. for example, a disruption caused by cyber-attacks may influence the buyer differently from one caused by the adverse weather. -the dependence of the buyer on the supplier. if the buyer is highly dependent on the supplier (for example, the buyer sources the key components solely from this supplier), the buyer tends to be more easily disrupted by the supplier. -the buyer's resilience capacity. if the buyer has a higher resilience capacity such as the higher safety stock, better supply chain visibility, or a quicker response plan, it is less likely to be impacted by a supplier's disruption. 
comparatively, backward disruption propagation refers to the disruption diffusion from buyer j back to supplier i, which passes through the adverse direction of the material flow . when the buyer suffers a disruption, the supplier may suspend its operations to avoid producing too many supplies that the buyer does not need. for example, when hp and dell faced production disruption during the 2011 thailand flood, the operations of intel that is hp and dell's supplier were also disrupted due to a lack of demand (intel, 2011) . the rate of backward disruption diffusion, , is affected by the following factors. -the nature of the buyer's disruption, including the type, severity, and length of the disruption. -the dependence of the supplier on the buyer. if the supplier is highly dependent on the buyer (for example the majority of the supplier's revenue comes from the buyer), the supplier tends to have a higher disruption diffusion rate from the buyer's disruption. -the supplier's resilience capacity. the higher resilience capacity the supplier has (such as higher operational flexibility and supply chain visibility), the supplier is less likely to be impacted by the buyer disruption. forward and backward disruption propagation differ in two main aspects. first, the disruption propagation rates are different as the resilience capacities against forward and backward disruption propagation are distinctive. the resilience capacity against backward disruption propagation mainly relies on its operation and distribution flexibility, while the resilience capacity against forward disruption propagation largely depends on the availability of substitute resources. second, the dependence between the buyer and the supplier is mostly asymmetric in reality. for example, a small supplier whose major business comes from walmart is highly dependent on walmart, but not vice versa. based on these differences, forward and backward disruption propagation should be modeled and evaluated differently to provide better decision-making to improve supply chain resilience. in a directed scn with multiple suppliers and buyers, the disruption probability of one particular business entity depends on its relationship with all disrupted neighbors, including both suppliers and buyers. we assume every node in the scn has two states, namely healthy (h) and disrupted (d). the disruption status of a node means that the firm suspends its operations due to reasons including but not limited to inventory stockout, labor strike, extreme weather and earthquakes. there is uncertainty that a healthy node i at time t can become disrupted at time t+1 under the influence of its disrupted suppliers and buyers at time t. for example, a firm with ample safety stock that faces a supplier disruption may less likely become disrupted than a firm with a low safety stock level. to take this into consideration, we model the status transition as a probability. for a firm with multiple disrupted neighbors, we assume the disruption impact from suppliers and buyers are independent. we acknowledge that this assumption cannot grasp the full picture, as the impacts of disruptions may be interdependent in practice ( however, the assumption of independence also provides two main benefits. first, this assumption holds in many real cases, so the derived results have important practical implications. for instance, a firm has several suppliers who have their own manufacturing. 
the quality of the suppliers' products (i.e., components) determines the final product quality. as the quality of one product is not affected by other suppliers' products, disruption impacts from suppliers are independent in this sense. second, this assumption is widely used in the risk management literature (qazi et al., 2017; zhao & freeman, 2019), as it makes the model tractable. if the risk dependence estimation becomes complicated and difficult, more errors can occur, and relying on incorrect estimation can be costly in practice (zhao & freeman, 2019). future studies can build on and relax the assumption of independence. in this way, for a given node i, the transition probability from healthy status to disrupted status is p_i^{hd}(t) = 1 - \prod_{k \in ds_i(t)} (1 - fr_{ki}(t)) \cdot \prod_{j \in dc_i(t)} (1 - br_{ji}(t)) (3.1), where p_i^{hd}(t) stands for node i's transition probability from being healthy at time t to being disrupted at t+1; ds_i(t) and dc_i(t) stand for the disrupted suppliers and disrupted customers of node i, respectively; fr_{ki}(t) represents the probability of forward propagation from supplier k to node i at time t; and br_{ji}(t) indicates the probability of backward propagation from customer j to node i at time t. a disrupted node can recover and become healthy. let rc_i(t) be the recovery probability of node i from being disrupted at time t to being healthy at time t+1, regardless of the statuses of suppliers and buyers. assuming the recovery is independent from the influences of suppliers and buyers, we model the transition probability of node i from being disrupted at time t to being healthy at t+1 as p_i^{dh}(t) = rc_i(t) \cdot \prod_{k \in ds_i(t)} (1 - fr_{ki}(t)) \cdot \prod_{j \in dc_i(t)} (1 - br_{ji}(t)) (3.2). formula (3.2) implies that when some of its suppliers and buyers are disrupted, node i's ability to recover will be discounted, and its recovery process will slow down. the complex disruption propagation behavior within the scn emerges from the aforementioned interaction rules among nodes. figure 1 depicts the disruption propagation mechanism. each individual node has its own specific fr, br, and rc values based on its dependency on its neighbors and its resilience capacity. these node-level influencing factors interact with the network structure, determine the nodes' transition probabilities, and ultimately shape the disruption propagation across the supply chain network. disruption propagation exerts effects on both node and network levels. at the node level, some nodes have more frequent disruptions than others during the disruption propagation process. supply chain managers should therefore be more concerned with those nodes. at the network level, the disruption varies with and is moderated by different node-level factors and the network structure. given a network structure, supply chain managers can control and mitigate the disruption propagation at both node and network levels through proper investment that can change the node-level factors fr, br and rc. based on the framework, we investigate the impacts of disruption propagation at both node and network levels by designing experiments that interact node-level factors with the network structure in the next section. the purpose of this study is to investigate how fr and br influence the ripple effect differently. given a supply chain network g = (v, e), consisting of the set v of nodes and the set e of directed edges, each node has two possible statuses: 1 represents healthy and 0 represents disrupted. the transition probability of each node is determined by the current status of itself and all of its neighbors' current statuses as in eq. (3.1) and eq. (3.2).
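to make the two transition rules concrete, the following is a minimal python sketch of one synchronous simulation step. the graph type (a networkx digraph whose edges follow the material flow), the homogeneous rates fr, br, rc, and all function and variable names are illustrative assumptions rather than the authors' implementation.

```python
import random
import networkx as nx

def step(G, status, fr, br, rc, rng=random):
    """one synchronous update of eqs. (3.1)-(3.2); status maps node -> 'h' (healthy) or 'd' (disrupted)."""
    new_status = {}
    for i in G.nodes:
        # disrupted in-neighbors (suppliers) and out-neighbors (customers) of i
        n_bad_suppliers = sum(1 for k in G.predecessors(i) if status[k] == "d")
        m_bad_customers = sum(1 for j in G.successors(i) if status[j] == "d")
        # probability that i escapes infection from every disrupted neighbor
        escape = (1 - fr) ** n_bad_suppliers * (1 - br) ** m_bad_customers
        if status[i] == "h":
            new_status[i] = "d" if rng.random() < 1 - escape else "h"   # eq. (3.1)
        else:
            new_status[i] = "h" if rng.random() < rc * escape else "d"  # eq. (3.2)
    return new_status

# tiny illustrative network: two suppliers feed a manufacturer that ships to one buyer
G = nx.DiGraph([("s1", "m"), ("s2", "m"), ("m", "b")])
status = {n: "h" for n in G.nodes}
status["s1"] = "d"  # seed a small local disruption
for _ in range(5):
    status = step(G, status, fr=0.3, br=0.3, rc=0.5)
print(status)
```

in this sketch the escape probability collects the (1 - fr) and (1 - br) factors of eqs. (3.1) and (3.2), so a node with more disrupted neighbors is both more likely to fail and slower to recover.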
to formulate such a problem, the model is set up as locally interacting markov chains, also known as probabilistic cellular automata (pca) (fernández, louis, & nardi, 2018) . the state space of such a model is the tensor product of the statuses of all the local markov chains, which is huge in our context. for example, for an individual node with 3 suppliers and 2 customers, there are in total combinations of statuses. enumerating the transition probabilities for all the possibilities is challenging. this might work for a small sized network, but it will eventually become infeasible when the size or complexity of the network grows (garvey & carnovale, 2020) . in fact, the area of pca acknowledges its complexity and suggests that it is used as a flexible modeling, such as agent-based modeling, and simulation framework in an applied context (fernández et al., 2018) . therefore, we implement agent-based simulation (abs) in this study for the following benefits. firstly, abs allows us to re-create and predict the performance of complex systems or phenomenon through simulating the simultaneous interactions of agents (basole & bellamy, 2014; zhang et al., 2020) , and the supply chain network is such a complex system. second, abs allows us to explore and investigate separate effects of factors and interactions among factors through numerical experiments under a variety of settings (zhang et al., 2020) . moreover, abs provides a visual and easy-tounderstand approach that both researchers and practical audience can comprehend, which leads to a broader practical prospect of application. this approach can also be extended to other purposes. for example, it can easily extend to the heterogeneous setting of fr, br and rc to observe the behavior of a detailed supply chain network. to explore the disruption propagation within the scn, we conduct our main analysis using the japan automotive industry scn with two focal firms. to make sure the findings can apply to boarder types of networks, we generate two comparable random networks as the robustness check. compared with studies using one focal firm's ego network, our setup allows us to compare the impacts of disruption propagation on different focal firms, as well as other entities inside the scn. we choose honda and toyota as the focal firms. both of them are the largest automobile makers in japan. their supply chains are typical and highly interacting complex networks exposed to various disruption risks (wagner and bode 2006). we construct the supply network using the bloomberg splc database and select the first-and second-tier suppliers of the focal firms. we select only cogs (cost of goods sold), which refers to direct costs of producing the goods sold, relationships where the percentage of cogs is over 0.5%, in order to get the most significant material flows. this gives us a network with 121 nodes and 193 links. there are 63 common suppliers at both tiers 1 and 2 for toyota and honda. we conduct experiments on the automobile supply network to discover how the node-level factors result in the disruption propagation and cause disruption at both node and network levels. -experimentation is a powerful methodology that enables researchers to establish causal claims empirically‖ (imai, tingley, & yamamoto, 2013) . 
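a sketch of how such a two-focal-firm network and its random benchmark could be assembled with networkx is given below; the toy edge list, the helper names, and the use of gnm_random_graph as the comparable random network are assumptions for illustration, since the original bloomberg splc extract is not reproduced here.

```python
import networkx as nx

def build_scn(edge_list):
    """directed scn: an edge (i, j) means supplier i ships to buyer j (material flow)."""
    return nx.DiGraph(edge_list)

def comparable_random_network(G, seed=42):
    """random directed network with the same numbers of nodes and links, used as a robustness benchmark."""
    return nx.gnm_random_graph(G.number_of_nodes(), G.number_of_edges(), seed=seed, directed=True)

def common_suppliers(G, focal_a, focal_b):
    """suppliers (any upstream tier) reachable by both focal firms."""
    return nx.ancestors(G, focal_a) & nx.ancestors(G, focal_b)

# illustrative toy edge list; the real network (121 nodes, 193 links) comes from bloomberg splc
edges = [("s1", "t1"), ("s2", "t1"), ("s2", "t2"), ("t1", "honda"), ("t1", "toyota"), ("t2", "toyota")]
G = build_scn(edges)
R = comparable_random_network(G)
print(len(common_suppliers(G, "honda", "toyota")), G.number_of_edges(), R.number_of_edges())
```

matching only the numbers of nodes and links keeps the benchmark comparable in density while randomizing the in- and out-degree pattern that the later analysis focuses on.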
by designing and implementing experiments, we are able to mitigate the concern of endogeneity (antonakis, bendahan, jacquart, & lalive, 2014), which is usually a challenge for causal inference in studies using observational data (ho, lim, reza, & xia, 2017). in this study, we integrate experimental design with the agent-based simulation to obtain the data for analysis. agent-based simulation is capable of simulating interactive agents' behaviors in an attempt to understand complex phenomena (basole & bellamy, 2014; nair & vidal, 2011). for the simulation, we set the initial disruption probability at 5%, which indicates the initial disruption is small and regional. for the node-level influencing factors, we focus on the aforementioned forward disruption diffusion rate fr_{ki}(t), backward disruption diffusion rate br_{ji}(t), and recovery rate rc_i(t). due to companies' distinctive resilience capacities and different dependency levels between firms, disruption propagation varies by firm, time, and network in reality. however, we assume a homogeneous setting in which the values of the three parameters are constant across agents and over time in our analysis. this is because although heterogeneous local settings can provide a more accurate estimation of one particular case, they also offset general findings about these influencing factors. as our main research objective is to investigate how these factors influence disruption propagation, we assume homogeneous node-level factors to avoid noise from other factors. therefore, we denote fr_{ki}(t), br_{ji}(t), and rc_i(t) simply by fr, br, and rc. based on formulas (3.1) and (3.2), for a node that has n disrupted suppliers and m disrupted customers, the integrated disruption probability at time t is 1 - (1 - fr)^n (1 - br)^m for a healthy node, and the recovery probability at time t is rc (1 - fr)^n (1 - br)^m for a disrupted node. we conduct a full factorial design. the value combinations of fr, br, and rc serve as experiment treatments. we control the parameters as listed in table 1. the parameter settings represent three different levels for each parameter: low, medium, and high. the networks receive one treatment condition at a time. to control for the impact of the random disruption, we conduct each experimental run of a treatment condition 50 times. each time we collect the status of each node (disrupted or healthy) at every time point. by averaging across the 50 runs, we obtain the node-level vulnerability and network-level health defined in section 5. in addition to the main analysis, to increase the robustness of the findings, we create two random networks with the same number of nodes and links and compare their results to the results of the japanese automotive industry network. we find the results of the two random networks are consistent with each other. for simplicity of presentation, we only report results of one random network. figure 2 shows the plot of the reported random network. to begin, we visualize the network-level disruption propagation behavior in figure 3 using the number of healthy firms (basole & bellamy, 2014) with respect to different settings of the parameters. we set the max period to be 1000, which is long enough for disruption propagation to reach a steady state. from figure 3, first, we can observe that a small initial disruption can cause significant turmoil in the whole supply chain network. with an initial disruption probability of p = 0.05, the number of healthy nodes decreases in all scenarios. this means that more firms become disrupted after the initial disruption.
second, the number of healthy nodes becomes steady after around 50 steps, which signifies that, for a given set of parameters, the disruption propagation tends to reach a steady state after a fixed number of periods (basole & bellamy, 2014). thus, in the following analysis, we only focus on the steady state found from time stamps 101 to 1000. third, as expected, both forward and backward disruption diffusion rates are negatively associated with the number of healthy nodes, and the recovery rate is positively associated with it. however, it is still unclear how these influencing factors interact with the network structure to determine the performance of the supply chain and individual firms. it is also unclear how resilience investment can improve supply chain performance. to explore these questions, we conduct the following analyses. first, we investigate how the node-level influencing factors determine supply chain network performance. we use two network-level ripple effect performance measures: network health and propagation period. network health describes the overall health status of the supply chain network. we follow basole & bellamy (2014) and measure it using the number of healthy nodes at the stable status, which is operationalized as the average number of healthy nodes from step 101 to 1000 in our case. propagation period is the number of periods the supply chain network takes to settle into a steady state after the disruption; it describes the propagation speed, where a lower propagation period means a quicker propagation speed. to rigorously demonstrate the effects of node-level influencing factors on network health and propagation period, we perform ols regression analysis using models (5.1.1) and (5.1.2): in model (5.1.1) the log of network health, and in model (5.1.2) the log of propagation period, is regressed on fr, br, rc, and their interaction terms, with the subscript i standing for each observation. the dependent variable, network health or propagation period, has been log-transformed to conform to the ols regression assumptions. to mitigate multi-collinearity due to the interaction effects, we have standardized the variables; the vifs for all variables (including the interaction terms) are below 4, which is lower than the critical value of 10 (kutner, nachtsheim, neter, & li, 2005), indicating that multicollinearity is not a concern. table 2 shows the results. for this auto supply network, we find that the backward risk diffusion rate reduces supply network health and propagation period (equivalent to increasing propagation speed) at a higher rate than the forward risk diffusion rate, as the absolute value of the coefficient of br is much higher than that of fr. we conduct further analysis to look into the reasons that make the impacts of fr and br differ. as the settings of fr and br are the same, the firm disruption probability and recoverability are determined by the number of disrupted suppliers n and the number of disrupted customers m at each period. while the numbers of disrupted suppliers and customers largely depend on the network structure, we believe this difference in effects, both for network health and propagation period, may come from the network configuration. with this in mind, we compare the results of the automobile industry network with the results of the comparable random network. the auto supply network is a typical supply network with the characteristic that a majority of nodes have higher out-degree than in-degree and very few nodes have higher in-degree than out-degree. thus, the majority of nodes are subject to higher impacts of br and few nodes are subject to higher impacts of fr.
as a result, the overall supply network is subject to higher impacts from br than from fr. comparatively, the random network has most nodes with balanced in-degree and out-degree, so the impacts of fr and br on the network are similar. thus, we can derive that the difference in the impacts of fr and br comes from the network configuration, which is the distribution of in-degrees and out-degrees across nodes in the network. in general, the relative impact of fr and br on a network is determined by its configuration: a balanced network such as the random network is subject to balanced impacts of fr and br, whereas a network with a skewed distribution of in-degrees and out-degrees, such as a typical supply or distribution network, is affected more by one direction of propagation than by the other. below we propose the following. in practice, supply chain managers are not only concerned about the performance of their supply networks, but also pay great attention to the performance of individual suppliers. in this section, we use node vulnerability to describe how vulnerable an individual node is when exposed to disruption risks inside the scn. in a specific network, some firms are more vulnerable than others; thus, the supply chain manager should be more cautious about the vulnerable nodes and make proper investments to decrease their vulnerability. in our context, we measure node vulnerability as the percentage of disrupted periods for a node after the disruption propagation becomes stable. figure 4 depicts the overlay plot of node vulnerability for one parameter setting, where the dots stand for suppliers. figure 4 shows that some firms have higher vulnerability than other firms for this particular network in this scenario. examining other scenarios with different values of fr, br and rc, we find that the variation of node vulnerability is not on a consistent scale; in particular, the variation of node vulnerability is high in the scenario of high fr and br. nonetheless, we find some nodes are consistently vulnerable across all scenarios. table 3 lists the firms whose vulnerability is above 90% across all the scenarios, as well as their centrality information. compared with the supply network's average degree of 1.60, average betweenness centrality of 30.24, and average closeness centrality of 0.369, these highly vulnerable firms have higher centrality measures in the supply network. considering that the scale of node vulnerability varies across scenarios, we conjecture that node position in the network and node-level influencing factors can interact and affect node vulnerability. identifying the most vulnerable firms and understanding what contributes to node vulnerability can support decision-making against disruptions in practice. for a given scn, changing a node's centrality is usually difficult, especially in a short time frame, as the centrality is determined by the node's market position, business nature, and competitor status. in this sense, to decrease node vulnerability, supply chain managers can adjust, and are more interested in adjusting, node-level influencing factors that interact with centrality to affect node vulnerability. this requires us to understand how node-level influencing factors interact with the network structure to impact node vulnerability. we investigate the following ols regression model and present in table 4 the results for both the auto industry network and a random network. in this analysis, we introduce two new variables to describe the node position in a network. the first variable is total degree, as the measure of centrality. a robustness check shows that the results using total degree are consistent with those using betweenness or closeness centrality.
the second variable is degree difference, measured as a node's in-degree minus its out-degree. (table 4 reports the corresponding model f-statistics; standard errors are in parentheses; * p < 0.05, ** p < 0.01, *** p < 0.001.) table 4 clearly shows that both node-level factors and node position contribute to node vulnerability. specifically, nodes with higher centrality (total degree) tend to have higher vulnerability. forward and backward disruption diffusion rates (fr and br) positively impact node vulnerability. recovery rate (rc) negatively influences node vulnerability. in addition, there is another interesting observation: the impacts of forward and backward risk diffusion rates on node vulnerability vary with the degree difference. a node with a positive degree difference serves more of an assembly role in the network, and its vulnerability is then due more to forward propagation (the coefficient of the interaction between degree difference and fr is positive). practically, firms who play an important assembly role have more connections with suppliers than with customers. thus, those firms are more vulnerable to disruption propagation from the supplier side than from the customer side. therefore, to decrease their vulnerability, they should increase the safety stock level, build close relationships with suppliers, and invest in backup resources against forward disruption propagation. comparatively, a node with a negative degree difference has more links with customers and is more likely to fulfill a distribution role in the network. thus, its disruption mostly stems from backward propagation (the coefficient of the interaction between degree difference and br is negative). these firms should invest in monitoring market information closely and increasing production and operation flexibility against backward disruption propagation. based on the above analysis, we present the following propositions. node-level influencing factors and a node's position in the network can interact and contribute to node vulnerability. proposition 4a: a node with a positive degree difference (higher in-degree than out-degree) is affected more by forward disruption propagation than by backward propagation. nodes with positive degree differences (in-degree higher than out-degree) should invest in alternative supply and sourcing. proposition 4b: a node with a negative degree difference (higher out-degree than in-degree) is affected more by backward disruption propagation than by forward propagation. nodes with negative degree differences (out-degree higher than in-degree) should invest in operational and product flexibility. the purpose of understanding disruption propagation is to support effective decision-making on resilience investment against disruptions. in practice, network-level resilience investment is usually initiated by focal firms, for example by improving supply chain infrastructure, enhancing supply chain visibility, and improving cooperative and learning abilities. such kinds of investment can influence fr, br, and rc on every node inside the focal firm's ego network. in a dual-focal supply network, one focal firm's investment can influence both itself and the other focal firm, because of the existence of the shared supply base. thus, the investment decision of one focal firm may be subject to the influence of the other focal firm. in this section, we aim to discover how the investment of one focal firm influences itself, the other focal firm, and the overall scn health. then we discuss the implications for investment decisions of focal firms in a competitive environment. we consider three types of investment: honda-initiated investment, toyota-initiated investment, and collaborative investment.
the honda- and toyota-initiated investments only influence the suppliers in the respective firm's own ego supply chain, whereas the collaborative investment requires the collaboration of both focal firms and can influence the whole auto industry supply network. we set up a benchmark setting of the diffusion and recovery rates. under this setting, there is enough room for improvement (i.e., to reduce fr and br) from the perspectives of both the network and individual nodes. we assume the investment can decrease the forward and backward disruption diffusion rates in the same pattern. for a given investment level, both fr and br are scaled down accordingly; the investment level ranges from 0 to 1. our numerical results are shown in figure 5, which depicts how different investments influence the focal firms' vulnerability and the health of the whole supply network. from the perspective of network health, there is little difference between the benefits from a honda-initiated investment and from a toyota-initiated investment. unsurprisingly, the collaborative investment has better performance than honda- or toyota-initiated investments, even though the marginal benefits vary with the investment level. from the perspective of the focal firms, both honda and toyota can benefit from the other focal firm's investment, no matter whether they choose to invest in their own ego supply networks. this benefit comes from the fact that they have common suppliers. thus, one focal firm's investment can affect the disruption diffusion rates in the other firm's supply network by influencing common suppliers. from figure 5, we further notice two factors that can affect the ability of a focal firm to benefit from the other focal firm's investment: the focal firm's relationship to the shared supply base and the investment level. first, the focal firm's relationship to the shared supply network affects the gain from the other's investment. for example, suppose both honda and toyota can choose to invest at level 0.8. as shown in table 5, honda's gain from toyota's investment (0.96 - 0.74 = 0.22) is different from toyota's gain from honda's investment (0.96 - 0.86 = 0.10). having controlled the investment level and the disruption diffusion rates, we tease out the only variant, the focal firm's relationship to the shared supply network, as the contributing factor to such a difference in the two firms' benefits. we advocate future research on the mechanisms of this interesting phenomenon. we can derive the strategy space through game theoretical analysis. figure 6 shows the strategies. in the foregoing analysis, we show that differentiating between forward and backward disruption propagation can support effective decision-making to improve supply chain resilience as well as to reduce firm vulnerability. in this section, we provide the following managerial implications, intended to guide managers to better mitigate propagating disruptions based on the simulation findings. 1. managers should clarify the origin of the disruption and differentiate between forward and backward disruption propagation. our reasoning and analytical results show that forward and backward disruption propagation are distinctive in the following aspects: the origin, the mechanisms, the impacts on firm vulnerability and network health, and the mitigation strategies of the disruption. in this sense, the first and foremost step is to clarify the origin and differentiate between the two kinds of disruption propagation.
for example, the 2012 japan earthquake caused disruptions to many japanese suppliers of major automobile firms in the u.s. the disruption originated from the supply side and soon wreaked havoc across the global automobile industry through the forward propagation. on the contrary, the covid-19 pandemic has left no rung of the fashion supply chain unharmed, mainly due to the demand-side disruption and backward propagation. through clarifying the origins and the types of disruption propagation, effective mitigation and restore strategies can be established. this implication aligns with emphasizing knowledge of disruption origin and severity in previous literature (craighead et al., 2007; pettit et al., 2010) , and also introduces the prac-tical importance of differentiating forward and backward disruption propagation that is largely neglected in the literature (otto, willner, wenz, frieler, & levermann, 2017) . 2. managers need to consider different mitigation methods associated with forward and backward disruption propagation. the analysis reveals that the forward propagation causes more damage to the firms serving as an assembly entity while backward propagation results in more loss to firms serving as a distribution entity. the mitigation methods associated with forward and backward disruption propagation are different in the sense that safety stock and backup supply are used to mitigate forward disruption propagation while flexible operation and demand management are generally used to manage backward disruption propagation. moreover, because resources to mitigate and recover from disruptions are limited, firms often invest more heavily in one kind of strategy that targets a specific source or direction of disruption. for example, automobile companies who suffered from the japan earthquake have developed supplier relationship and business continuity programs that help them ensure a smooth supply, whereas fashion brands during the pandemic are implementing omni-channel retailing to promote the demand level. this implication further enhances the practical importance of differentiating between forward and backward disruption propagation. hence, an appropriate and targeted strategy is critical and can potentially save firms millions of dollars (fiksel et al., 2015; . 3. managers should consider the network topology of the industry and the structural position of the firm in the network, to cope with forward and backward disruption propagation. the simulation findings demonstrate that the network structural properties moderate the impact of forward and backward disruption propagation on operational performance at both industry and firm levels. at the network-level, a supply network or an assembly network is more exposed to forward disruption propagation whereas a distribution or logistics network is more exposed to backward disruption propagation. at the node-level, firms with a higher degree centrality are more vulnerable. specifically, a positive degree difference (higher in-degree than out-degree) is more vulnerable to forward disruption propagation, whereas a negative degree difference is more influenced by backward disruption propagation. in order to accurately assess the influence of the disruption on firms and the industry, managers should take a comprehensive consideration of the supply chain network structure. 
echoing and extending the current literature that has put great emphasis on the impact of supply chain network structure on disruption propagation (basole & bellamy, 2014), this implication supplements the current literature by explicating how the different directions of disruption propagation, interacting with network structure, influence both firm vulnerability and network health. this research also implies that a more comprehensive understanding of network structure and how it intertwines with other factors is critical to determining supply chain performance. 4. managers should take into account information beyond the focal firm's ego network. researchers and practitioners widely acknowledge the phenomenon of common suppliers shared by multiple ego networks of buying firms. we observe in the simulation that a small local disruption can propagate to other suppliers, and even to suppliers outside the focal firm's ego network, in the supply chain network. thus, focusing only on one firm's ego network underestimates the effect of disruption risks; prior work accordingly suggests extending visibility beyond the ego network (basole & bellamy, 2014; ivanov, dolgui, & sokolov, 2019) and improving information accuracy (li, zobel, & russell, 2017). 5. managers of a focal firm should consider the cost-performance of the resilience investment not only for that focal firm but also for other focal firms (buyers). due to the existence of common suppliers across multiple ego supply networks, one focal firm's resilience investment decisions can affect and be affected by other focal firms. our results present the boundary conditions under which resilience investment achieves its best outcome. figure 6 provides an example of applying the "benefit-cost ratio", which is similar to the concept of cost-performance, to determine when a firm will be better off through the other firm's resilience investment. in general, there is a threshold beyond which managers of the focal firm should consider building their own firm's resilience, regardless of other firms' risk strategies. it is critical for managers to map out the network, quantify the cost-performance, and figure out the threshold based on their supply networks and the real contexts of disruption risks. although it is widely accepted that supply chains consist of highly interactive business entities (christopher & peck, 2004; simchi-levi et al., 2014) and that a common supply base exists across different supply chains, to the best of our knowledge the current literature has not considered the influence of a common supply base on the impact and outcome of resilience investment. this implication highlights the importance of considering and incorporating the common supply base and the other focal firm's resilience investment into the decision-making process.
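the threshold logic in implications 4 and 5 can be organized as a small two-player game; a minimal sketch under assumed inputs is given below, where the health values, the cost parameters, the payoff form (health benefit minus own investment cost) and all function names are hypothetical placeholders rather than the figures reported in table 5 or figure 6.

```python
from itertools import product

def payoff(health, cost):
    """payoff of a focal firm = (simulated) health benefit minus its own investment cost."""
    return health - cost

def nash_equilibria(health_a, health_b, cost_a, cost_b):
    """
    health_a[(sa, sb)] is focal firm a's simulated health when a plays sa and b plays sb
    (1 = invest, 0 = do not invest); health_b is analogous.
    returns all pure-strategy profiles where neither firm gains by deviating unilaterally.
    """
    eq = []
    for sa, sb in product((0, 1), repeat=2):
        pa = payoff(health_a[(sa, sb)], cost_a * sa)
        pb = payoff(health_b[(sa, sb)], cost_b * sb)
        best_a = all(pa >= payoff(health_a[(d, sb)], cost_a * d) for d in (0, 1))
        best_b = all(pb >= payoff(health_b[(sa, d)], cost_b * d) for d in (0, 1))
        if best_a and best_b:
            eq.append((sa, sb))
    return eq

# hypothetical health values taken from a simulation run; costs are illustrative
health_a = {(0, 0): 0.70, (1, 0): 0.90, (0, 1): 0.80, (1, 1): 0.96}
health_b = {(0, 0): 0.70, (1, 0): 0.78, (0, 1): 0.92, (1, 1): 0.95}
print(nash_equilibria(health_a, health_b, cost_a=0.15, cost_b=0.15))
```

varying the cost parameters shifts which strategy profiles survive as equilibria, which is one way to trace out the benefit-cost thresholds discussed above.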
the results are threefold. first, at the network-level, forward and backward disruption diffusion rates do exert different effects on network health, moderated by the network structure. generally, forward disruption diffusion rate has a more severe effect on a supply network, while backward disruption diffusion rate has more impact on a distribution network. second, at the node-level, we find that both nodelevel influencing factors and node position contribute to node vulnerability. higher centrality leads to higher node vulnerability. the nodes with higher in-degree than out-degree are more sensitive to forward disruption propagation, and vice versa. finally, for a network with two focal firms, one focal firm can benefit from the other's investment. the benefit is influenced by the relationship between the focal firm and the overlapped supply network, as well as the investment level. we delineate the relationship between the focal firms' benefit-cost ratio and their strategy space, which implies that one focal firm's investment decision should consider both focal firms' benefit-cost ratios. this study contributes in many ways. theoretically, our study is among the first to consider the difference in the effects of forward and backward disruption propagation, and map them to corresponding resilience investments. moreover, this work extends the understanding of -triad‖ disruption propagation by using a network with multiple focal firms. practically, this study implies that firms should manage forward and backward propagation differently because of their distinctive origins, associated investments, and effects on node vulnerability and network health. also, a focal firm's interest is to map out not only its ego network, but also the industrial network. when making resilience investment decisions, one focal firm should consider the counterparts' decisions, its relationship to the shared network, and risk characteristics. there are several limitations of this study, which can be extended in future work. the first limitation is related to the effects of network structure on disruption propagation. our analysis is based on one realistic scn and two comparable random networks. it is enough to investigate the influence of nodelevel factors on disruption propagation, but has its limitation on addressing the effects of network struc-ture, which requires a sufficient number of samples of various network structures. considering network structure plays a critical role in disruption propagation, future studies based on a large number of network structure samples are required for further understanding of disruption propagation. the second is about the assumption of the risk independency. as disruption impacts may be interdependent in reality, future studies could relax this assumption and investigate how the interdependency of risks can influence the supply chain resilience. the third is the homogeneous setting of node-level factors required by the experiment design. although a homogenous setting allows us to effectively investigate the impacts of these factors on the ripple effect, this setting reduces the proximity to the real-world situation, as agents in a real supply chain are essentially different in terms of disruption prevention and response activities. future studies can relax this setting and individualize the agent parameter value to model a more realistic ripple effect given the supply chain structure. the fourth limitation is about the curvilinear effect of investment levels. 
our work shows that the effect exists in our specific setting. to derive the general curvilinear effect and the rules of the turning point that can be broadly applied, a well-designed experiment including different settings of network configurations and various fr, br, and rc levels should be implemented in a future study. systemic risk elicitation: using causal maps to engage stakeholders and build a comprehensive view of risks causality and endogeneity: problems and solutions. the oxford handbook of leadership and organizations supply network structure, visibility, and risk diffusion: a computational approach acclimate-a model for economic damage propagation. part 1: basic formulation of damage transfer within a global supply network and damage conserving dynamics stockout propagation and amplification in supply chain inventory systems triads in supply networks: theorizing buyer-supplier-supplier relationships supply chain resilience: conceptualization and scale development using dynamic capability theory building the resilient supply chain the severity of supply chain disruptions: design characteristics and mitigation capabilities u.s. travel and tourism and covid-19 risk propagation mechanisms and risk management strategies for a sustainable perishable products supply chain does the ripple effect influence the bullwhip effect? an integrated analysis of structural and operational dynamics in the supply chain † ripple effect in the supply chain: an analysis and recent literature overview: pca models and issues from risk to resilience: learning to deal with disruption the rippled newsvendor: a new inventory framework for modelling supply chain risk severity in the presence of risk propagation an analytical framework for supply network risk propagation: a bayesian network approach estimating risk propagation between interacting firms on inter-firm complex network evaluation mechanism for structural robustness of supply chain considering disruption propagation om forum -causal inference models in operations management a new resilience measure for supply networks with the ripple effect considerations: a bayesian network approach ripple effect modelling of supplier disruption: integrated markov chain and dynamic bayesian network approach experimental designs for identifying causal mechanisms 2011 annual report -intel corporation simulation-based ripple effect modelling in the supply chain supply chain risk management: bullwhip effect and ripple effect predicting the impacts of epidemic outbreaks on global supply chains: a simulationbased analysis on the coronavirus outbreak (covid-19/sars-cov-2) case the impact of digital technology and industry 4.0 on the ripple effect and supply chain risk analytics optimal distribution (re)planning in a centralized multistage supply network under conditions of the ripple effect and structure dynamics the ripple effect in supply chains: trade-off ‗efficiency-flexibility-resilience' in disruption management ripple effect quantification by supplier risk exposure assessment applied statistical linear models information distortion in a supply chain: the bullwhip effect exploring supply chain network resilience in the presence of the ripple effect value of supply disruption information and information accuracy network characteristics and supply chain resilience under conditions of risk propagation hedging against disruptions with ripple effects in location analysis the ripple effect. 
how manufacting and retail executive view the growing challenge of supply chain risk. deloitte development llc bottleneck identification in supply chain networks supply network topology and robustness against disruptions -an investigation using multi-agent model collaborative response to disruption propagation (crdp) in cyber-physical systems and complex networks bayesian network modelling for supply chain risk propagation modeling loss-propagation in the global supply network: the dynamic agent-based model acclimate optimization of network redundancy and contingency planning in sustainable and resilient supply chain resource management under conditions of structural dynamics ensuring supply chain resilience: development of a conceptual framework exploring dependency based probabilistic supply chain risk measures for prioritising interdependent risks and strategies supply chain disruption propagation: a systemic risk and normal accident theory perspective from superstorms to factory fires identifying risks and mitigating disruptions in the automotive supply chain measuring and mitigating the effects of cost disturbance propagation in multi-echelon apparel supply chains the -snowball effect‖ in the transmission of disruptions in supply chains: the role of intensity and span of integration the impact of supply chain integration on the -snowball effect‖ in the transmission of disruptions: an empirical evaluation of the model complex interdependent supply chain networks: cascading failure and robustness coronavirus wreaks havoc on retail supply chains globally, even as china's factories come back online the bullwhip effect: progress, trends and directions risky suppliers or risky supply chains? an empirical analysis of sub-tier supply network structure on firm performance in the high-tech sector acclimate-a model for economic damage propagation. part ii: a dynamic formulation of the backward effects of disaster-induced production failures in the global supply network modelling of cluster supply network with cascading failure spread and its vulnerability analysis transmission of a supplier's disruption risk along the supply chain: a further investigation of the chinese automotive industry consumption and performance: understanding longitudinal dynamics of recommender systems via an agent-based simulation framework modelling supply chain adaptation for disruptions: an empirically grounded complex adaptive systems approach robust sourcing from suppliers under ambiguously correlated major disruption risks key: cord-340827-vx37vlkf authors: jackson, matthew o.; yariv, leeat title: chapter 14 diffusion, strategic interaction, and social structure date: 2011-12-31 journal: handbook of social economics doi: 10.1016/b978-0-444-53187-2.00014-0 sha: doc_id: 340827 cord_uid: vx37vlkf abstract we provide an overview and synthesis of the literature on how social networks influence behaviors, with a focus on diffusion. we discuss some highlights from the empirical literature on the impact of networks on behaviors and diffusion. we also discuss some of the more prominent models of network interactions, including recent advances regarding interdependent behaviors, modeled via games on networks. jel classification codes: d85, c72, l14, z13 how we act, as well as how we are acted upon, are to a large extent influenced by our relatives, friends, and acquaintances. this is true of which profession we decide to pursue, whether or not we adopt a new technology, as well as whether or not we catch the flu. 
in this chapter we provide an overview of research that examines how social structure impacts economic decision making and the diffusion of innovations, behaviors, and information. we begin with a brief overview of some of the stylized facts on the role of social structure on diffusion in different realms. this is a rich area of study that includes a vast set of case studies suggesting some important regularities. with that empirical perspective, we then discuss insights from the epidemiology and random graph literatures that help shed light on the spread of infections throughout a society. contagion of this form can be thought of as a basic, but important, form of social interaction, where the social structure largely determines patterns of diffusion. this literature presents a rich understanding of questions such as: "how densely connected does a society have to be in order to have an infection reach a nontrivial fraction of its members?," "how does this depend on the infectiousness of the disease?," "how does it depend on the particulars of the social network in place?," "who is most likely to become infected?," and "how widespread is an infection likely to be?," among others. the results on this apply beyond infectious diseases, and touch upon issues ranging from the spread of information to the proliferation of ideas. while such epidemiological models provide a useful look at some types of diffusion, there are many economically relevant applications in which a different modeling approach is needed, and, in particular, where the interaction between individuals requires a game theoretic analysis. in fact, though disease and the transmission of certain ideas and bits of information can be modeled through mechanical or purely probabilistic sorts of diffusion processes, there are other important situations where individuals take decisions and care about how their social neighbors or peers behave. this applies to decisions of which products to buy, which technology to adopt, whether or not to become educated, whether to learn a language, how to vote, and so forth. such interactions involve equilibrium considerations and often have multiple potential outcomes. for example, an agent might care about the proportion of neighbors adopting a given action, or might require some threshold of stimulus before becoming convinced to take an action, or might want to take an action that is different from that of his or her neighbors (e.g., free-riding on their information gathering if they do gather information, but gathering information him or herself if neighbors do not). here we provide an overview of how the recent literature has modeled such interactions, and how it has been able to meld social structure with predictions of behavior. there is a large body of work that identifies the effects of social interactions on a wide range of applications spanning fields: epidemiology, marketing, labor markets, political science, and agriculture are only a few. 
while some of the empirical tools for the analysis of social interaction effects have been described in block, blume, durlauf, and ioannides (chapter 18, this volume), and many of their implementations for research on housing decisions, labor markets, addictions, and more, have been discussed in ioannides (chapter 25, this volume), epple and romano (chapter 20, this volume), topa (chapter 22, this volume), fafchamps (chapter 24, this volume), jackson (chapter 12, this volume), and munshi (chapter 23, this volume), we now describe empirical work that ties directly to the models that are discussed in the current chapter. in particular, we discuss several examples of studies that illustrate how social structure impacts outcomes and behaviors. the relevant studies are broadly divided into two classes. first, there are cross-sectional studies that concentrate on a snapshot of time and look for correlations between social interaction patterns and observable behaviors. this class relates to the analysis below of strategic games played by a network of agents. while it can be very useful in identifying correlations, it is important to keep in mind that identifying causation is complicated without fortuitous exogenous variation or structural underpinnings. second, there are longitudinal studies that take advantage of the inherent dynamics of diffusion. such studies have generated a number of interesting observations and are more suggestive of some of the insights the theoretical literature on diffusion has generated. nonetheless, these sorts of studies also face challenges in identifying causation because of potential unobserved factors that may contemporaneously influence linked individuals. the empirical work on these topics is immense and we provide here only a narrow look at the work that is representative of the type of studies that have been pursued and relate to the focus of this chapter. studies that are based on observations at one point of time most often compare the frequency of a certain behavior or outcome across individuals who are connected as opposed to ones that are not. for example, glaeser, sacerdote, and scheinkman (1996) showed that the structure of social interactions can help explain the cross-city variance in crime rates in the u.s.; bearman, moody, and stovel (2004) examined the network of romantic connections in high-school, and its link to phenomena such as the spread of sexually transmitted diseases (see the next subsection for a discussion of the spread of epidemics). such studies provide important evidence for the correlation of behaviors with characteristics of individuals' connections. in the case of diseases, they provide some direct evidence for diffusion patterns. with regards to labor markets, there is a rich set of studies showing the importance of social connections for diffusing information about job openings, dating back to rees (1966) and rees and schultz (1970). influential studies by granovetter (1973, 1985, 1995) show that even casual or infrequent acquaintances (weak ties) can play a role in diffusing information. those studies were based on interviews that directly ask subjects how they obtained information about their current jobs.
other studies, based on outcomes, such as topa (2001), conley and topa (2002) , and bayer, ross, and topa (2008) , identify local correlations in employment status within neighborhoods in chicago, and consider neighborhoods that go beyond the geographic but also include proximity in other socioeconomic dimensions, examining the extent to which local interactions are important for employment outcomes. bandiera, barankay, and rasul (2008) create a bridge between network formation (namely, the creation of friendships amongst fruit pickers) and the effectiveness of different labor contracts. the extensive literature on networks in labor markets 1 documents the important role of social connections in transmitting information about jobs, and also differentiates between different types of social contacts and shows that even weak ties can be important in relaying information. there is further (and earlier) research that examines the different roles of individuals in diffusion. important work by katz and lazarsfeld (1955) (building on earlier studies of lazarsfeld, berelson, and gaudet (1944) , merton (1948) , and others), identifies the role of "opinion leaders" in the formation of various beliefs and opinions. individuals are heterogeneous (at least in behaviors), and some specialize in becoming well informed on certain subjects, and then information and opinions diffuse to other less informed individuals via conversations with these opinion leaders. lazarsfeld, berelson, and gaudet (1944) study voting decisions in an ohio town during the 1940 u.s. presidential campaign, and document the presence and significance of such opinion leaders. katz and lazarsfeld (1955) interviewed women in decatur, illinois, and asked about a number of things such as their views on household goods, fashion, movies, and local public affairs. when women showed a change in opinion in follow-up interviews, katz and lazarsfeld traced influences that led to the change in opinion, again finding evidence for the presence of opinion leaders. diffusion of new products is understandably a topic of much research. rogers (1995) discusses numerous studies illustrating the impacts of social interactions on the diffusion of new products, and suggests various factors that impact which products succeed and which products fail. for example, related to the idea of opinion leaders, feick and price (1987) surveyed 1531 households and provided evidence that consumers recognize and make use of particular individuals in their social network termed "market mavens," those who have a high propensity to provide marketplace and shopping information. whether or not products reach such mavens can influence the success of a product, independently of the product's quality. tucker (2008) uses micro-data on the adoption and use of a new video-messaging technology in an investment bank consisting of 2118 employees. tucker notes the effects of the underlying network in that employees follow the actions of those who either have formal power, or informal influence (which is, to some extent, endogenous to a social network). in the political context, there are several studies focusing on the social sources of information electors choose, as well as on the selective mis-perception of social information they are exposed to. a prime example of such a collection of studies is huckfeldt and sprague (1995), who concentrated on the social structure in south bend, indiana, during the 1984 elections. 
they illustrated the likelihood of socially connected individuals to hold similar political affiliations. in fact, the phenomenon of individuals connecting to individuals who are similar to them is observed across a wide array of attributes and is termed by sociologists homophily (for overviews see mcpherson, smith-lovin, and cook, 2001, jackson, 2007 , as well as the discussion of homophily in jackson, chapter 12 in this volume). while cross-sectional studies are tremendously interesting in that they suggest dimensions on which social interactions may have an impact, they face many empirical challenges. most notably, correlations between behaviors and outcomes of individuals and their peers may be driven by common unobservables and therefore be spurious. given the strong homophily patterns in many social interactions, individuals who associate with each other often have common unobserved traits, which could lead them to similar behaviors. this makes it difficult to draw (causal) conclusions from empirical analysis of the social impact on diffusion of behaviors based on cross-sectional data. 2 given some of the challenges with causal inference based on pure observation, laboratory experiments and field experiments are quite useful in eliciting the effects of real-world networks on fully controlled strategic interactions, and are being increasingly utilized. as an example, leider, mobius, rosenblat, and do (2009) elicited the friendship network among undergraduates at a u.s. college and illustrated how altruism varies as a function of social proximity. in a similar setup, goeree, mcconnell, mitchell, tromp, and yariv (2010) elicited the friendship network in an all-girls school in pasadena, ca, together with girls' characteristics and later ran dictator games with recipients who varied in social distance. they identified a "1/d law of giving," in that the percentage given to a friend was inversely related to her social distance in the network. 3 various field experiments, such as those by duflo and saez (2003) , karlan, mobius, rosenblat, and szeidl (2009), dupas (2010) , beaman and magruder (2010) , and feigenberg, field, and pande (2010) , also provide some control over the process, while working with real-world network structures to examine network influences on behavior. 4 another approach that can be taken to infer causal relationships is via structural modeling. as an example, one can examine the implications of a particular diffusion model for the patterns of adoption that should be observed. one can then infer characteristics of the process by fitting the process parameters to best match the observed outcomes in terms of behavior. for instance, banerjee, chandrasekhar, duflo, and jackson (2010) use such an approach in a study of the diffusion of microfinance participation in rural indian villages. using a model of diffusion that incorporates both information and peer effects, they then fit the model to infer the relative importance of information diffusion versus peer influences in accounting for differences in microfinance participation rates across villages. of course, in such an approach one is only as confident in the causal inference as one is confident that the model is capturing the essential underpinnings of the diffusion process. the types of conclusions that have been reached from these cross sectional studies can be roughly summarized as follows. first, in a wide variety of settings, associated individuals tend to have correlated actions and opinions. 
this does not necessarily embody diffusion or causation, but as discussed in the longitudinal section below, there is significant evidence of social influence in diffusion patterns as well. second, individuals tend to associate with others who are similar to themselves, in terms of beliefs and opinions. this has an impact on the structure of social interactions, and can affect diffusion. it also represents an empirical quandary of the extent to which social structure influences opinions and behavior as opposed to the reverse (that can partly be sorted out with careful analysis of longitudinal data). third, individuals fill different roles in a society, with some acting as "opinion leaders," and being key conduits of information and potential catalysts for diffusion. longitudinal data can be especially important in diffusion studies, as they provide information on how opinions and behaviors move through a society over time. they also help sort out issues of causation as well as supply-specific information about the extent to which behaviors and opinions are adopted dynamically, and by whom. such data can be especially important in going beyond the documentation of correlation between social connections and behaviors, and illustrating that social links are truly the conduits for information and diffusion if one is careful to track what is observed by whom at what point in time, and can measure the resulting changes in behavior. for example, conley and udry (2008) show that pineapple growers in ghana tend to follow those farmers who succeed in changing their levels of use of fertilizers. through careful examination of local ties, and the timing of different actions, they trace the influence of the outcome of one farmer's crop on subsequent behavior of other farmers. more generally, diffusion of new technologies is extremely important when looking at transitions in agriculture. seminal studies by ryan and gross (1943) and griliches (1957) examined the effects of social connections on the adoption of a new behavior, specifically the adoption of hybrid corn in the u.s. looking at aggregate adoption rates in different states, these authors illustrated that the diffusion of hybrid corn followed an s-shape curve over time: starting out slowly, accelerating, and then ultimately decelerating. 5 foster and rosenzweig (1995) collected household-level panel data from a representative sample of rural indian households having to do with the adoption and profitability of high-yielding seed varieties (associated with the green revolution). they identified significant learning-by-doing, where some of the learning was through neighbors' experience. in fact, the observation that adoption rates of new technologies, products, or behaviors exhibit s-shaped curves can be traced to very early studies, such as tarde (1903) , who discussed the importance of imitation in adoption. such patterns are found across many applications (see mahajan and peterson (1985) and rogers (1995) ). understanding diffusion is particularly important for epidemiology and medicine for several reasons. for one, it is important to understand how different types of diseases spread in a population. in addition, it is crucial to examine how new treatments get adopted. colizza, barrat, barthelemy, and vespignani (2006, 2007) tracked the spread of severe acute respiratory syndrome (sars) across the world combining census data with data on almost all air transit during the years 2002-2003. 
they illustrated the importance of structures of long-range transit networks for the spread of an epidemic. coleman, katz, and menzel (1966) is one of the first studies to document the role of social networks in diffusion processes. the study looked at the adoption of a new drug (tetracycline) by doctors and highlighted two observations. first, as with hybrid corn, adoption rates followed an s-shape curve over time. second, adoption rates depended on the density of social interactions. doctors with more contacts (measured according to the trust placed in them by other doctors) adopted at higher rates and earlier in time. 6 diffusion can occur in many different arenas of human behavior. for example christakis and fowler (2007) document influences of social contacts on obesity levels. they studied the social network of 12,067 individuals in the u.s. assessed repeatedly from 1971 to 2003 as part of the framingham heart study. concentrating on bodymass index, christakis and fowler found that a person's chances of becoming obese increased by 57% if he or she had a friend who became obese, by 40% if he or she had a sibling who became obese, and by 37% if they had a spouse who became obese in a previous period. the study controls for various selection effects, and takes advantage of the direction of friendship nominations to help sort out causation. for example, christakis and fowler find a significantly higher increase of an individual's body mass index in reaction to the obesity of someone that the individual named as a friend compared to someone who had named the individual as a friend. this is one method of sorting out causation, since if unobserved influences that were common to the agents were at work, then the direction of who mentioned the other as a friend would not matter, whereas direction would matter if it indicated which individuals react to which others. based on this analysis, christakis and fowler conclude that obesity spreads very much like an epidemic with the underlying social structure appearing to play an important role. it is worth emphasizing that even with longitudinal studies, one still has to be cautious in drawing causal inferences. the problem of homophily still looms, as linked individuals tend to have common characteristics and so may be influenced by common unobserved factors, for example, both being exposed to some external stimulus (such as advertising) at the same time. this then makes it appear as if one agent's behavior closely followed another's, even when it may simply be due to both having experienced a common external event that prompted their behaviors. aral, muchnik, and sundararajan (2009) provide an idea of how large this effect can be, by carefully tracking individual characteristics and then using propensity scores (likelihoods of having neighbors with certain behaviors) to illustrate the extent to which one can over-estimate diffusion effects by not accounting for common backgrounds of connected individuals. homophily not only suggests that linked individuals might be exposed to common influences, it also makes it hard to disentangle which of the following two processes is at the root of observed similarities in behavior between connected agents. it could be that similar behavior in fact comes from a process of selection (assortative pairing), in which similarity precedes association. alternatively, it could be a consequence of a process of socialization, in which association leads to similarity. 
in that respect, tracking connections and behaviors over time is particularly useful. kandel (1978) concentrated on adolescent friendship pairs and examined the levels of homophily on four attributes (frequency of current marijuana use, level of educational aspirations, political orientation, and participation in minor delinquency) at various stages of friendship formation and dissolution. she noted that observed homophily in friendship dyads resulted from a significant combination of both types of processes, so that individuals emulated their friends, but also tended to drop friendships with those more different from themselves and add new friendships to those more similar to themselves. 7 in summary, let us mention a few of the important conclusions obtained from studies of diffusion. first, not only are behaviors across socially connected individuals correlated, but individuals do influence each other. while this may sound straightforward, it takes careful control to ensure that it is not unobserved correlated traits or influences that lead to similar actions by connected individuals, as well as an analysis of similarities between friends that can lead to correlations in their preferences and the things that influence them. second, in various settings, more socially connected individuals adopt new behaviors and products earlier and at higher rates. third, diffusion exhibits specific patterns over time, and specifically there are many settings where an "s"-shaped pattern emerges, with adoption starting slowly, then accelerating, and eventually asymptoting. fourth, many diffusion processes are affected by the specifics of the patterns of interaction. we now turn to discussing various models of diffusion. as should be clear from our description of the empirical work on diffusion and behavior, models can help greatly in clarifying the tensions at play. given the issues associated with the endogeneity of social relationships, and the substantial homophily that may lead to correlated behaviors among social neighbors, it is critical to have models that help predict how behavior should evolve and how it interacts with the social structure in place. we start with some of the early models that do not account for the underlying network architecture per-se. these models incorporate the empirical observations regarding social influence through the particular dynamics assumed, or preferences posited, and generate predictions matching the aggregate empirical observations regarding diffusion over time of products, diseases, or behavior. for example, the so-called s-shaped adoption curves. after describing these models, we return to explicitly capturing the role of social networks. one of the earliest and still widely used models of diffusion is the bass (1969) model. this is a parsimonious model, which can be thought of as a "macro" model: it makes predictions about aggregate behavior in terms of the percentage of potential adopters of a product or behavior who will have adopted by a given time. the current rate of change of adoption depends on the current level and two critical parameters. these two parameters are linked to the rate at which people innovate or adopt on their own, and the rate at which they imitate or adopt because others have, thereby putting into (theoretical) force the empirical observation regarding peers' influence. 
if we let g(t) be the percentage of agents who have adopted by time t, and m be the fraction of agents in the population who are potential adopters, a discrete time version of the bass model is characterized by the difference equation
$$g(t) - g(t-1) = p\,\big(m - g(t-1)\big) + q\,\big(m - g(t-1)\big)\,\frac{g(t-1)}{m},$$
where p is a rate of innovation and q is a rate of imitation. to glean some intuition, note that the expression $p\,(m - g(t-1))$ represents the fraction of people who have not yet adopted and might potentially do so, times the rate of spontaneous adoption. in the expression $q\,(m - g(t-1))\,\frac{g(t-1)}{m}$, the rate of imitation is multiplied by two factors. the first factor, $(m - g(t-1))$, is the fraction of people who have not yet adopted and may still do so. the second expression, $\frac{g(t-1)}{m}$, is the relative fraction of potential adopters who are around to imitate. if we set m equal to 1, and look at a continuous time version of the above difference equation, we get
$$\dot g(t) = \big(p + q\,g(t)\big)\big(1 - g(t)\big), \qquad (1)$$
where $\dot g(t)$ is the rate of diffusion (the rate of change of g). solving this when p > 0 and setting the initial set of adopters at 0, g(0) = 0, leads to the following expression:
$$g(t) = \frac{1 - e^{-(p+q)t}}{1 + \frac{q}{p}\,e^{-(p+q)t}}.$$
this is a fairly flexible formula that works well at fitting time series data of innovations. by estimating p and q from existing data, one can also make forecasts of future diffusion. it has been used extensively in marketing and for the general analysis of diffusion (e.g., rogers (1995)), and has spawned many extensions and variations. 8 if q is large enough, 9 then there is a sufficient imitation/social effect, which means that the rate of adoption accelerates after it begins, and so g(t) is s-shaped (see figure 1), matching one of the main insights of the longitudinal empirical studies on diffusion discussed above. the bass model provides a clear intuition for why adoption curves would be s-shaped. indeed, when the adoption process begins, imitation plays a minor role (relative to innovation) since not many agents have adopted yet and so the volume of adopters grows slowly. as the number of adopters increases, the process starts to accelerate as now innovators are joined by imitators. the process eventually starts to slow down, in part simply because there are fewer agents left to adopt (the term $1 - g(t)$ in (1) eventually becomes small). thus, we see a process that starts out slowly, then accelerates, and then eventually slows and asymptotes.
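to make these dynamics concrete, here is a minimal python sketch (not from the chapter) that iterates the bass difference equation and evaluates the closed-form solution for g(t); the parameter values p = 0.01 and q = 0.4 are illustrative assumptions.

```python
import math

def bass_discrete(p, q, m=1.0, periods=50):
    """Iterate the Bass difference equation
    g(t) - g(t-1) = p*(m - g(t-1)) + q*(m - g(t-1))*g(t-1)/m."""
    g = [0.0]
    for _ in range(periods):
        prev = g[-1]
        g.append(prev + p * (m - prev) + q * (m - prev) * prev / m)
    return g

def bass_closed_form(p, q, t):
    """Closed-form adoption share with m = 1 and g(0) = 0."""
    e = math.exp(-(p + q) * t)
    return (1.0 - e) / (1.0 + (q / p) * e)

if __name__ == "__main__":
    p, q = 0.01, 0.4   # illustrative innovation and imitation rates
    path = bass_discrete(p, q)
    increments = [b - a for a, b in zip(path, path[1:])]
    # with q large relative to p, per-period increments rise and then fall,
    # tracing the s-shaped adoption curve discussed in the text
    print(max(increments) > increments[0] and max(increments) > increments[-1])
    print(round(bass_closed_form(p, q, 10.0), 3))
```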
the bass model described above is mechanical in that adopters and imitators are randomly determined; they do not choose actions strategically. the empirical observation that individuals influence each other through social contact can be derived through agents' preferences, rather than through some exogenously specified dynamics. diffusion in a strategic context was first studied without a specific structure for interactions. broadly speaking, there were two approaches taken in this early literature. in the first, all agents are connected to one another (that is, they form a complete network). effectively, this corresponds to a standard multi-agent game in which payoffs to each player depend on the entire profile of actions played in the population. the second approach has been to look at interactions in which agents are matched to partners in a random fashion. diffusion on complete networks. granovetter (1978) considered a model in which n agents are all connected to one another and each agent chooses one of two actions: 0 or 1. associated with each agent i is a number $n_i$. this is a threshold such that if at least $n_i$ other agents take action 1 then i prefers action 1 to action 0, and if fewer than $n_i$ other agents take action 1 then agent i prefers to take action 0. the game exhibits what are known as strategic complementarities. for instance, suppose that the utility of agent i faced with a profile of actions $(x_1, \dots, x_n) \in \{0,1\}^n$ is described by:
$$u_i(x_1, \dots, x_n) = x_i \left( \frac{\sum_{j \neq i} x_j}{n-1} - c_i \right), \qquad (2)$$
where $c_i$ is randomly drawn from a distribution f over [0,1]. $c_i$ can be thought of as a cost that agent i experiences upon choosing action 1 (e.g., a one-time switching cost from one technology to the other, or potential time costs of joining a political revolt, etc.). the utility of agent i is normalized to 0 when choosing the action 0. when choosing the action 1, agent i experiences a benefit proportional to the fraction of other agents choosing the action 1 and a cost of $c_i$. granovetter considered a dynamic model in which at each stage agents best respond to the previous period's distribution of actions. if in period t there was a fraction $x_t$ of agents choosing the action 1, then in period t + 1 an agent i chooses action 1 if and only if his or her cost is lower than $\frac{n x_t - x_t^i}{n-1}$, the fraction of other agents taking action 1 in the last period. for a large population, the dynamics are then approximated by $x_{t+1} = f(x_t)$, and any x satisfying $x = f(x)$ corresponds to an (approximate) equilibrium of a large population. the shape of the distribution f determines which equilibria are tipping points: equilibria such that only a slight addition to the fraction of agents choosing the action 1 shifts the population, under the best response dynamics, to the next higher equilibrium level of adoption (we return to a discussion of tipping and stable points when we consider a more general model of strategic interactions on networks below). note that while in the bass model the diffusion path was determined by g(t), the fraction of adopters as a function of time, here it is easier to work with f(x), corresponding to the fraction of adopters as a function of the previous period's fraction x. although granovetter (1978) does not examine conditions under which the time series will exhibit attributes like the s-shape that we discussed above, by using techniques from jackson and yariv (2007) we can derive such results, as we now discuss. keeping track of time in discrete periods (a continuous time analog is straightforward), the level of change of adoption in the society is given by
$$D(x_t) = f(x_t) - x_t.$$
thus, to derive an s-shape, we need this quantity to initially be increasing, and then eventually to decrease. assuming differentiability of f, this corresponds to the derivative of $D(x_t)$ being positive up to some x and then negative. the derivative of $f(x) - x$ is $f'(x) - 1$, and having an s-shape corresponds to $f'$ being greater than 1 up to some point and then less than 1 beyond that point. for instance, if f is concave with an initial slope greater than 1 and an eventual slope less than 1, this is satisfied. note that the s-shape of adoption over time does not translate into an s-shape of f, but rather a sort of concavity. 10 the idea is that we initially need a rapid level of change, which corresponds to an initially high slope of f, and then a slowing down, which corresponds to a lower slope of f.
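a small sketch of granovetter-style best-response dynamics, iterating $x_{t+1} = f(x_t)$; the particular cost distribution below, chosen to be concave with an initial slope above 1 and an eventual slope below 1, is an assumption for demonstration, not taken from the chapter.

```python
def best_response_dynamics(F, x0=0.01, periods=60):
    """Iterate x_{t+1} = F(x_t), where F(x) is the fraction of agents whose
    cost lies below x, i.e. those who adopt when a fraction x adopted last period."""
    xs = [x0]
    for _ in range(periods):
        xs.append(F(xs[-1]))
    return xs

# illustrative concave cost distribution with F'(0) > 1 and eventual slope < 1,
# the shape the text associates with s-shaped adoption paths
def F(x):
    return min(1.0, 1.6 * x - 0.7 * x * x)

path = best_response_dynamics(F)
increments = [b - a for a, b in zip(path, path[1:])]
# adoption accelerates and then decelerates toward the rest point x = F(x)
print(round(path[-1], 3))
```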
fashions and random matching. a different approach than that of the bass model is taken by pesendorfer (1995), who considers a model in which individuals are randomly matched and new fashions serve as signaling instruments for the creation of matches. he identifies particular matching technologies that generate fashion cycles. pesendorfer describes the spread of a new fashion as well as its decay over time. in pesendorfer's model, the price of the design falls as it spreads across the population. once sufficiently many consumers own the design, it is profitable to create a new design and thereby render the old design obsolete. in particular, demand for any new innovation eventually levels off as in the above two models. information cascades and learning. another influence on collective behavior derives from social learning. this can happen without any direct complementarities in actions, but due to information flow about the potential payoffs from different behaviors. if people discuss which products are worth buying, or which technologies are worth adopting, books worth reading, and so forth, even without any complementarities in behaviors, one can end up with cascades in behavior, as people infer information from others' behaviors and can (rationally) imitate them. as effects along these lines are discussed at some length in jackson (chapter 12, this volume) and goyal (chapter 15, this volume), we will not detail them here. we only stress that pure information transmission can lead to diffusion of behaviors. we now turn to models that explicitly incorporate social structure in examining diffusion patterns. we start with models that stem mostly from the epidemiology literature and account for the underlying social network, but are mechanical in terms of the way that disease spreads from one individual to another (much like the bass model described above). we then proceed to models in which players make choices that depend on their neighbors' actions as embedded in a social network; for instance, only adopting an action if a certain proportion of neighbors adopt as well (as in granovetter's setup), or possibly not adopting an action if enough neighbors do so. many models of diffusion and strategic interaction on networks have the following common elements. there is a finite set of agents $N = \{1, \dots, n\}$. agents are connected by a (possibly directed) network $g \in \{0,1\}^{n \times n}$. we let $N_i(g) \equiv \{j : g_{ij} = 1\}$ be the neighbors of i. the degree of a node i is the number of her neighbors, $d_i \equiv |N_i(g)|$. when links are determined through some random process, it is often useful to summarize the process by the resulting distribution of degrees p, where p(d) denotes the probability a random individual has a degree of d. 11, 12 each agent $i \in N$ takes an action $x_i$. in order to unify and simplify the description of various models, we focus on binary actions, so that $x_i \in \{0,1\}$. actions can be metaphors for becoming "infected" or not, buying a new product or not, choosing one of two activities, and so forth. some basic insights about the extent to which behavior or an infection can spread in a society can be derived from random graph theory. random graph theory provides a tractable base for understanding characteristics important for diffusion, such as the structure and size of the components of a network, maximally connected subnetworks. 13 before presenting some results, let us talk through some of the ideas in the context of what is known as the reed-frost model. 14 consider, for example, the spread of a disease. initially, some individuals in the society are infected through mutations of a germ or other exogenous sources. consequently, some of these individuals' neighbors are infected through contact, while others are not.
11 such a description is not complete, in that it does not specify the potential correlations between degrees of different individuals on the network. see galeotti, goyal, jackson, vega-redondo, and yariv (2010) for more details. 12 in principle, one would want to calibrate degree distributions with actual data. the literature on network formation, see bloch and dutta (chapter 16, this volume) and jackson (chapter 12, this volume), suggests some insights on plausible degree distributions p(d). 13 formally, these are the subnetworks induced by maximal sets $C \subseteq N$ of nodes such that any two distinct nodes in $C$ are path connected within $C$. that is, for any $i, j \in C$, there exist $i_1, \dots, i_k \in C$ such that $g_{i i_1} = g_{i_1 i_2} = \dots = g_{i_{k-1} i_k} = g_{i_k j} = 1$. 14 see jackson (2008) for a more detailed discussion of this and related models.
this depends on how virulent the disease is, among other things. in this application, it makes sense (at least as a starting point) to assume that becoming infected or avoiding infection is not a choice; i.e., contagion here is nonstrategic. in the simplest model, there is a probability $\pi \geq 0$ that a given individual is immune (e.g., through vaccination or natural defenses). if an individual is not immune, it is assumed that he or she is sure to catch the disease if one of his or her neighbors ends up with the disease. in this case, in order to estimate the volume of those ultimately infected, we proceed in two steps, depicted in figure 2. first, we delete a fraction $\pi$ of the nodes that will never be infected (these correspond to the dotted nodes in the figure). then, we note that the components of the remaining network that contain the originally infected individuals comprise the full extent of the infection. in particular, if we can characterize what the components of the network look like after removing some portion of the nodes, we have an idea of the extent of the infection. in figure 2, we start with one large connected component (enclosed by a dotted line) and two small connected components. after removing the immune agents, there is still a large connected component (though smaller than before), and four small components. thus, the estimation of the extent of infection of the society is reduced to the estimation of the component structure of the network. a starting point for the formal analysis of this sort of model uses the canonical random network model, where links are formed independently, each with an identical probability p > 0 of being present. this is sometimes referred to as a "poisson random network" as its degree distribution is approximated by a poisson distribution if p is not excessively large; and has various other aliases such as an "erdős-rényi random graph," a "bernoulli random graph," or a "g(n,p)" random graph (see jackson, chapter 12 in this volume, for more background). ultimately, the analysis boils down to considering a network on $(1-\pi)n$ nodes with an independent link probability of p, and then measuring the size of the component containing a randomly chosen initially infected node. clearly, with a fixed set of nodes, and a positive probability p that lies strictly between 0 and 1, every conceivable network on the given set of nodes could arise. thus, in order to say something specific about the properties of the networks that are "most likely" to arise, one generally works with large n where reasoning based on laws of large numbers can be employed.
for example, if we think of letting n grow, we can ask for which p's (that are now dependent on n) a nonvanishing fraction of nodes will become infected with a probability bounded away from 0. so, let us consider a sequence of societies indexed by n and corresponding probabilities of links p(n). erdős and rényi (1959, 1960) proved a series of results that characterize some basic properties of such random graphs. in particular, 15
• the threshold for the existence of a "giant component," a component that contains a nontrivial fraction of the population, is 1/n, corresponding to an average degree of 1. that is, if p(n) over 1/n tends to infinity, then the probability of having a giant component tends to 1, while if p(n) over 1/n tends to 0, then the probability of having a giant component tends to 0.
• the threshold for the network to be connected (so that every two nodes have a path between them) is log(n)/n, corresponding to an average degree that is proportional to log(n).
the logic for the first threshold is easy to explain, though the proof is rather involved. to heuristically derive the threshold for the emergence of a giant component, consider following a link out of a given node. we ask whether or not one would expect to be able to find a link to another node from that one. if the expected degree is much smaller than 1, then following the few (if any) links from any given node is likely to lead to dead-ends. in contrast, when the expected degree is much higher than 1, then from any given node, one expects to be able to reach more nodes, and then even more nodes, and so forth, and so the component should expand outward. note that adjusting for the factor $\pi$ of immune nodes does not affect the above thresholds as they apply as limiting results, although the factor will be important for any fixed n. between these two thresholds, there is only one giant component, so that the next largest component is of a size that is a vanishing fraction of the giant component. this is intuitively clear, as to have two large components requires many links within each component but no links between the two components, which is an unlikely event. in that sense, the image that emerges from figure 2 of one large connected component is reasonably typical for a range of parameter values. these results then tell us that in a random network, if average degree is quite low (smaller than 1), then any initial infection is likely to die out. in contrast, if average degree is quite high (larger than log(n)), then any initial infection is likely to spread to all of the susceptible individuals, i.e., a fraction of $1-\pi$ of the population. in the intermediate range, there is a probability that the infection will die out and also a probability that it will infect a nontrivial, but limited, portion of the susceptible population. there, it can be shown that for such random networks and large n, the fraction of nodes in the giant component of susceptible nodes is roughly approximated by the nonzero q that solves
$$q = 1 - e^{-(1-\pi)\,n\,p\,q}.$$
here, q is an approximation of the probability of the infection spreading to a nontrivial fraction of nodes, and also of the percentage of susceptible nodes that would be infected. 16 this provides a rough idea of the type of results that can be derived from random graph theory.
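the nonzero q above can be found numerically by simple fixed-point iteration; the sketch below assumes the standard poisson approximation with average degree $c = (1-\pi)np$ among susceptible nodes, and the numerical values of n, p, and $\pi$ are illustrative.

```python
import math

def giant_component_fraction(c, tol=1e-12, max_iter=10000):
    """Solve q = 1 - exp(-c*q) for the nonzero root by fixed-point iteration.
    c is the average degree; for c <= 1 the only solution is q = 0."""
    q = 0.5  # start away from the trivial root q = 0
    for _ in range(max_iter):
        q_next = 1.0 - math.exp(-c * q)
        if abs(q_next - q) < tol:
            break
        q = q_next
    return q_next

n, p_link, pi_immune = 10000, 3.0 / 10000, 0.2   # illustrative values
c = (1 - pi_immune) * n * p_link                 # average degree among susceptibles
print(round(giant_component_fraction(c), 3))     # rough share of susceptibles infected
```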
there is much more that is known, as one can work with other models of random graphs (other than ones where each link has an identical probability), richer models of probabilistic infection between nodes, as well as derive more information about the potential distribution of infected individuals. it should also be emphasized that while the discussion here is in terms of "infection," the applications clearly extend to many of the other contexts we have been mentioning, such as the transmission of ideas and information. a fuller treatment of behaviors, where individual decisions depend in more complicated ways on neighbors' decisions, is given in section 4.3. the above analysis of diffusion presumes that once infected, a node eventually infects all of its susceptible neighbors. this misses important aspects of many applications. in terms of diseases, infected nodes can either recover and stop transmitting a disease, or die and completely disappear from the network. transmission will also generally be probabilistic, depending on the type of interaction and its extent. 17 similarly, if we think of behaviors, it might be that the likelihood that a node is still actively transmitting a bit of information to its neighbors decreases over time. ultimately, we will discuss models that allow for rather general strategic impact of peer behavior (a generalization of the approach taken by granovetter). but first we discuss some aspects of the epidemiology literature that takes steps forward in that direction by considering two alternative models that keep track of the state of nodes and are more explicitly dynamic. the common terminology for the possible states that a node can be in are: susceptible, where a node is not currently infected or transmitting a disease but can catch it; infected, where a node has a disease and can transmit it to its neighbors; and removed (or recovered), where a node has been infected but is no longer able to transmit the disease and cannot be re-infected.
16 again, see chapter 4 in jackson (2008) for more details. 17 probabilistic transmission is easily handled in the above model by simply adjusting the link probability to reflect the fact that some links might not transmit the disease.
the first of the leading models is the "sir" model (dating to kermack and mckendrick, 1927), where nodes are initially susceptible but can catch the disease from infected neighbors. once infected, a node continues to infect neighbors until it is randomly removed from the system. this fits well the biology of some childhood diseases, such as the chicken pox, where one can only be infected once. the other model is the "sis" model (see bailey, 1975), where once infected, nodes can randomly recover, but then they are susceptible again. this corresponds well with an assortment of bacterial infections, viruses, and flus, where one transitions back and forth between health and illness. the analysis of the sir model is a variant of the component-size analysis discussed above. the idea is that there is a random chance that an "infected" node infects a given "susceptible" neighbor before becoming "removed." roughly, one examines component structures in which instead of removing nodes randomly, one removes links randomly from the network. this results in variations on the above sorts of calculations, where there are adjusted thresholds for infection depending on the relative rates of how quickly infected nodes can infect their neighbors compared to how quickly they are removed.
in contrast, the sis model involves a different sort of analysis. the canonical version of that model is best viewed as one with a random matching process rather than a social network. in particular, suppose that a node i in each period will have interactions with $d_i$ other individuals from the population. recall our notation of p(d) describing the proportion of the population that has degree d (so d interactions per period). the matches are determined randomly, in such a way that if i is matched with j, then the probability that j has degree d > 0 is given by
$$\tilde p(d) = \frac{p(d)\,d}{\langle d \rangle}, \qquad (4)$$
where $\langle \cdot \rangle$ represents the expectation with respect to p. 18 this reflects the fact that an agent is proportionally more likely to be matched with other individuals who have lots of connections. to justify this formally, one needs an infinite population. indeed, with any finite population of agents with heterogeneous degrees, the emergent networks will generally exhibit some correlation between neighbors' degrees. 19 individuals who have high degrees will have more interactions per period and will generally be more likely to be infected at any given time. an important calculation then pertains to the chance that a given meeting will be with an infected individual. if the infection rate of degree d individuals is $\rho(d)$, the probability that any given meeting will be with an infected individual is $\theta$, where
$$\theta = \sum_d \tilde p(d)\,\rho(d). \qquad (5)$$
the chance of meeting an infected individual in a given encounter then differs from the average infection rate in the population, which is just $\rho = \sum_d p(d)\,\rho(d)$, because $\theta$ is weighted by the rate at which individuals meet each other.
18 we consider only individuals who have degree d > 0, as others do not participate in the society. 19 see the appendix of currarini, jackson, and pin (2009) for some details along this line.
a standard version of contagion that is commonly analyzed is one in which the probability of an agent of degree d becoming infected is
$$\nu\,\theta\,d,$$
where $\nu \in (0,1)$ is a rate of transmission of infection in a given period, and is small enough so that this probability is less than one. if $\nu$ is very small, this is an approximation of getting infected under d interactions with each having an (independent) probability $\theta$ of being infected and then conditionally (and independently) having a probability $\nu$ of getting infected through contact with a given infected individual. the last part of the model is that in any given period, an infected individual recovers and becomes susceptible with a probability $\delta \in (0,1)$. if such a system operates on a finite population, then eventually all agents will become susceptible and that would end the infection. if there is a small probability of a new mutation and infection in any given period, the system will be ergodic and always have some probability of future infection. to get a feeling for the long run outcomes in large societies, the literature has examined a steady state (i.e., a situation in which the system essentially remains constant) of a process that is idealized as operating on an infinite (continuous) population. formally, a steady-state is defined by having $\rho(d)$ be constant over time for each d. working with an approximation at the limit (termed a "mean-field" approximation that in this case can be justified with a continuum of agents, but with quite a bit of technical detail), a steady-state condition can be derived to be
$$\big(1 - \rho(d)\big)\,\nu\,\theta\,d = \rho(d)\,\delta$$
for each d.
$(1-\rho(d))\,\nu\,\theta\,d$ is the rate at which agents of degree d who were susceptible become infected and $\rho(d)\,\delta$ is the rate at which infected individuals of degree d recover. letting $\lambda = \frac{\nu}{\delta}$, it follows that
$$\rho(d) = \frac{\lambda\,\theta\,d}{1 + \lambda\,\theta\,d}. \qquad (8)$$
solving (5) and (8) simultaneously leads to a characterization of the steady-state $\theta$:
$$\theta = \sum_d \tilde p(d)\,\frac{\lambda\,\theta\,d}{1 + \lambda\,\theta\,d}. \qquad (9)$$
this system always has a solution, and therefore a steady-state, where $\theta = 0$ so there is no infection. it can also have other solutions under which $\theta$ is positive (but always below 1 if $\lambda$ is finite). unless p takes very specific forms, it can be difficult to find steady states $\theta > 0$ analytically. special cases have been analyzed, such as the case of a power distribution, where $p(d) = 2 d^{-3}$ (e.g., see pastor-satorras and vespignani (2000, 2001)). in that case, there is always a positive steady-state infection rate. more generally, lopez-pintado (2008) addresses the question of when it is that there will be a positive steady-state infection rate. to get some intuition for her results, let
$$h(\theta) = \sum_d \tilde p(d)\,\frac{\lambda\,\theta\,d}{1 + \lambda\,\theta\,d},$$
so that the equation $\theta = h(\theta)$ corresponds to steady states of the system. we can now extend the analysis of granovetter's (1978) model that we described above, with this richer model in which $h(\theta)$ accounts for network attributes. while the fixed-point equation identifying granovetter's stable points allowed for rather arbitrary diffusion patterns (depending on the cost distribution f), the function h has additional structure to it that we can explore. in particular, suppose we examine the infection rate that would result if we start at a rate of $\theta$ and then run the system on an infinite population for one period. noting that $h(0) = 0$, it is clear that 0 is always a fixed-point and thus a steady-state. since $h(1) < 1$, and h is increasing and strictly concave in $\theta$ (which is seen by examining its first and second derivatives), there can be at most one fixed-point besides 0. for there to be another fixed-point (steady-state) above $\theta = 0$, it must be that $h'(0)$ is above 1, or else, given the strict concavity, we would have $h(\theta) < \theta$ for all positive $\theta$. moreover, in cases where $h'(0) > 1$, a small perturbation away from a 0 infection rate will lead to increased infection. in the terminology we have introduced above, 0 would be a tipping point. since
$$h'(0) = \lambda\,\frac{\langle d^2 \rangle}{\langle d \rangle}, \qquad (12)$$
higher infection rates lead to the possibility of positive infection, as do degree distributions with high variances (relative to mean). the idea behind having a high variance is that there will be some "hub nodes" with high degree, who can foster contagion. going back to our empirical insights, this analysis fits the observations that highly-linked individuals are more likely to get infected and experience speedier diffusion. whether the aggregate behavior exhibits the s-shape that is common in many real-world diffusion processes will depend on the particulars of h, much in the same way that we discussed how the s-shape in granovetter's model depends on the shape of the distribution of costs f in that model. here, things are slightly complicated since h is a function of $\theta$, which is the probability of infection of a neighbor, and not the overall probability of infection of the population. thus, one needs to further translate how various $\theta$'s over time translate into population fractions that are infected. beyond the extant empirical studies, this analysis provides some intuitions behind what is needed for an infection to be possible. it does not, however, provide an idea of how extensive the infection spread will be and how that depends on network structure.
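a sketch that solves the steady-state equation (9) by fixed-point iteration for an illustrative truncated power-law degree distribution; the distribution, its cutoff at degree 100, and the values of $\lambda$ are assumptions for demonstration, not taken from the chapter.

```python
def sis_steady_state(p, lam, tol=1e-12, max_iter=100000):
    """Solve theta = sum_d ptilde(d) * lam*theta*d / (1 + lam*theta*d),
    where ptilde(d) = d*p(d)/<d> is the degree distribution of a randomly met neighbor."""
    mean_d = sum(d * w for d, w in p.items())
    ptilde = {d: d * w / mean_d for d, w in p.items()}
    theta = 0.5  # start away from the infection-free steady state theta = 0
    for _ in range(max_iter):
        new = sum(w * lam * theta * d / (1.0 + lam * theta * d)
                  for d, w in ptilde.items())
        if abs(new - theta) < tol:
            break
        theta = new
    return theta

# illustrative truncated power-law degree distribution p(d) proportional to d^-3
weights = {d: d ** -3.0 for d in range(1, 101)}
total = sum(weights.values())
p = {d: w / total for d, w in weights.items()}

for lam in (0.05, 0.2, 0.8):   # ratio of transmission to recovery rates
    print(lam, round(sis_steady_state(p, lam), 4))
```

for small $\lambda$ the iteration collapses to the infection-free steady state, while above the threshold implied by (12) a positive steady state emerges.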
while this does not boil down to as simple a comparison as (12), there is still much that can be deduced using (9), as shown by jackson and rogers (2007). while one cannot always solve (9) directly, notice that it can be rewritten as
$$\theta = \sum_d p(d)\,\frac{\lambda\,\theta\,d^2}{\langle d \rangle\,(\lambda\,\theta\,d + 1)},$$
and that $\frac{\lambda\,\theta\,d^2}{\langle d \rangle\,(\lambda\,\theta\,d + 1)}$ is an increasing and convex function of d. therefore, the right hand side of the above equality can be ordered when comparing different degree distributions in the sense of stochastic dominance (we will return to these sorts of comparisons in some of the models we discuss below). the interesting conclusion regarding steady-state infection rates is that they depend on network structure in ways that are very different at low levels of the infection rate $\lambda$ compared to high levels. while the above models provide some ideas about how social structure impacts diffusion, they are limited to settings where, roughly speaking, the probability that a given individual adopts a behavior is simply proportional to the infection rate of neighbors. especially when it comes to situations in which opinions or technologies are adopted, purchasing decisions are made, etc., an individual's decision can depend in much more complicated ways on the behavior of his or her neighbors. such interaction naturally calls on game theory as a tool for modeling these richer interactions. we start with static models of interactions on networks that allow for a rather general impact of peers' actions on one's own optimal choices. the first model that explicitly examines games played on a network is the model of "graphical games" as introduced by kearns, littman, and singh (2001), and analyzed by kakade, kearns, langford, and ortiz (2003), among others. the underlying premise in the graphical games model is that agents' payoffs depend on their own actions and the actions of their direct neighbors, as determined by the network of connections. 20 formally, the payoff structure underlying a graphical game is as follows. the payoff to each player i when the profile of actions is $x = (x_1, \dots, x_n)$ is
$$u_i\big(x_i, x_{N_i(g)}\big),$$
where $x_{N_i(g)}$ is the profile of actions taken by the neighbors of i in the network g. most of the empirical applications discussed earlier entailed agents responding to neighbors' actions in roughly one of two ways. in some contexts, such as those pertaining to the adoption of a new product or new agricultural grain, decisions to join the workforce, or to join a criminal network, agents conceivably gain more from a particular action the greater is the volume of peers who choose a similar action. that is, payoffs exhibit strategic complementarities. in other contexts, such as experimentation on a new drug, or contribution to a public good, when an agent's neighbors choose a particular action, the relative payoff the agent gains from choosing a similar action decreases, and there is strategic substitutability. the graphical games environment allows for the analysis of both types of setups, as the following example (taken from galeotti, goyal, jackson, vega-redondo, and yariv (2010)) illustrates. example 1 (payoffs depend on the sum of actions) player i's payoff function when he or she chooses $x_i$ and her k neighbors choose the profile $(x_1, \dots, x_k)$ is:
$$u_i\big(x_i, x_1, \dots, x_k\big) = f\Big(x_i + \lambda \sum_{j=1}^{k} x_j\Big) - c(x_i), \qquad (13)$$
where $f(\cdot)$ is nondecreasing and $c(\cdot)$ is a "cost" function associated with own effort (more general but much in the spirit of (2)). the parameter $\lambda \in \mathbb{R}$ determines the nature of the externality across players' actions. the shape of f and the sign of $\lambda$ determine the effects of neighbors' action choices on one's own optimal choice.
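a small sketch of the example-1 payoff with binary actions, showing how the gain from adopting moves with the number of adopting neighbors; the concave and convex choices of f and the cost level are illustrative assumptions.

```python
def incentive_to_adopt(f, lam, c, neighbor_sum):
    """Payoff gain from choosing x_i = 1 over x_i = 0, given neighbors' total action,
    under u_i = f(x_i + lam * neighbor_sum) - c * x_i with a linear cost."""
    return f(1 + lam * neighbor_sum) - f(lam * neighbor_sum) - c

concave_f = lambda y: y ** 0.5   # local-public-good flavour: lam * f'' < 0
convex_f = lambda y: y ** 2.0    # complementarities flavour: lam * f'' > 0

for k in range(4):               # number of adopting neighbors
    print(k,
          round(incentive_to_adopt(concave_f, 1.0, 0.1, k), 3),   # decreasing in k: substitutes
          round(incentive_to_adopt(convex_f, 1.0, 0.1, k), 3))    # increasing in k: complements
```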
in particular, the example yields strict strategic substitutes (complements) if, assuming differentiability, $\lambda f''$ is negative (positive). there are several papers that analyze graphical games for particular choices of f and $\lambda$. to mention a few examples, the case where f is concave, $\lambda = 1$, and $c(\cdot)$ is increasing and linear corresponds to the case of information sharing as a local public good studied by bramoullé and kranton (2007), where actions are strategic substitutes. in contrast, if $\lambda = 1$, but f is convex (with $c'' > f'' > 0$), we obtain a model with strategic complements, as proposed by goyal and moraga-gonzalez (2001) to study collaboration among local monopolies. in fact, the formulation in (13) is general enough to accommodate numerous further examples in the literature such as human capital investment (calvó-armengol and jackson (2009)), crime and other networks (ballester, calvó-armengol, and zenou (2006)), some coordination problems (ellison (1993)), and the onset of social unrest (chwe (2000)). the computer science literature (e.g., the literature following kearns, littman, and singh (2001), and analyzed by kakade, kearns, langford, and ortiz (2003)) has focused predominantly on the question of when an efficient (polynomial-time) algorithm can be provided to compute nash equilibria of graphical games. it has not had much to say about the properties of equilibria, which is important when thinking about applying such models to analyze diffusion in the presence of strategic interaction. in contrast, the economics literature has concentrated on characterizing equilibrium outcomes for particular applications, and deriving general comparative statics with respect to agents' positions in a network and with respect to the network architecture itself. the information players hold regarding the underlying network (namely, whether they are fully informed of the entire set of connections in the population, or only of connections in some local neighborhood) ends up playing a crucial role in the scope of predictions generated by network game models. importantly, graphical games are ones in which agents have complete information regarding the networks in place. consequently, such models suffer from inherent multiplicity problems, as clearly illustrated in the following example. it is based on a variation of (13), which is similar to a model analyzed by bramoullé and kranton (2007). example 2 (multiplicity - complete information) suppose that in (13), we set $\lambda = 1$, choose $x_i \in \{0,1\}$, and have
$$f\Big(x_i + \sum_{j=1}^{k} x_j\Big) = \min\Big\{x_i + \sum_{j=1}^{k} x_j,\ 1\Big\}$$
and $c(x_i) \equiv c\,x_i$, where 0 < c < 1. this game, often labeled the best-shot public goods game, may be viewed as a game of local public-good provision. each agent would choose the action 1 (say, experimenting with a new grain, or buying a product that can be shared with one's friends) if they were alone (or no one else experimented), but would prefer that one of their neighbors incur the cost c that the action 1 entails (when experimentation is observed publicly). effectively, an agent just needs at least one agent in his or her neighborhood to take action 1 to enjoy its full benefits, but prefers that it be someone else given that the action is costly and there is no additional benefit beyond one person taking the action. note that, since c < 1, in any (pure strategy) nash equilibrium, for any player i with k neighbors, it must be the case that at least one agent in the neighborhood (possibly i him- or herself) chooses the action 1. that is, if the chosen profile is $(x_1, \dots, x_k)$, then $x_i + \sum_{j=1}^{k} x_j \geq 1$.
in fact, there is a very rich set of equilibria in this game. to see this, consider a star network and note that there exist two equilibria, one in which the center chooses 0 and the spokes choose 1, and a second equilibrium in which the spoke players choose 0 while the center chooses 1. figure 3 illustrates these two equilibria. in the first, depicted in the left panel of the figure, the center earns more than the spoke players, while in the second equilibrium (in the right panel) it is the other way round. even in the simplest network structures equilibrium multiplicity may arise, and the relation between network architecture, equilibrium actions, and systematic patterns can be difficult to discover. while the complete information regarding the structure of the social network imposed in graphical game models may be very sensible when the relevant network of agents is small, in large groups of agents (such as a country's electorate, the entire set of corn growers in the 1950s, sites on the world-wide web, or academic economists) it is often the case that individuals have noisy perceptions of their network's architecture. as the discussion above stressed, complete information poses many challenges because of the widespread equilibrium multiplicity that accompanies it. in contrast, when one looks at another benchmark, where agents know how many neighbors they will have but not who they will be, the equilibrium correspondence is much easier to deal with. moreover, this benchmark is an idealized model of settings in which agents make choices, like learning a language or adopting a technology, that they will use over a long time. in such contexts, agents have some idea of how many interactions they are likely to have in the future, but not exactly with whom the interactions will be. a network game is a modification of a graphical game in which agents can have private and incomplete information regarding the realized social network in place. we describe here the setup corresponding to that analyzed by galeotti, goyal, jackson, vega-redondo, and yariv (2010) and jackson and yariv (2005, 2007), restricting attention to binary action games. 21 uncertainty is operationalized by assuming that the network is determined according to some random process yielding a distribution over agents' degrees, p(d), which is common knowledge. each player i has $d_i$ interactions but does not know how many interactions each neighbor has. thus, each player knows something about his or her local neighborhood (the number of direct neighbors), but only the distribution of links in the remaining population. consider now the following utility specification, a generalization of (2). agent i has a cost of choosing 1, denoted $c_i$. costs are randomly and independently distributed across the society, according to a distribution $F_c$. normalize the utility from the action 0 to 0, and let the benefit of agent i from action 1 be denoted by $v(d_i, x)$, where $d_i$ is i's degree and she expects each of her neighbors to independently choose the action 1 with probability x. agent i's added payoff from adopting behavior 1 over sticking to the action 0 is then $v(d_i, x) - c_i$. this captures how the number of neighbors that i has, as well as their propensity to choose the action 1, affects the benefits from adopting 1. in particular, i prefers to choose the action 1 if $v(d_i, x) \geq c_i$ (14). this is a simple cost-benefit analysis generalizing granovetter's (1978) setup in that benefits can now depend on one's own degree (so that the underlying network is accounted for).
let $f(d, x) \equiv F_c\big(v(d, x)\big)$. in words, f(d, x) is the probability that a random agent of degree d chooses the action 1 when anticipating that each neighbor will choose 1 with an independent probability x. note that $v(d, x)$ can encompass all sorts of social interactions. in particular, it allows for a simple generalization of granovetter's (1978) model to situations in which agents' payoffs depend on the expected number of neighbors adopting, $dx$. existence of symmetric bayesian equilibria follows standard arguments. in cases where v is nondecreasing in x for each d, it is a direct consequence of tarski's fixed-point theorem; in fact, in this case there exists an equilibrium in pure strategies. in other cases, provided v is continuous in x for each d, a fixed point can still be found by appealing to standard theorems (e.g., kakutani) and admitting mixed strategies. 22
homogeneous costs. suppose first that all individuals experience the same cost c > 0 of choosing the action 1 (much like in example 2 above). in that case, as long as $v(d, x)$ is monotonic in d (nonincreasing or nondecreasing), equilibria are characterized by a threshold. indeed, suppose $v(d, x)$ is increasing in d; then any equilibrium is characterized by a threshold $d^*$ such that all agents of degree $d < d^*$ choose the action 0 and all agents of degree $d > d^*$ choose the action 1 (and agents of degree $d^*$ may mix between the actions). in particular, notice that the type of multiplicity that appeared in example 2 no longer occurs (provided degree distributions are not trivial). it is now possible to look at comparative statics of equilibrium behavior and outcomes using stochastic dominance arguments on the network itself. for ease of exposition, we illustrate this in the case of nonatomic costs (see galeotti, goyal, jackson, vega-redondo, and yariv (2010) for the general analysis).
heterogeneous costs. consider the case in which $F_c$ is a continuous function, with no atoms. in this case, a simple equation is sufficient to characterize equilibria. let x be the probability that a randomly chosen neighbor chooses the action 1. then $f(d, x)$ is the probability that a random (best responding) neighbor of degree d chooses the action 1. we can now proceed in a way reminiscent of the analysis of the sis model. recall that $\tilde{P}(d)$ denoted the probability that a random neighbor is of degree d (see equation (4)). it must be that $x = \sum_d \tilde{P}(d)\, f(d, x) \equiv \phi(x)$ (15). again, a fixed-point equation captures much of what occurs in the game. in fact, equation (15) characterizes equilibria in the sense that any symmetric 23 equilibrium results in an x that satisfies the equation, and any x that satisfies the equation corresponds to an equilibrium where type $(d_i, c_i)$ chooses 1 if and only if inequality (14) holds. given that equilibria can be described by their corresponding x, we often refer to some value of x as being an "equilibrium." consider a symmetric equilibrium and a corresponding probability x that a random neighbor chooses action 1. if the payoff function v is increasing in degree d, then the expected payoff of an agent with degree d + 1 is at least that of an agent with degree d, and agents with higher degrees choose 1 with weakly higher probabilities. indeed, an agent of degree d + 1 can imitate the decisions of an agent of degree d and gain at least as high a payoff.
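the fixed-point characterization in (15) is easy to explore numerically. the following minimal sketch (not part of the original text) uses an assumed degree distribution, an assumed benefit function v(d, x) = d·x, and uniformly distributed costs, and locates the values of x at which $\phi(x) = x$.

```python
# a minimal sketch, not from the chapter: locating equilibria of equation (15),
# x = sum_d Ptilde(d) * F_c(v(d, x)), for illustrative primitives:
# v(d, x) = d * x and costs uniform on [0, 2].
import numpy as np

degree_dist = {1: 0.4, 2: 0.3, 3: 0.2, 6: 0.1}                         # assumed P(d)
mean_deg = sum(d * p for d, p in degree_dist.items())
neighbor_dist = {d: d * p / mean_deg for d, p in degree_dist.items()}  # Ptilde(d)

def F_c(c):                          # cdf of costs, uniform on [0, 2]
    return float(np.clip(c / 2.0, 0.0, 1.0))

def v(d, x):                         # benefit of action 1 for a degree-d agent
    return d * x

def phi(x):                          # right-hand side of (15)
    return sum(q * F_c(v(d, x)) for d, q in neighbor_dist.items())

grid = np.linspace(0.0, 1.0, 100001)
diff = np.array([phi(x) for x in grid]) - grid
roots = [(grid[i] + grid[i + 1]) / 2 for i in np.where(np.diff(np.sign(diff)) != 0)[0]]
print("approximate equilibria:", [round(r, 3) for r in roots])   # here: x = 0 and x ~ 0.857
```

in this illustration the map $\phi$ starts above the diagonal, so x = 0 and an interior adoption level are both equilibria; different primitives can of course produce a unique equilibrium or a tipping-point configuration.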
22 in such a case, the best response correspondence (allowing mixed strategies) for any $(d_i, c_i)$ as dependent on x is upper hemi-continuous and convex-valued. taking expectations with respect to $d_i$ and $c_i$, we also have a set of population best responses as dependent on x that is upper hemi-continuous and convex-valued. 23 symmetry indicates that agents with the same degree and costs follow similar actions.
thus, if v is increasing (or, in much the same way, decreasing) in d for each x, then any symmetric equilibrium entails agents with higher degrees choosing action 1 with weakly higher (lower) probability. furthermore, agents of higher degree have higher (lower) expected payoffs. much as in the analysis of the epidemiological models, the multiplicity of equilibria is determined by the properties of $\phi$, which, in turn, correspond to properties of $\tilde{P}$ and f. for instance, • if $f(d, 0) > 0$ for some d in the support of p, and f is concave in x for each d, then there exists at most one fixed point, and • if $f(d, 0) = 0$ for all d and f is strictly concave or strictly convex in x for each d, then there are at most two equilibria: one at 0, and possibly an additional one, depending on the slope of $\phi(x)$ at $x = 0$. 24 in general, as long as the graph of $\phi(x)$ crosses the 45-degree line only once, there is a unique equilibrium (see figure 4 below). 25 the set of equilibria generated in such network games is divided into stable and unstable ones (those we have already termed in section 3.2 as tipping points). the simple characterization given by (15) allows for a variety of comparative statics on fundamentals pertaining to either type of equilibrium. in what follows, we show how these comparative statics tie directly to a simple strategic diffusion process.
24 as before, the slope needs to be greater than 1 for there to be an additional equilibrium in the case of strict concavity, while the case of strict convexity depends on the various values of f(d, 1) across d. 25 morris and shin (2003, 2005) consider uncertainty on payoffs rather than on an underlying network. in coordination games, they identify a class of payoff shocks that lead to a unique equilibrium. heterogeneity in degrees combined with uncertainty plays a similar role in restricting the set of equilibria. in a sense, the analysis described here is a generalization in that it allows studying the impact of changes in a variety of fundamentals on the set of stable and unstable equilibria, regardless of multiplicity, in a rather rich environment. moreover, the equilibrium structure can be tied to the network of underlying social interactions.
figure 4: the effects of shifting $\phi(x)$ pointwise.
indeed, it turns out there is a very useful technical link between the static and dynamic analysis of strategic interactions on networks. an early contribution to the study of diffusion of strategic behavior allowing for general network architectures was by morris (2000). 26 morris (2000) considered coordination games played on networks. his analysis pertained to identifying social structures conducive to contagion, where a small fraction of the population choosing one action leads to that action spreading across the entire population. the main insight from morris (2000) is that maximal contagion occurs when the society has certain sorts of cohesion properties, where there are no groups (among those not initially infected) that are too inward looking in terms of their connections.
in order to identify the full set of stable equilibria using the above formalization, consider a diffusion process governed by best responses in discrete time (following jackson and yariv (2005, 2007)). at time t = 0, a fraction $x^0$ of the population is exogenously and randomly assigned the action 1, and the rest of the population is assigned the action 0. at each time t > 0, each agent, including the agents assigned to action 1 at the outset, best responds to the distribution of agents choosing the action 1 in period t - 1, accounting for the number of neighbors they have and presuming that their neighbors will be a random draw from the population. let $x_d^t$ denote the fraction of those agents with degree d who have adopted behavior 1 at time t, and let $x^t = \sum_d \tilde{P}(d)\, x_d^t$ denote the link-weighted fraction of agents who have adopted the behavior at time t, using the distribution of neighbors' degrees $\tilde{P}(d)$. then, as deduced before from equation (14), at each date t, $x_d^t = f(d, x^{t-1})$, and therefore $x^t = \sum_d \tilde{P}(d)\, f(d, x^{t-1}) = \phi(x^{t-1})$. as we have discussed, any rest point of the system corresponds to a static (bayesian) equilibrium of the system.
26 one can find predecessors with regard to specific architectures, usually lattices or complete mixings, such as conway's (1970) "game of life," and various agent-based models that followed, such as the "voter model" (e.g., see clifford and sudbury (1973) and holley and liggett (1975)), as well as models of stochastic stability (e.g., kandori, mailath, and rob (1993), young (1993), ellison (1993)).
if payoffs exhibit complementarities, then convergence of behavior from any starting point is monotone, either upwards or downwards. in particular, once an agent switches behaviors, the agent will not want to switch back at a later date. 27 thus, although these best responses are myopic, any eventual changes in behavior are equivalently forward-looking. figure 4 depicts a mapping $\phi$ governing the dynamics. equilibria, and resting points of the diffusion process, correspond to intersections of $\phi$ with the 45-degree line. the figure allows an immediate distinction between two classes of equilibria that we discussed informally up to now. formally, an equilibrium x is stable if there exists $\varepsilon_0 > 0$ such that $\phi(x - \varepsilon) > x - \varepsilon$ and $\phi(x + \varepsilon) < x + \varepsilon$ for all $\varepsilon_0 > \varepsilon > 0$. an equilibrium x is unstable, or a tipping point, if there exists $\varepsilon_0 > 0$ such that $\phi(x - \varepsilon) < x - \varepsilon$ and $\phi(x + \varepsilon) > x + \varepsilon$ for all $\varepsilon_0 > \varepsilon > 0$. in the figure, the equilibrium to the left is a tipping point, while the equilibrium to the right is stable. the composition of the equilibrium set hinges on the shape of the function $\phi$. furthermore, note that a pointwise shift of $\phi$ (as in the figure, to a new function $\hat{\phi}$) shifts tipping points to the left and stable points to the right, loosely speaking (as sufficient shifts may eliminate some equilibria altogether), making adoption more likely. this simple insight allows for a variety of comparative statics. for instance, consider an increase in the cost of adoption, manifested as a first order stochastic dominance (fosd) shift of the cost distribution $F_c$ to $\hat{F}_c$. it follows immediately that $\hat{\phi}(x) = \sum_d \tilde{P}(d)\, \hat{F}_c\big(v(d, x)\big) \leq \sum_d \tilde{P}(d)\, F_c\big(v(d, x)\big) = \phi(x)$, and the increase in costs corresponds to an increase of the tipping points and a decrease of the stable equilibria (one by one). intuitively, increasing the barrier to choosing the action 1 means that a higher fraction of existing adopters is necessary to get the action 1 to spread even more.
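a minimal sketch (not part of the original text) of this best-response dynamic is given below; the degree distribution, the benefit function v(d, x) = d·x², and the uniform cost distribution are illustrative assumptions chosen so that the map $\phi$ has a stable equilibrium at 0, an interior tipping point, and a stable equilibrium at full adoption.

```python
# a minimal sketch, not from the chapter: iterating x_t = phi(x_{t-1}) and
# classifying rest points as stable equilibria or tipping points.
import numpy as np

degree_dist = {2: 0.5, 4: 0.5}                                         # assumed P(d)
mean_deg = sum(d * p for d, p in degree_dist.items())
neighbor_dist = {d: d * p / mean_deg for d, p in degree_dist.items()}  # Ptilde(d)

def f(d, x):          # F_c(v(d, x)) with v(d, x) = d * x**2 and costs uniform on [0, 1]
    return float(np.clip(d * x ** 2, 0.0, 1.0))

def phi(x):           # aggregate best-response map from equation (15)
    return sum(q * f(d, x) for d, q in neighbor_dist.items())

def simulate(x0, steps=100):
    x = x0
    for _ in range(steps):
        x = phi(x)
    return x

for x0 in (0.25, 0.35):            # seeds on either side of the tipping point (~0.3)
    print(f"seed {x0:.2f} -> long-run adoption {simulate(x0):.3f}")

eps = 5e-3
for x in np.linspace(0.0, 1.0, 2001):
    if abs(phi(x) - x) < 2.5e-4:   # approximate rest point
        lo, hi = max(x - eps, 0.0), min(x + eps, 1.0)
        stable = phi(lo) >= lo and phi(hi) <= hi
        print(f"rest point near x = {x:.3f}: {'stable' if stable else 'tipping point'}")
```

a pointwise downward shift of $\phi$, for instance from an fosd increase in costs, can be checked in the same way and moves the tipping point up and the upper stable equilibrium down, exactly as described above.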
this formulation also allows for an analysis that goes beyond graphical games regarding the social network itself, using stochastic dominance arguments (following jackson and rogers (2007) and jackson and yariv (2005, 2007)). for instance, consider an increase in the expected degree of each random neighbor that an agent has. that is, suppose $\tilde{P}'$ first order stochastically dominates $\tilde{P}$ and, for illustration, assume that $f(d, x)$ is nondecreasing in d for all x. then, by the definition of fosd, $\phi'(x) = \sum_d \tilde{P}'(d)\, f(d, x) \geq \sum_d \tilde{P}(d)\, f(d, x) = \phi(x)$, and, under $\tilde{P}'$, tipping points are lower and stable equilibria are higher.
27 if actions are strategic substitutes, convergence may not be guaranteed for all starting points. however, whenever convergence is achieved, the rest point is an equilibrium, and the analysis can therefore be useful for such games as well.
similar analysis allows for comparative statics regarding the distribution of links, by simply looking at mean preserving spreads (mps) of the underlying degree distribution. 28 going back to the dynamic path of adoption, we can generalize the insights that we derived regarding the granovetter (1978) model: namely, whether adoption paths track an s-shaped curve now depends on the shape of $\phi$, and thereby on the shape of both the cost distribution $F_c$ and agents' utilities. there is now a substantial and growing body of research studying the impacts of interactions that occur on a network of connections. this work builds on the empirical observations of peer influence and generates a rich set of individual and aggregate predictions. insights that have been shown consistently in real-world data pertain to the higher propensities of contagion (of a disease, an action, or behavior) in more highly connected individuals, the role of "opinion leaders" in diffusion, as well as an aggregate s-shape of many diffusion curves. the theoretical analyses open the door to many other results, e.g., those regarding comparative statics across networks, payoffs, and cost distributions (when different actions vary in costs). future experimental and field data will hopefully complement these theoretical insights. a shortcoming of some of the theoretical analyses described in this chapter is that the foundation for modeling the underlying network is rooted in simple forms of random graphs in which there is little heterogeneity among nodes other than their connectivity. this misses a central observation from the empirical literature, which illustrates again and again the presence of homophily, people's tendency to associate with other individuals who are similar to themselves. moreover, there are empirical studies suggestive of how homophily might impact diffusion, providing for increased local connectivity but decreased diffusion on a more global scale (see rogers (1995) for some discussion). beyond the implications that homophily has for the connectivity structure of the network, it also has implications for the propensity of individuals to be affected by neighbors' behavior: for instance, people who are more likely to, say, be immune may be more likely to be connected to one another, and, similarly, people who are more likely to be susceptible to infection may be more likely to be connected to one another. 29 furthermore, background factors linked to homophily can also affect the payoffs individuals receive when making decisions in their social network. enriching the interaction structure in that direction is crucial for deriving more accurate diffusion predictions.
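the degree-distribution comparison can be illustrated with the same machinery; in the minimal sketch below (not part of the original text) the two degree distributions and the functional form of f(d, x) are assumptions, with the neighbor distribution of the second dominating that of the first, so that $\phi$ shifts up pointwise.

```python
# a minimal sketch, not from the chapter: a FOSD shift of the neighbor degree
# distribution shifts phi(x) up pointwise when f(d, x) is nondecreasing in d.
# here f(d, x) = min(0.3 * d * x, 1), an assumed illustrative form.
import numpy as np

def neighbor_dist(degree_dist):
    m = sum(d * p for d, p in degree_dist.items())
    return {d: d * p / m for d, p in degree_dist.items()}

def phi(x, degree_dist):
    nd = neighbor_dist(degree_dist)
    return sum(q * min(0.3 * d * x, 1.0) for d, q in nd.items())

low = {1: 0.5, 2: 0.3, 3: 0.2}      # baseline P(d)
high = {1: 0.2, 2: 0.3, 3: 0.5}     # shifts mass to higher degrees

print("x     phi_low  phi_high")
for x in np.linspace(0.0, 1.0, 11):
    print(f"{x:.1f}   {phi(x, low):.3f}    {phi(x, high):.3f}")
# phi_high >= phi_low at every x, so tipping points fall and stable equilibria rise
```

as stated above, the upward shift lowers tipping points and raises stable equilibria.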
this is an active area of current study (e.g., see bramoullé and rogers (2010), currarini, jackson, and pin (2006), and peski (2008)). ultimately, the formation of a network and the strategic interactions that occur amongst individuals form a two-way street. developing richer models of the endogenous formation of networks, together with endogenous interactions on those networks, is an interesting direction for future work, both empirical and theoretical. 30
29 the mechanism through which this occurs can be rooted in background characteristics such as wealth, or more fundamental personal attributes such as risk aversion. risk-averse individuals may connect to one another and be more prone to protect themselves against diseases by, e.g., getting immunized.
30 there are also some models that study co-evolving social relationships and play in games with neighbors.
references:
creating social contagion through viral product design: theory and evidence from a randomized field experiment, mimeo
distinguishing influence based contagions from homophily driven diffusion in dynamic networks
a field study on matching with network externalities, mimeo
similarity and polarization in groups, mimeo
the mathematical theory of infectious diseases and its applications
who's who in networks. wanted: the key player
social capital in the workplace: evidence on its formation and consequences
the diffusion of microfinance in rural india
a new product growth model for consumer durables
place of work and place of residence: informal hiring networks and labor market outcomes
who gets the job referral? evidence from a social networks experiment, mimeo
chains of affection: the structure of adolescent romantic and sexual networks
strategic experimentation in networks
diversity and popularity in social networks, mimeo
calvó-armengol, a.
the spread of obesity in a large social network over 32 years
communication and coordination in social networks
a model of spatial conflict
medical innovation: a diffusion study
the role of the airline transportation network in the prediction and predictability of global epidemics
invasion threshold in heterogeneous metapopulation networks
socio-economic distance and spatial patterns in unemployment
learning about a new technology: pineapple in ghana
an economic model of friendship: homophily, minorities and segregation
identifying the roles of race-based choice and chance in high school friendship network formation
evolution of conventions in endogenous social networks, mimeo
the role of information and social interactions in retirement plan decisions: evidence from a randomized experiment
short-run subsidies and long-run adoption of new health products: evidence from a field experiment
learning, local interaction, and coordination
local conventions
on random graphs
on the evolution of random graphs
the market maven: a diffuser of marketplace information
building social capital through microfinance
learning by doing and learning from others: human capital and technical change in agriculture
complex networks and local externalities: a strategic approach, mimeo
non-market interactions. nber working paper number 8053
crime and social interactions
the 1/d law of giving
r&d networks
network formation and social coordination
the strength of weak ties
threshold models of collective behavior
economic action and social structure: the problem of embeddedness
getting a job: a study in contacts and careers
hybrid corn: an exploration of the economics of technological change
ergodic theorems for weakly interacting infinite systems and the voter model
citizens, politics and social communication. cambridge studies in public opinion and political psychology
job information networks, neighborhood effects and inequality
social structure, segregation, and economic behavior
social and economic networks
relating network structure to diffusion properties through stochastic dominance. the b.e.
on the formation of interaction networks in social coordination games
social games: matching and the play of finitely repeated games
diffusion on social networks
diffusion of behavior and equilibrium properties in network games
correlated equilibria in graphical games
homophily, selection, and socialization in adolescent friendships
learning, mutation, and long run equilibria in games
measuring trust in peruvian shantytowns, mimeo
personal influence: the part played by people in the flow of mass communication
graphical models for game theory
a contribution to the mathematical theory of epidemics
economic networks in the laboratory: a survey
the people's choice: how the voter makes up his mind in a presidential campaign
directed altruism and enforced reciprocity in social networks
the dynamics of viral marketing
contagion in complex social networks
models for innovation diffusion (quantitative applications in the social sciences)
endogenous inequality in integrated labor markets with two-sided search
birds of a feather: homophily in social networks
patterns of influence: a study of interpersonal influence and of communications behavior in a local community
global games: theory and applications
heterogeneity and uniqueness in interaction games
asymmetric effects in physician prescription behavior: the role of opinion leaders, mimeo
pastor-satorras
epidemic dynamics and endemic states in complex networks
design innovation and fashion cycles
complementarities, group formation, and preferences for similarity
workers and wages in an urban labor market
the diffusion of hybrid seed corn in two iowa communities
local network effects and network structure
les lois de l'imitation: etude sociologique. elibron classics, translated to english in the laws of imitation
social interactions, local spillovers and unemployment
identifying formal and informal influence in technology adoption with network externalities
medical innovation revisited: social contagion versus marketing effort
key: cord-266771-zesp6q0w authors: pablo-martí, federico; alañón-pardo, ángel; sánchez, angel title: complex networks to understand the past: the case of roads in bourbon spain date: 2020-10-06 journal: cliometrica (berl) doi: 10.1007/s11698-020-00218-x sha: doc_id: 266771 cord_uid: zesp6q0w
the work aims to study, using gis techniques and network analysis, the development of the road network in spain during the period between the war of succession and the introduction of the railway (1700–1850).
our research is based on a detailed cartographic review of maps made during the war of succession, largely improving preexisting studies based on books of itineraries from the sixteenth century onwards. we build a new, complete map of the main roads at the beginning of the eighteenth century along with the matrix of transport costs for all the important towns describing the communications network. our study of this complex network, supplemented by a counterfactual analysis carried out using a simulation model based on agents using different centralized decision-making processes, allows us to establish three main results. first, existing trade flows at the beginning of the eighteenth century had a radial structure, so the bourbon infrastructure plan only consolidated a preexisting situation. second, the development of the network did not suppose important alterations in the comparative centrality of the regions. finally, the design of the paved road network was adequate for the economic needs of the country. these findings are in stark contrast with claims that the radial structure of the bourbon roads was designed ex-novo with political or ideological objectives rather than economic ones. our methodology paves the way to further studies of path-dependent, long-term processes of network design as the key to understanding the true origin of many currently existing situations. few historical processes are so relevant to understand our present as the design and temporal development of transport networks. as these are processes with a strong path dependence (david 1985) , decisions that were made long ago continue to directly and intensely affect society in areas as diverse as economic growth (peters 2003; calderón and servén 2004; faber 2014) , territorial cohesion (badenoch 2010; crescenzi and rodríguez-pose 2012; monzon et al. 2019; naranjo gómez 2016) , urban development (weber 2012; modarres and dierwechter 2015) or electoral processes (nall 2015 (nall , 2018 . consequently, knowledge of the motivations behind the implementation of the new transport infrastructures, of the economic and territorial effects they induced and of the adequacy of their design are therefore important focuses of attention not only for academic analysis but also for social and political debate. in this area, the development of national road networks is one of the issues of greatest interest since they form the basis on which the other transport networks are implemented. the works about the european motorway network (peters 2003; badenoch 2010; crescenzi and rodríguez-pose 2012) , the chinese (faber 2014) or the american one (nall 2015; modarres and dierwechter 2015) are clear examples of this interest. improvements in transport infrastructure have been considered by centuries (ward 1762; smith 1776 ) as essentially positive, as they lead to economic growth and poverty alleviation through reductions in transport costs and the facilitation of market integration (calderón and servén 2004) . however, accessibility problems can sometimes benefit certain economically weaker territories by protecting their producers from competition from other regions with lower costs and greater product variety (peters 2003) . if the development of new transport networks is to improve territorial balance, there must be sectors capable of withstanding increased competition; otherwise, market integration can lead to self-sustaining inequality (martin 1999; asher and paul 2016) . 
on the other hand, the long time required for the development of road networks can generate permanent differential effects on the development of regions even if the final design is territorially neutral. during the period of network construction, traffic flows are altered, possibly inducing variations in the comparative advantage of cities that can become permanent and subsist after completion (berger and enflo 2017) or even after their disappearance (bleakley and lin 2012; aggarwal 2018) . the case we concern ourselves within this paper, namely the design and first development of the contemporary spanish road network, falls within this area of research, although it is older than the cases mentioned above. interestingly, more than two centuries after charles iii formally established the guidelines that determined its design and its present structure, it continues to be the object of lively social attention, as evidenced by the large number of recent historical (rosés et al. 2010; martínez-galarraga 2010; bel 2011; martinez-galarraga 2012; fernández-de pinedo echevarría and fernández-de pinedo fernández 2013; díez-minguela et al. 2016 ) and non-historical academic works (bel 2010; molinas 2013; holl 2011 holl , 2012 garcia-lópez 2012; martínez-galarraga et al. 2015; garcia-lópez et al. 2015; holl 2016) , as well as by the abundant news or political discussions 1 that take place around it. however, the scope of this paper goes way beyond the specific historic process we study: we are introducing a methodology combining deep and thorough historical research, computer geolocalization techniques, analytical tools of complex network science and agent-based modeling, that can be applied to many other processes. it is becoming increasingly clear that sizable gains can be realized from research that seeks to better understand how local history and context can be leveraged to inform the design of better policy (nunn 2020) . as we will see below, our methodology can be very useful in that context, providing deep insights such as assessing whether a process is pathdependent or not, identifying the effects and disequilibria of processes at multiple levels (national, regional, town), studying counterfactuals to advise policy making, and in sum, providing a faithful picture of the system of interest useful for understanding the past and handling the future. the origin of the current radial structure of the spanish road network and, in general, of its transport network, has been the object of extensive historical debate (tortella 1973; nadal 1975; artola et al. 1978; mateo del peral 1978; gómez mendoza 1982; comín comín et al. 1998; bel 2011) due to its influence on spain's subsequent economic development and its territorial imbalances, but most of the available research addresses the period after 1850. in turn, the possibility that the decisions are taken in the eighteenth and the nineteenth centuries regarding the design of the transport networks have generated a lock-in (arthur 1989 ) that has prevented the subsequent development of more efficient transport networks at present is an issue of great relevance but has not been the subject of many quantitative studies. as far as we know, the studies that have analyzed it from a quantitative perspective have been very limited and focused on the railways (equipo urbano 1972; martí-romero et al. 2020) . 
the works of carreras montfort and de soto are an important exception, but they mainly refer to the roman period (carreras montfort and de soto 2013; de soto 2019) . their analysis for the roads of the eighteenth century uses the work of escribano (1758) as a main geographical reference, which implies a significant restriction in terms of time reference and potential geographical bias as they acknowledge (carreras monfort and de soto 2010, p. 13) . 1 an indirect evidence of the social interest about this topic can be found by comparing the number of results obtained in google with other similar ones. so, while the searches 'carreteras and "felipe v"' and 'carreteras and "carlos iii"' obtained 174,000 and 1.160,000 results respectively, the searches 'ferrocarriles and "isabel ii"' and 'ferrocarriles and "alfonso xii"' were only 363.000 and 176.000, although these are more recent facts. the interest for non-spanish speakers is also high: 'roads and "charles iii" spain' had 179.000 results while 'roads "albert gallatin"' yields 183.000 hits and '"interstate highway system"' 1.050.000 ones (searched on january 9, 2020). the works that have provided a more direct and quantitative approach to the eighteenth-century road network make a somewhat negative assessment of it (madrazo madrazo 1984b; uriol salcedo 2001; martinez-galarraga 2012) . however, these works concern themselves more with an assessment of its results than with a detailed analysis of its layout comparing it with other alternative designs of the network. in fact, other authors accepted that there were some improvements from the previous situation (herranz-loncán 2005; grafe 2012) . in this context, it has to be borne in mind that path dependence and consequent lock-in are phenomena that occur under particular conditions and are the result of underlying, more fundamental mechanisms (vromen 1995) . therefore, it is necessary to adopt an approach of 'path as process,' in which the development of road networks must be understood as a continuous process of generation and destruction of optimal trajectories for the agents which produce a transformation of the existing socioeconomic structures and regional development trajectories (garud and karnøe 2001; martin and sunley 2006) . this aligns with what bednar et al. (2012) calls 'revised-path dependence,' which is of crucial importance in the interpretation of political and legal issues but also economic decisions. this implies as well that policy choices as the construction of new roads must be evaluated in light of decisions taken during the process, not just the final equilibrium. for our work on the origin and evolution of the radial structure of the spanish road network, these ideas lead us to ask whether the design of the network was unavoidable given the topological characteristics of the territory or whether it resulted from small events as the political decisions. from our point of view, an event generates path dependence when it alters the preexisting dynamics in the system, by exceeding its capacity of resilience. this resilience is usually high, and non-chaotic dynamic systems show a tendency to maintain and recover their dynamics, but if the actions or influences on it exceed certain thresholds they can change their evolution in a lasting way in time. in this framework, the road network can be seen as a system in which the links are continuously weakened and require maintenance to remain functional. improvements to the sections require specific investments to take place. 
the flows between the nodes derive from their relative position and size, but this is, in turn, affected by the network. there is, therefore, a dynamic of the development of the nodes that can be altered depending on decisions such as the order of execution of the works. this is a complex phenomenon and is multi-scale both in terms of time and space. from a historical viewpoint, the process is affected by the vision of permanence of the human being, and events that last more than one generation are usually considered to be permanent even though they are not permanent. from the spatial point of view, events that do not produce permanent changes in the system when considered as a whole can produce very relevant alterations at the local level. all this makes the consideration of an event as a path-dependent one always complicated and requires a precise definition of the reference framework and the spatial domain under consideration. from the perspective summarized above, in this paper, we use historical data, gis techniques, and network analysis, to critically examine three statements widely discussed in the literature but far from empirically validated. the first statement is that the dynastic change after the war of succession meant the ex novo implementation of a radial design of the road network that replaced a preexisting, less centralized one. the works about the issue consider that the radial structure of the spanish communications network arises around 1720 through the general regulations for the direction and government of the post and post offices of spain (uriol salcedo 1977; madrazo madrazo 1984a, b; ruiz et al. 2015) or, alternatively, and focusing on the roads, with the royal decree issued in 1761 to make straight and solid roads in spain and to facilitate trade between provinces (de gregorio -marqués de esquilache 1761). this radial structure is usually linked to a more centralized vision of spain motivated by the enlightened vision of the new dynasty (uriol salcedo 1985a, b; shaw 2009; carreras monfort and de soto 2010; bel 2011; garcia-lópez et al. 2015; lópez ortiz and melgarejo moreno 2016) . to analyze this statement more precisely, we have divided it into two hypotheses: h1a: 'the newly paved roads of the pre-rail period imply the introduction of a new network design.' h1.b: 'the changes in the road network are linked to the enlightenment vision of the state that came with the bourbon dynasty.' the second statement posits that the new roads generated significant changes in the transport network that led to improving the communications of madrid with the periphery instead of activating the growth of the interior regions (ringrose 1972; anes 1974; carr 1978; madrazo madrazo 1984b) . to verify this statement, quite consolidated in the literature though not entirely (grafe 2012), we will use three hypotheses: h2.a 'the newly paved roads produced important changes on the interregional mobility patterns'; h2.b 'the improvements in accessibility resulting from the newly paved roads were concentrated in a few regions, mainly madrid and the coastal regions, which was a comparative disadvantage for the inland regions,' or, in other words, 'those investments affected the regions differently'; h2.c 'the effects were mainly at the level of cities, not so much of regions.' finally, the third statement we address is whether the network design is compatible with diverse economic efficiency criteria. 
more precisely, we consider two hypotheses: first, h3.a 'the radial network designed by the bourbon monarchs of the second half of the eighteenth century emerges from a distributed decisionmaking process in which all the populations of the nation are taken into account in a non-discriminatory manner' (martinez-galarraga 2012) . second, h3.b, 'the military, political or administrative criteria outlined by various authors (ringrose, gómez mendoza, madrazo, bel) are compatible with that of maximizing commercial traffic' (ringrose 1972; anes 1974; madrazo madrazo 1984a, b; gómez mendoza 2001; bel 2011) . our research is based on three pillars. first, we carry out an in-depth historic analysis of the time evolution of the land communications in spain, from the midsixteenth century through the eighteenth century. our work presents the most detailed and well-documented study of this process to date, based on an exhaustive search of relevant maps and reports about the structure of the road network. second, we resort to complex network analysis techniques, that are becoming widely used in many disciplines, to quantify the effect of the historical evolution of the road network on the communication structure of the territory. finally, we introduce an agentbased model that allows us to explore different counterfactuals to obtain a network that would result from different decision-making processes. we compare centralized processes with other coordinated, democratic alternatives that considered the distribution of the population in the territory, assessing the relevance of politically motivated decisions on the evolution of the network. in this section, we present an in-depth study of the historic evolution of the spanish land transportation network. we aim to show that the road network in spain that emerged after the war of succession was the same from the beginning of the modern age, since during the sixteenth and seventeenth centuries investments in transport infrastructure were reduced and in most cases limited to road maintenance tasks. although it is pointed out often that the modern road network in spain has its roots in roman roads (uriol salcedo 1985a; bel 2011; garcia-lópez et al. 2015) , this statement is only partially true. to the extent that the orographic conditions, that is, the physical characteristics of the territory such as river and mountains, remained largely unchanged, as advances in construction technologies were not yet too important, 2 and many of the large population centers of the roman era were still relevant at the beginning of the 18th century, both networks show important similarities from their functionality as they give answers to similar problems and, therefore, many of their links connect the same nodes. however, from a technical perspective, they are different, as the roads run along largely mismatched routes. in most cases, eighteenth-century roads only overlap with old carriageways at the entrances to cities, on bridges, and on mountain passes that were still preserved from roman times (cf. "appendix," fig. 21 ). in any event, it is commonly accepted that the network of roads existing at the beginning of the eighteenth century was not structured around particularly important roads or cities, showing high road densities in old castile and low densities in the northwest of the peninsula (galicia and asturias) and the southwest part of the central plateau. 
it was also a markedly internal network, in which connectivity with coastal towns was practically non-existent, except with catalonia and the north of valencia. in terms of quality, it was quite inadequate, even by the standards of the time (dirección general de obras públicas 1856), making the widespread use of carts in the transport of goods difficult (menéndez pidal 1951; madrazo madrazo 1984a, b; uriol salcedo 2001) . in what follows, we critically consider this view and appraise its accuracy and the data on which it is based. one first point we need to consider is that the spatial relations between townsthe topology-of the underlying road network is a crucial aspect in assessing the suitability of infrastructure investment decisions. however, most studies on changes in the transport network linked to the arrival of the bourbons do not rely on cartographic information, but on books of itineraries and accounts of travelers (madrazo madrazo 1984a; uriol salcedo 2001; bel 2010; carreras monfort and de soto 2010) dating in most cases to two centuries before (villuga 1546; de meneses 1576) or already halfway through the century (escribano 1758) . crucially, maps and itineraries have different purposes, being related to either 'state' or 'process' as explained by (downs and stea 1977) . in the eighteenth century, maps offered much richer and more general information, but were expensive and could sometimes lead to confusion if they were misrepresented. itineraries, on the other hand, were simply lists of the towns located on the roads, i.e., they had much more limited but clearer information. among their main shortcomings was that itineraries cannot be connected, no matter how close they are, if they do not have towns in common; this could make travelers incur big detours. however, their reliability and low reproduction costs meant that they were widely used until the end of the nineteenth century. only in the field of postal communications have maps received some attention (líter mayayo 2005; aranaz del río 2005), although generally from a purely descriptive perspective. it is widely accepted that these itineraries did not reflect the entire network, but only a selection of roads that were relevant to the small public to which they were addressed, given the large proportion of the illiterate population. madrazo madrazo (1984a) , for example, considers that the actual network may have been twice the size of the reported one. further comparison with the information provided in the topographic relations of felipe ii (páez de castro et al. 1578) for toledo and madrid shows that a multitude of roads is missing (alvar ezquerra 1985; madrazo madrazo 1984a) . uriol salcedo (2001) gives further support to the idea that there is a lack of important roads according to other sources, such as chronicles of travelers. in this situation, itineraries remain the most suitable sources for the analysis of the structure of the road network in the sixteenth and seventeenth centuries, as they were generally more accurate than maps. this is not the case, however, for the eighteenth century, 3 in which there were already detailed road maps such as those of de wit (1700), fer et al. (1701), valk (1704 ), visscher (1704 , 1705 , van der aa (1707), senex (1708), mortier and sanson (1708) , allard (1710) , homann and remshard (1710) , mortier (1710) , and moll et al. (1711) . 
many of these maps, of french, english, and, above all, flemish origin, were compiled for military reasons during the war of succession (1701-1714) and precede by four decades the unfinished work of the jesuits carlos martínez and claudio de la vega, commissioned by the marqués de la ensenada (martínez et al. 1743), the first national attempt to have a detailed map of the roads of spain. this surprising lack of attention by historians to the cartography of the early eighteenth century is likely to arise from three interrelated motives. on the one hand, it was difficult to access these materials before their digitization and dissemination on the internet. on the other hand, primary sources such as sarmiento (1757) discredited them with claims that in the middle of the century there were still no general maps of spain with the cartographic precision necessary to adequately determine the itineraries by 'air.' this amounts to saying that serious errors made it very difficult to define the general scheme of the road network, because the distances that appeared between the cities were not correct. finally, there was also an inadequate interpretation of secondary sources such as madrazo madrazo (1984a) that confuse the lack of precise general maps with the lack of reliable maps and, in any case, of maps more detailed and complete than the itineraries. despite these issues, road maps of the early eighteenth century, although still imprecise in terms of distances because the location of the cities and towns still showed some degree of error, gather complete information on the road connectivity between them. hence, by applying these connections to current maps in which the locations of the towns and cities are exact, it is possible to reconstruct accurately the old road maps. figure 1 shows the differences between the two visions, i.e., the itinerary vs the map approaches, comparing the network that appears in the itinerary of villuga (1546) with that of the map of valk (1704). the latter clearly shows that, as in the rest of the maps of the period, the density of the network is quite high and homogeneous in all the territory and that the coastal zones have adequate connectivity with the interior, which contrasts not only with the itineraries of the sixteenth century but also with those of escribano of 1758.
fig. 1 caption: left, network based on the itinerary of villuga (1546), showing a network with a highly heterogeneous density, i.e., with zones with very many roads and other zones in which they barely exist; notice the important lack of communication with the coast, implying a degree of disconnection between the periphery and the peninsular center. right, map based on valk (1704), showing a quite homogeneous road density across the iberian peninsula, good communications with the coast, and without noticeable discontinuities across reigns. maps by the authors from the references indicated above.
importantly, none of the maps analyzed in this work shows a categorization of roads, which suggests that there were no significant differences in quality between the roads included. in general, all of them were quite deficient and scarcely financed, as shown by the report on the state of public works in spain (dirección general de obras públicas 1856, p. 16, translation by the authors): the roads that existed in spain before the middle of the eighteenth century were nothing more than simple paths, in which the difficult sections were improved somewhat by building bridges and other works in order to cross the main rivers.
these works were executed sometimes by the munificence of the kings, other times with funds provided by the towns or gentlemen who had a direct interest in communications, and in the greatest number of cases resorting to the system of personal service. looking at the degree centrality (defined as the number of connections of a town with others without intermediate steps), we see that the results by madrazo madrazo (1984b) for forty towns in 1720 based on post itineraries (martínez de grimaldo 1720) differ from that obtained through maps (mortier 1710) . thus, in the first case, the average is 1.18, and in the second it is 3.70. concerning the number of cities without documented connections, the result from the itinerary is 52.5% while on the map it is only 5.0%. this means that the view mentioned above, claiming that a large part of the territory was disconnected from the rest and that the network was sparse, is highly questionable, because it is based on itineraries, implying much more subjective and fragmented information. the image that arises from our analysis and, in particular, from the integration of the maps of the time available shows a dense network of low-quality roads that differs little from early nineteenth century maps such as quiroga (1811) (1838), although, naturally, without the newly paved roads built from the second half of the eighteenth century. once philip v was consolidated in power after the war of succession, a profound reform of the communication and transport systems was started according to the ideas of the age of enlightenment. it is important to point out that this reform involves two different processes, 4 although with many points in common, in agreement with the fact that the administrative division was quite confusing throughout the eighteenth century. reforms in the first area focused on improving the postal system and the displacement of elites and had a markedly legal and organizational character. on the other hand, a second, different reform process was aimed at reducing the costs of transport employing important investments in infrastructures. we now discuss these two processes separately. the beginning of the eighteenth century brought about a profound transformation in the management of the spanish postal system, imposed by the communication needs of an empire still of enormous dimensions at a time of great economic precariousness. to this end, in 1706 philip v rescinded the postal monopoly of the vélez de guevara family, which, as descendants of the tassis family, had held it since the sixteenth century. from that moment on, the post was managed by private individuals until 1716, when the position of correo mayor de españa was abolished and the postal service became a centralized public service directly dependent on the crown (rodríguez gonzález 1983) , becoming an important source of income for it. spain thus became one of the pioneering countries in managing the postal service from the state and a model for other european countries. with the postal system fully in the hands of the crown, in 1720 philip v established the bases for its reorganization through the general regulations for the direction and government of the post and post offices of spain. this document, in addition to other measures, includes a list of post roads of eminently radial structure and with a center in madrid that followed to a large extent the main 'unmounted' post roads included in the 17th-century guides, cf. figure 2 ). 
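returning briefly to the degree centrality comparison above (average degree 1.18 versus 3.70, and 52.5% versus 5.0% of towns without documented connections), the two statistics can be computed from any edge list of documented roads. the following minimal sketch (not the paper's code or data) illustrates the calculation; the towns and roads are placeholders, not the forty-town sample or the mortier map.

```python
# a minimal sketch with placeholder data: average degree and the share of towns
# without documented connections, computed from an edge list of roads.
from collections import defaultdict

towns = ["madrid", "toledo", "burgos", "valencia", "sevilla", "cuenca"]
roads = [("madrid", "toledo"), ("madrid", "burgos"), ("madrid", "valencia"),
         ("toledo", "sevilla")]            # assumed toy itinerary; "cuenca" is isolated

def degree_stats(towns, roads):
    degree = defaultdict(int)
    for a, b in roads:
        degree[a] += 1
        degree[b] += 1
    avg = sum(degree[t] for t in towns) / len(towns)
    isolated_share = sum(1 for t in towns if degree[t] == 0) / len(towns)
    return avg, isolated_share

avg, isolated = degree_stats(towns, roads)
print(f"average degree: {avg:.2f}; towns without documented connections: {isolated:.1%}")
```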
this hypothesis that the structure of the network of post roads was not an ex novo design, but was based on the preexisting structure, is confirmed by a map made by jaillot on behalf of the marqués de grimaldo (jaillot and martínez de grimaldo 1721) in which 'all the ways and posts of spain are exactly collected and observed, following the memoirs of the major couriers of madrid.' although the date of edition is 1721, the information collected on the senders, and the fact that the information on the postage of the letters corresponds to that in force previously, suggests that it is a map that reflects the situation of the network of posts before the promulgation of the 1720 regulation (aranaz del río 2005) and that it was commissioned by the marqués de grimaldo to be used as a reference for its reform. with this caveat, we can now compare the jaillot-martínez de grimaldo map with the itineraries resulting from the regulation.
fig. 2 caption: changes arising from the 1720 regulations. red, post roads preexisting this regulation that disappeared subsequently. blue, post roads arising after the regulation. gray, post roads preexisting the regulation that remained in use afterward. map by the authors based on martínez de grimaldo (1720) and jaillot and martínez de grimaldo (1721).
the comparison shows no increase in centralism, but, on the contrary, a less radial structure with the incorporation of two important north-south axes, one following the so-called silver way ('ruta de la plata' in spanish, running close and parallel to the border with portugal) and the other following the mediterranean coastline between the french border and alicante. the connection with portugal via ciudad rodrigo and the branch to granada were, however, excluded. for reference, fig. 3 shows two maps of the post roads in the 17th century, obtained from italian couriers, showing a largely similar layout to the one arising from the regulation.
fig. 3 caption (serra 2003): the information available from codogno (1608), left, and from miselli (1697), right, shows a structure similar to the one to be established by the 1720 regulations, with great axes connecting the center of the peninsula with the big cities in the periphery, supplemented with several transversal axes. an east-west axis in the ebro and duero valleys that disappeared after 1720 can also be seen. maps by the authors from the sources indicated above.
to properly understand this discussion, it should be borne in mind that the definition of new post roads does not necessarily mean major changes in the structure of the road network, since it is only a remodeling of the routes followed by the couriers to communicate post boxes with each other and thus transmit messages more quickly. this is apparent when we compare the post offices that appear on the map of fer and loon (1701), probably the first post map in spain, with the routes that appear over time. it can be seen that, although the routes undergo great changes, the post offices are, to a large extent, the same (cf. fig. 4). this largely virtual design of the post roads facilitates making profound modifications to the routes without major costs, as the location of the staging posts, the only physical element of the network, does not require major modifications. the considerations above do not mean that the definition of the post routes did not have implications for infrastructures. as they were priority routes for the crown, they received greater attention in terms of maintenance; on the other hand, as they were more traveled, they became safer and had better services (new inns), which generated a virtuous cycle that increased their relevance for transport. however, while the remodeling of the post network meant an important improvement in communications and the mobility of the most privileged social strata, it did not alter to a great extent the capacities of freight transport, as tracks remained mostly unpaved.
the information available from codogno (1608), left, and from miselli (1697), right, shows a structure similar to the one to be established by the 1720 regulations, with great axes connecting the center of the peninsula with the big cities in the periphery supplemented with several transversal axes. it can also be seen an east-west axis in the ebro and duero valleys that disappeared after 1720. maps by the authors from the sources indicated above strata, it did not alter to a great extent the capacities of freight transport as tracks remained mostly unpaved. in 1747, ferdinand vi issued a royal decree creating the office of general superintendent of couriers, stations and post offices, in charge of the maintenance of roads and the construction of new ones, placing under the tutelage of the postal service the responsibilities that previously fell on the mayors and quartermasters. as an important novelty, regular financing was established for the development and conservation of the main road network at the expense of the treasury (martín mora 2017). this change in the funding system was because the traditional system, based on local funding and obtained mainly through portazgos (entrance and exit tolls), was insufficient to undertake the intended expansion of the network (bel 2011) . as a result of this interest in improving the road network, the first roads were built in 1749, namely the santander-reinosa road and the guadarrama pass. however, it was not until 1761, with charles iii on the throne, that general legislation on the road network and public works can be considered as thoroughly established (dirección general de obras públicas, 1856). the royal decree issued in 1761 to make straight and solid roads in spain and to facilitate trade between provinces (de gregorio -marqués de esquilache 1761) is a clear example of enlightened action to promote trade and development. it is a brief document of four pages in which it is decided the continuation of the canal of castile and the construction of 'all the roads convenient for the common utility at the expense of the royal treasury, starting with the main ones from the court to the provinces, with a fixed assignment; that concluded these, all the others are implemented, that ensure the easy communication of some provinces with others, and even of some villages with fig. 3 ) shows the effect of the war of succession on communications. the role of madrid as the center of the network is enhanced and connections with the crown of aragón and with portugal decrease, as a consequence of the larger presence there of habsburg supporters. the fact that the borbonic side is more effective in keeping control of the big coastal cities is reflected in the connections with those, that survive the war. the 1704 map, at the height of the war, is similar to the 1701 one, except for the restoration of the connection with andalucía through castilla la nueva and the suppression of the connection to france via valladolid, with two new axes connecting directly to santiago and burgos. this map points to a radial conformation of the network, that appears to be consolidating in the 1721 map, previous to the regulation issued that year and following seven years of peace. in agreement with this, the post network takes a very radial structure, while the transversal axes remain but reduced to the unmounted category. in 1775, the network is the most centralized one, as can be seen from the disappearance of the transversal axes. 
figure caption (evolution of the post-road network, 1701-1821): the 1701 map (cf. fig. 3) shows the effect of the war of succession on communications. the role of madrid as the center of the network is enhanced and connections with the crown of aragón and with portugal decrease, as a consequence of the larger presence there of habsburg supporters. the fact that the bourbon side was more effective in keeping control of the big coastal cities is reflected in the connections with those cities, which survive the war. the 1704 map, at the height of the war, is similar to the 1701 one, except for the restoration of the connection with andalucía through castilla la nueva and the suppression of the connection to france via valladolid, with two new axes connecting directly to santiago and burgos. this map points to a radial conformation of the network, which appears to be consolidating in the 1721 map, previous to the regulation issued that year and following seven years of peace. in agreement with this, the post network takes a very radial structure, while the transversal axes remain but reduced to the unmounted category. in 1775, the network is the most centralized one, as can be seen from the disappearance of the transversal axes. a less centralized structure is recovered in 1804, with the reappearance of routes like the silver way (parallel to the portuguese border), in aragón, and the france-portugal connection through valladolid. in 1821, the process is completed and all routes have become mounted. map by the authors based on de fer and loon (1701), de fer (1704), jaillot and martínez de grimaldo (1721), lópez de vargas y machuca (1760), brion de la tour and desnos (1766), espinalt y garcía (1775, 1804) and de ayala (1821).
this ambitious infrastructure plan began with the construction of the roads from madrid to andalusia, catalonia, galicia, and valencia and the introduction, for its funding, of a national salt tax to which all people would be subject, 'ecclesiastical or secular, by duty, all contribute to an object that includes the common benefit' (translation by the authors). from 1761, charles iii and his successors carried out an intense infrastructure policy that laid the foundations of the current transport network. in the 1761 royal decree, there is no evidence of a decision on the part of the monarch to build a radial network; instead, the decree expresses a desire to favor transport throughout the national territory. indeed, at the time of charles iii's death, twenty-seven years later, the roads built could hardly be considered a radial network (de ita and xareño 1789; madrazo madrazo 1988). this does not mean, however, that the construction of all roads was considered to be equally urgent. during the years before the promulgation of the royal decree, there was an active academic and political debate on how to improve the spanish transport network. the prioritization of communications between the capital and the periphery through straight roads, as well as the funding of works by the royal treasury, appearing in the royal decree of 1761 is largely the result of the influence of road arbiters who made society aware of the urgent need to improve the transport network (madrazo madrazo 1974). thus, de quintana (1753) proposed the creation of a network of radial roads, starting from the main ports and converging in madrid, and another, transverse, distributed throughout the kingdom, which would allow a very significant reduction in transport costs. unlike other projects of the time, de quintana's proposal focuses on the means of transport, the ox cart, and the logistics system. he designed a network in which every 3 or 4 leagues oxen would be replaced to achieve greater speed and load capacity, similar to the way horses were replaced on post roads to increase the speed of letters and travelers. two years later, fernández de mesa, with his legal and political treaty of public roads and inns (fernández de mesa 1755), contributed ideas on funding that complemented those already appearing in the royal decree of 1747. he proposed that the construction and maintenance of the roads should be covered by those who benefited from them, mainly municipalities, but also by the lords and the church, something quite controversial at the time. the crown should only contribute to the royal roads 'for being the prince's and enjoy in them the protection, and jurisdiction as well as the usefulness of facilitating the prompt expedition of his orders, posts and military functions' (translation by the authors).
along similar lines, miguel sarmiento had a deeply centralist vision, in which the capital had to play a fundamental role in the territorial structuring of the nation (reguera rodríguez 1999). using a purely geometric perspective, he proposed a network of 32 radial roads from madrid to the most remote parts of spain in imitation of the military routes of the romans (sarmiento 1757). the roads should be straight, avoiding the towns, so that if any lay on their way the road should go around them. interestingly, this non-urban design would be used centuries later for the development of the german motorways. to avoid the congestion that would occur in madrid, as the center of the network, he proposed the construction of a circular road to distribute the traffic, in a similar way to the motorways that now surround the cities. one of the reports that probably had the greatest influence on the subsequent development of the road network is the economic project, which introduces various measures aimed at promoting the interests of spain, written by ward in 1760 and published posthumously in 1762 (ward 1762). among his indications for a better government of the country, he dedicates two chapters to the improvement of transport. however, his attention is very focused on the development of river navigation, dedicating only four of the fifteen pages devoted to this subject to land communications, two of them referring to funding aspects. in general, he considers roads as a system subsidiary to the rivers (translations by the authors): 'since there can be no navigable rivers or canals everywhere, good roads have to make up for this lack.' it is therefore not surprising that the road needs were limited to 'six major roads, from madrid to la coruña, to badajoz, to cadiz, to alicante and the line of france, as well for bayonne's part as for perpignan's part' as the transversal connectivity was covered by rivers and canals. the choice of these roads was probably based on the importance that these roads had historically had and on the use that was made of them at that time as post roads. these radial ways should be complemented by 'different ways of crossing from one city to another, and making the king the first cost (as befits) is very just, that henceforth the ways be maintained by the villages themselves who will enjoy the benefit of this providence.' like sarmiento, ward emphasizes the importance of the straightness of the roads 'at the cost of any difficulties' because the reduction of the distance compensates with time the higher initial costs. the orientation of the infrastructure plan toward achieving full national connectivity, although generally accepted in the technical and political fields, was not, however, exempt from criticism (reguera rodríguez 1999). jovellanos, in his report on the agrarian law (de jovellanos 1795), considered that it would have been more beneficial to encourage economic activity by improving local connectivity rather than building major national axes, and not to start new projects before those already started had been completed. with this historical review of the development of the spanish road network and its evolution in mind, we now turn to a quantitative study of the most relevant issues related to it. in recent years, profound changes have been taking place in the social sciences.
the increase in computing power, the availability of more sophisticated software, and the growing accessibility of digitized primary sources mark the beginning of an era of huge methodological changes in the field of economic history. the decreasing costs of mass digitization of maps and other primary sources, together with the development of character recognition systems, including handwriting recognition thanks to deep learning, make it possible to analyze information that until recently was inaccessible or at least untreatable. thanks to these new possibilities, quantitative analysis and economic theory can be combined to advance our understanding of history in new ways (wehrheim 2019). this is what mitchener calls 4d economic history: digitally driven data design. the development of 4d has the potential to cover more areas, and in greater depth, than the exploitation of ipums, the integrated public use microdata series, has done so far (mitchener 2015). on the other hand, a major contribution to this toolbox has been geographic information systems (gis) which, by incorporating the time dimension, have become very useful in the field of the social sciences (gutmann et al. 2018). gis consists of an integrated database management system with a mapping tool that allows easy representation of data using maps but also integrates scattered information using location as a nexus and analyzes data spatially. this has allowed historians to handle data with a spatial component in a more powerful and sophisticated way than was common in the past (healey and stamp 2000; ell and gregory 2001), especially in the field of transport (atack 2013; perret et al. 2015). often, in economics, interdependencies between the different aspects of the phenomena studied are not adequately considered (schweitzer et al. 2009), which is a major weakness since the relationships between entities matter, whether they are people, objects, ideas, or anything else (brughmans 2013). rather than focusing on these entities in isolation, the network perspective allows for an explicit examination of the relationships between them in order to properly understand their behavior (wasserman and faust 1994; watts 2003; de nooy et al. 2005). this is particularly common in the area of infrastructures, not just transport, where very often the analysis focuses on differences in regional endowment without taking into account that effects, both positive and negative, may occur far from the location of the entity. space is not just an additional element of the economy: it constitutes, as reggiani and nijkamp (2009) say, an intrinsic feature of any geo-economic system and can give rise to the emergence of complex nonlinear and interactive behaviors and processes in a geographical environment. network analysis makes it possible to overcome these limitations by making these relationships explicit and enabling global analyses of the phenomenon using widely contrasted mathematical tools. to this end, we have made a thorough effort to establish a correct network representing the roads, beginning of course with the proper starting point, namely the situation at the turn of the eighteenth century. as there is no suitable cartography for this purpose, it was necessary to compile it from the dispersed information available. valk's (1704) map was chosen as the basis for our reconstruction because it has the greatest regional connectivity among those elaborated in the first years of the century.
as this map was made taking into account the most important connections between cities, it suffered from a certain lack of national integration through long-distance connections, a problem we solved by resorting to the itineraries of anonymous and pontón (1705) and de b. f. and de grieck (1704). it would have been desirable to integrate all available road information, but this would have resulted in an excessively complex network for simulations; at the same time, we are confident that the so-obtained network is a fairly accurate description of the roads in the early eighteenth century. as we mentioned above, maps from this period showed a distorted view of the road network by incorporating some errors in the location of cities and towns (planimetric problems), but this can be rectified if there are no spatial inconsistencies (topological problems). this issue can be more easily understood by looking at fig. 5, which shows the spatial distortions of the valk map by comparing the location of the main cities on the map with the location they should have. if the map were correctly drawn, the mesh of distortions would show a homogeneous structure. the errors are planimetric because the mesh, albeit deformed, is not twisted, which would imply the existence of topological problems. the biggest problem, shared by other maps of the time, arises in catalonia, which appears displaced too far to the east. other areas with planimetric errors are eastern andalusia, extremadura, galicia, and asturias. these errors in the knowledge of the reality of the territory could have influenced the design of the network by making politicians think that the distances between the cities were different from the real ones. however, the availability of real travel times in the itineraries could have mitigated these potential biases. to build the network, all the villages that appeared on valk's map and on the itineraries of anonymous and pontón (1705) and de b. f. and de grieck (1704) were accurately represented on a modern map and the sections of road that appeared in those sources were represented by straight lines. straight lines were used due to the limitations of the available information: the itineraries do not indicate intermediate points between towns, and the routes marked on the map were not reliable due to the underlying representation errors. this could lead to the introduction of serious errors in the assessment of the actual distance between towns, and consequently of transport costs, by not considering the slopes and curves of the roads. to reduce this problem as much as possible, it was assumed that the error between the actual distance and the straight-line distance was proportional to the roughness of the terrain covered. an approximation of the actual distance was thus obtained by multiplying the straight-line distance by a roughness index. unlike what is usual, our index, instead of being based on the general characteristics of the territory (riley et al. 1999), focuses on the trajectory followed by the road: areas with high roughness may have passes that are easily passable, a feature that surface-based measurements are unable to capture. the construction of the index is simple and intuitive: the number of contour curves crossed by the path divided by the length of the section. in this way, it is possible to overcome the lack of precise information about where the road really passes through. if the road is direct, it has greater slopes and cuts more contour curves; if, on the contrary, it makes a detour, it is longer but cuts fewer contour curves.
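to make the procedure concrete, the following minimal sketch (in python with networkx, not the authors' code) shows one way the roughness correction described above could be applied when building the road graph; the town names, contour counts and the calibration constant are illustrative assumptions, not data from the paper.

```python
# illustrative sketch only: building a road graph whose straight-line sections
# are corrected by a roughness index (contour crossings per kilometre).
import networkx as nx

def roughness_index(contour_crossings, straight_line_km):
    # as defined in the text: contour curves crossed by the section divided by its length
    return contour_crossings / straight_line_km

def approx_length_km(straight_line_km, index, calibration=0.1):
    # the text multiplies the straight-line distance by a roughness-based factor;
    # the exact calibration is not given, so 'calibration' is a hypothetical parameter
    return straight_line_km * (1.0 + calibration * index)

# hypothetical road sections: (town_a, town_b, straight_line_km, contour_crossings)
sections = [
    ("madrid", "guadarrama", 55.0, 40),
    ("madrid", "toledo", 67.0, 12),
    ("toledo", "ciudad real", 110.0, 30),
]

G = nx.Graph()
for a, b, km, crossings in sections:
    idx = roughness_index(crossings, km)
    G.add_edge(a, b, length_km=approx_length_km(km, idx), roughness=idx)

for a, b, data in G.edges(data=True):
    print(a, b, round(data["length_km"], 1), round(data["roughness"], 2))
```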
figure 6 shows the network finally considered. we want to stress that this is an important result of our research in itself, given the comprehensive historical reconstruction behind it.
caption of fig. 6: network of iberian roads at the beginning of the eighteenth century. map by the authors following the method detailed in the text. line thickness is proportional to the roughness index of the corresponding road. this is the network taken as a reference in the remainder of the paper.
to proceed with our long-term analysis, we need to assess whether or not the so-obtained network remains stable. in other words, there should be no major changes in the connectivity of the territory beyond the improvement of the road sections included in the infrastructure plan under analysis. confirming that this hypothesis is correct is difficult given the lack of detailed information on the network and its evolution, since until the final third of the nineteenth century there was no official national route map with the necessary level of detail (depósito de la guerra 1865). however, the fragmented information available seems to indicate that changes in second-order roads were of little importance, the network maintaining throughout the period a dense, homogeneously distributed, and low-quality character. further evidence supporting this conclusion comes from the use of road guides as a reference. in this regard, we observe, on the one hand, that the work of escribano (1758) was published continuously, and with hardly any modification, from the third edition of 1767 until, at least, 1823, which indicates that the network did not suffer great alterations in its layout, since the guide did not lose interest for the public. on the other hand, based on this guide, lópez (1809) published four editions in which he completed and corrected the information contained, but without including any new routes, except in the last edition, dated 1828, in which, in the case of catalonia, he incorporated fifty new roads, many of them paved. this confirms the stability of the network and also that the low road density of the peninsular periphery that appears in the guides is not real, but is due to the lack of information and the special interest in the capital's connections, as shown in their titles. interestingly, the comparison of the network presented by escribano and lópez with the anonymous work published in valencia in 1810 for military purposes (anonymous 1810) (annex illustration a2) shows quite similar, although not always coincident, information. there are discontinuities in some paths that disappear when the sources are integrated, which indicates that they are incomplete sources that can lead to errors if considered in isolation. it is important to mention at this point that, in terms of road cartography, there was an important void during the second half of the eighteenth century in which practically only post routes were included in the maps. only at the beginning of the nineteenth century did the needs of military logistics bring up new maps and itineraries. during the war of independence (lópez de vargas y machuca 1808; nantiat et al. 1810; picquet et al. 1810) and the campaign known as 'the 100,000 sons of saint louis' (guilleminot et al. 1823), foreign armies that operated in spain carried out cartographic works that represented an important improvement on the national maps (puyo et al. 2016).
these military maps have advantages over official ones such as those of de cabanes and gonzález salmón (1829) because they are based on direct inspection of the quality of roads instead of administrative classifications. the network shown by these maps is qualitatively similar to that observed in those made a century earlier during the war of succession: a dense network, fairly homogeneous in its distribution and with a generally low quality of roads, especially in mountainous areas. for this reason, we consider that the network we have inferred from the map of valk (1704) is a good approximation of the road system of the eighteenth century and the first part of the nineteenth century. before presenting our results, we must discuss a potentially important shortcoming of our work, namely the limitations of the available statistical sources that constrain us to consider domestic inland road traffic only. therefore, we are not including the effects on road transport flows of trade at the borders with france and portugal, nor of cabotage and long-distance maritime traffic. however, we consider that the biases introduced are smaller than what could be expected a priori due to several reasons. to begin with, although incomplete, part of the effect has been included in the analysis to the extent that, in the long term, port and border traffic has an impact on the size of cities due to increased commercial activities (lee 1998; galor and mountford 2008). on the other hand, despite the lack of reliable statistics, it can be accepted that spain's trade with france and portugal, and in general with the rest of the countries, was rather limited in the eighteenth and early nineteenth century (bairoch 1974; prados de la escosura 1986; moreira 2006). as for colonial trade, it was important in terms of value (cuenca-esteban 2008) but not so much in terms of tonnes transported. if compared to the traffic of national products, the impact on traffic in the network is small (anes 1983; delgado ribas 1986; oliva melgar 2004). also, the spanish merchant fleet at the beginning of the nineteenth century (lobeto lobo 1989; dubert garcía 2008) was of small size compared with the land transport capacity for the period before 1750 offered by madrazo madrazo (1984a). another relevant aspect of this issue is that the disadvantages of land transport compared to navigation were not as extensive as is usually claimed (carreras monfort and de soto 2010; scheidel 2014). a natural experiment on the real advantages of ship transport versus land transport in the eighteenth century is given by the canal de amposta. this channel, whose stated objective was to facilitate aragon's trade with america, was begun in 1778 and would have made the river ebro navigable for some 450 km by avoiding the problems of navigability of the delta. the project was abandoned, however, 5 years later, when it became clear that the potential savings in transport costs did not compensate for the high maintenance costs of just 10 km of the canal or for the alternative of ground portage (franquet bernis et al. 2017). as for cabotage, several authors highlight its limited relevance as a means of transport compared to land transport (frax 1981; martínez-galarraga 2013). finally, the few references to ship transport in the itineraries (anonymous and pontón 1705; escribano 1758; lópez 1809; anonymous 1810) and travel books (van der aa and jordan 1700; de laborde 1808) of the eighteenth and nineteenth centuries suggest that it was of little relevance.
having made in the previous section a precise description of the spanish road network at the arrival of the bourbon dynasty and of the investments made in the following 150 years, we can establish that there were no major alterations to the structure of the network. no new routes were created using large bridges or tunnels; the works were simply improvements to some of the preexisting roads. in quantitative terms, the interurban road network in the period analyzed was about 150,000 km long, of which only about 6000 km were improved, representing a percentage of approximately 4%. we can now use our approach to test the previously stated hypotheses about the road structure in spain. we begin by studying the h1a hypothesis relative to whether the radial design of the newly paved roads proposed by ward and materialized in the royal decree of 1761 was an entirely new model or, on the contrary, a simple evolution of the preexisting network. to test this hypothesis, an obvious key requirement is knowledge of the road structure inherited from the habsburgs. as discussed above, we have quite detailed information about the distribution of roads, thanks to the road maps, but we do not know how the traffic of people and goods was distributed among them, because until the second half of the eighteenth century the categories of roads (horseshoe, wheels, etc.) were rarely explained in itineraries or maps. an important exception is the map of spain included in the itinerarium orbis christiani (matal and hogenberg 1579), considered the first european atlas of roads (wertheim 1935; lang 1950; schuler et al. 1965), in which the main roads and some of the secondary roads can be seen separately. as it is a road atlas of continental level, it can be assumed that the few roads it incorporates are the most important ones. it has to be stressed that the concept of importance must be understood in terms of transport flows, since road quality was very low throughout the spanish network. the matal and hogenberg map (1579) presents a radial structure that differs widely from the decentralized network that appears in the villuga (1546) and de meneses (1568) itineraries mentioned earlier: instead, roads are organized around six main roads that emerge from the center of the peninsula and which coincide, to a large extent, with those that would be proposed almost two centuries later (ward 1762) as the basis for the design of the road network approved by charles iii in 1761 (de gregorio -marqués de esquilache 1761). this is an important result because it strongly suggests that the structure of the spanish road network did not emerge ex novo with the arrival of the bourbons after the war of succession. the network had already emerged at the end of the sixteenth century, without direct state intervention, as a consequence of the orography and the relative positions of the cities and towns (topological relations), and, as a consequence of this, of the movement of people and goods itself. in this way, the eighteenth-century road arbiters who defended a radial network only formalized a road structure that already existed de facto. to confirm the reliability of the network structure shown by matal and hogenberg (1579), we have compared it with the information available for two different periods (1544-1579) and (1597-1605). to this end we assume, as menéndez pidal (1951) and braudel (1976) did, that the most important roads should tend to appear in a greater number of sources.
in this way, the different categories of road sections are established according to the number of times they appear in the sources consulted. as can be seen from fig. 7, the results of our analysis confirm the map of matal and hogenberg (1579), showing a radial network with its center in toledo. the fact that all the available sources taken together grant toledo a central position in the sixteenth-century road network, despite its not having a permanent court or an especially large population, seems to be related to some clear locational advantages. these advantages would mainly come from its position in the center of the iberian peninsula, the physical characteristics of the territory, and its relative position concerning the rest of the large cities. it is worth noting that villuga's and meneses' itineraries also imply certain support for this radial view of the network if we consider that the sections that appear repeated on the most occasions are also the most important or frequented (menéndez pidal 1951) (cf. figure 8). therefore, our study establishes that toledo was the center of the network and, therefore, channeled a large part of the transport flows (braudel 1976).
figure caption: red, roads recorded in all three sources; green, roads recorded in two sources; blue, roads recorded in only one source. map by the authors based on the quoted references.
crucially, the establishment of the capital in madrid in 1561 would only mean a slight modification of the network in the successive decades, especially concerning the roads of extremadura and valencia, which would no longer pass through toledo. in any event, this change has no implication on the h1a hypothesis of our study: according to the evidence presented above and the comparison of the road network established in sect. 3 with the data about the preexisting network, we have to conclude that the radial structure of the spanish road system can be traced back to the middle sixteenth century, and therefore, it cannot be considered an ex novo design. let us now analyze hypothesis h1.b, which states that the changes in the road network are linked to the enlightenment vision of the state that came with the bourbon dynasty (madrazo madrazo 1984b; bel 2010) and which can only be sustained if changes in the network occurred in the subsequent years. figure 9 shows the changes in weighted road density between 1600 and 1766. as can be seen, two-thirds of a century after the establishment of the bourbon dynasty, the network remained virtually unchanged with respect to the situation under the habsburg dynasty, so hypothesis h1.b cannot be accepted: changes in the road network are not linked to the accession of the bourbons to the throne. most of the changes in the network must be placed after 1800, a period of which the enlightenment is not a part. in general, it can be seen that the modifications are not very important and are distributed throughout the territory. the most affected areas can be seen in graph c, where the changes are directly represented. the fact that the changes in the network density were basically local suggests that modifications of the road structure of spain impacted cities much more than regions. this is an important issue, also included in our hypotheses, that warrants more discussion in the next section. the next issue that we address in this paper is the claim that the road network was designed by the bourbons to improve the communications of madrid with the periphery rather than to activate the growth of the interior regions (ringrose 1972).
for a rigorous consideration of this question, we recall that there were three hypotheses to test. the first (h2.a) relates to the magnitude of the effects generated by the road plan: 'the newly paved roads produced important changes on the mobility of the territory.' the second (h2.b) refers to the homogeneity of the regional distribution of its effects: 'the positive effects of the construction of the newly paved roads were concentrated in a few regions, mainly madrid and the coastal regions, which was a comparative disadvantage for the inland regions' (ringrose 1972; herranz-loncán 2006). finally, the third (h2.c) considers the scope of the effects: 'the effects were mainly at the level of cities, not so much of the regions.' regarding the magnitude of investments in roads, the long reign of charles iii and the first part of the reign of charles iv were undoubtedly important, but not so much in terms of results. indeed, after the successful completion of the roads through guadarrama and from reinosa to santander by ferdinand vi, the period that continues until the end of the eighteenth century is disappointing, despite the significant resources spent from the salt tax. after half a century of intense activity, it had not been possible to build even half of the main roads, and most of the roads considered completed were impassable. in some cases, this was due to poor maintenance, and in others directly due to the application of inadequate construction techniques (betancourt 1868). such poor results are the consequence of three factors: the bad organization resulting from the lack of unity in management, the shortage at all levels of qualified labor, and the lack of financial resources, not only due to a lack of income but also to inadequate control of expenditure (dirección general de obras públicas 1856; reguera rodríguez 1999).
caption of fig. 10: evolution of the network of paved roads (1750-1850). gray, the six main roads proposed by ward in 1760, constituting the basis for the radial road structure. red, other roads that can be considered part of the same radial structure. green, transversal roads. dates of completion are approximate. map by the authors based on madrazo madrazo (1984b).
however, in 1799 a general inspectorate, first assumed by the count of guzmán and later by agustín de betancourt, was created, along with a school for the training of the members of the new corps of engineers of roads and canals of the kingdom. the results of these measures were extraordinarily positive, increasing construction activity to an average of more than 300 km per year until the beginning of the war of independence in 1808. this armed conflict caused great destruction of the infrastructures and a serious setback in their development that persisted, with different ups and downs, until the middle of the nineteenth century. in this scenario, one has to look only at the speed with which the network developed to conclude that the relevance of improvements in transport infrastructures was very limited in a large part of spain until well into the nineteenth century (cf. figure 10). for a more quantitative discussion, we consider a network more or less radial as a function of the percentage of transverse connections it contains. in fig. 10, the built sections corresponding to the six radial axes defined in the royal decree of 1761 have been marked in red and the rest in green.
it can be seen how, during most of the period analyzed, the kilometers built of transversal roads surpassed those of radial roads (table 1). this is confirmed by the radiality index, defined as the quotient between the built kilometers of radial sections and the total, which shows that the period of maximum radiality coincides with the central phase of the network expansion. as the expansion is completed, the radiality decreases. furthermore, these results overestimate the construction of radial roads insofar as the connection with the center was not achieved. considering, for example, the barcelona-la junquera section as radial is questionable when work had barely begun between madrid and barcelona. the greatest centralizing effect of construction on the network corresponds to the period 1790-1820. until 1790, most new roads were built on the periphery and were not interconnected, with the result that most of the benefits remained in the different regional areas.
caption of table 1: construction activity and radiality of the new paved road network. source: own elaboration. the data are an approximation for the reference years based on the data provided by madrazo madrazo (1984a).
only after 1820 did the closure of the links between the radial roads begin to redistribute the improvements in mobility territorially. we thus see that the transversality of the network is not invariable, but rather it depends on time and the frame of reference. the network, seen as a whole, is a fairly homogeneous mesh that cannot be considered radial. however, the situation changes if we consider the importance of connections for people, both in terms of transport flows and of the quality of the infrastructure. for the flows, a radial structure centered on the middle of the peninsula was observed from the sixteenth century onwards, although there were no notable differences in terms of the quality of the roads. with the bourbon road plan, the paving of the busiest roads was improved. the sequential implementation of the network generates temporary distortions in the relative quality of the roads that alter the flows and, therefore, the commercial advantages of the towns located in it. as we discussed in the introduction, such variations in advantages can be consolidated, remaining after the network is completed and the differences in infrastructure provision disappear. network analysis tools, applied to our rich dataset, make it possible to quantify the effect of these changes on the logistical advantages of the territories using two main indicators: accessibility and betweenness centrality (barthélemy 2011). we address accessibility first, defined as the closeness centrality of each city, which is, in turn, the inverse of the sum of transport costs to all other cities of the network through the optimal paths. being an unweighted measure, it has few information requirements but has the disadvantage of considering transport costs for all destinations equally rather than focusing on those with more intense relationships. an accurate assessment of the effect of road improvements on transport costs would require detailed knowledge of the percentage of different uses on each road section. unfortunately, this information does not exist, so it must be established based on indirect information. therefore, we have used an approximate value of 25% for the reduction in travel time on improved sections. we think this value fits a wide variety of situations.
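as an illustration of the two ingredients just described (accessibility as the inverse of summed travel times along optimal paths, and the 25% time reduction applied to improved sections), here is a minimal sketch in python with networkx; the toy travel times and the choice of improved section are ours, not data from the paper.

```python
# illustrative sketch only: accessibility as the inverse of the summed travel times
# to all other towns, before and after reducing the time of one improved section.
import networkx as nx

def accessibility(graph, node, weight="hours"):
    # sum of optimal travel times from 'node' to every other town, inverted
    costs = nx.single_source_dijkstra_path_length(graph, node, weight=weight)
    total = sum(t for other, t in costs.items() if other != node)
    return 1.0 / total if total > 0 else 0.0

G = nx.Graph()
G.add_weighted_edges_from([
    ("madrid", "guadarrama", 16), ("guadarrama", "valladolid", 30),
    ("madrid", "toledo", 14), ("madrid", "zaragoza", 60),
], weight="hours")

before = {n: accessibility(G, n) for n in G.nodes}

# "improving" a section: its travel time is reduced by 25% (the parameter discussed above)
H = G.copy()
H["madrid"]["guadarrama"]["hours"] *= 0.75

after = {n: accessibility(H, n) for n in H.nodes}
for n in G.nodes:
    print(n, round(before[n], 4), "->", round(after[n], 4))
```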
horse-drawn couriers can benefit greatly from improved road surfaces, while a heavy ox cart on many stretches would not appreciate the difference in speed because the limitation comes from the draft animals. a 75% time saving from improving a mountain pass becomes only 7.25% if the road section is 100 km and 2.25% if it is 300 km. the parameter used is a generous value that does not underestimate the effect of investments. in any case, the results are robust against variations of this parameter at least in the range 15-30%. figure 11 shows that there were practically no changes in the accessibility of the network until well into the nineteenth century, which means that there were no major changes in the transport costs of the different regions until the interconnection of the new roads made it possible to take full advantage of their possibilities. the development of the network of newly paved roads did not alter the advantage in transport costs that the center of the peninsula had shown since the sixteenth century. however, as is often the case, there are two sides to every coin. while we have seen that the design had practically no impact at the level of the territory, it did have quite some impact at the level of cities. in fact, from the point of view of regional competitiveness, the relative advantage within the region may be more important than the absolute advantage on a national scale.
caption of fig. 11: effect of network construction on accessibility. accessibility is in absolute values, with the same scale for all maps, and is calculated as described in the text. each point is a town (node) that is part of the road network. the color indicates the total cost in terms of time to go to all the other villages. blue is the lowest accessibility and yellow the highest.
caption of fig. 12: relative accessibility. the values reported in the maps are relative to the maximum value of the accessibility in each period. values are color coded, with dark green corresponding to very low values, green to low, yellow to average, orange to high and red to very high.
figure 12 shows the situation of the different towns and cities in relative terms, i.e., comparing them with the accessibility of the best-communicated nodes at any given time. in this way, it is possible to observe changes in the relative advantages of the territory. before the construction of the newly paved roads, the advantages were basically positional (being near the center of the territory), but subsequently they became dependent on proximity to the improved network. as we can see, these variations are eminently local and should be attributed to specific cities and towns rather than to regions as a whole. from a more quantitative perspective, variations in accessibility at the regional level can be estimated from weighted averages of the values of the constituent cities. however, the allocation of changes at the regional level could be biased, depending on whether the regions experienced changes in their territories during the period under consideration (arbia 2001). between 1700 and 1850, the territorial division of spain underwent important modifications, as witnessed by the political maps of the time (sanson and tavernier 1641; sanson 1700; homann 1750; lópez de vargas y machuca 1802; martín de lópez 1834), so it is necessary to choose an invariant regional reference (brown et al. 2005). this is an important issue as the results of the aggregations will depend on it.
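a minimal sketch of the aggregation step just mentioned (regional accessibility as a weighted average over the towns of each region, with the coefficient of variation as the dispersion measure) is given below; the population weights and accessibility values are invented, and the weighting scheme shown is an assumption for illustration only.

```python
# illustrative sketch only: population-weighted regional accessibility and the
# coefficient of variation across regions. all numbers are invented.
import numpy as np

towns = [
    # (region, population_weight, accessibility)
    ("new castile", 150, 0.90), ("new castile", 25, 0.70),
    ("catalonia",   100, 0.80), ("catalonia",   15, 0.55),
    ("galicia",      60, 0.45),
]

totals = {}
for region, pop, acc in towns:
    wpop, wacc = totals.get(region, (0.0, 0.0))
    totals[region] = (wpop + pop, wacc + pop * acc)

regional = {r: wacc / wpop for r, (wpop, wacc) in totals.items()}
values = np.array(list(regional.values()))
cv = values.std() / values.mean()  # coefficient of variation across regions

print(regional)
print("coefficient of variation:", round(cv, 3))
```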
the use of the current territorial division in the analysis does not seem appropriate as both local identities and internal borders affecting trade in the eighteenth and first half of the nineteenth century do not exactly correspond to this division (fig. 23). additionally, the consideration of madrid and its surroundings as a separate region from new castile is a recent development that did not occur until 1983. in view of this, the map we have selected as a regional reference for this work is that of sanson (1700), as it refers to the beginning of the period analyzed and is the map that martínez de grimaldo (1720) used to establish his post routes. in order to assign the villages included in the analysis to the different regions, the map was rectified to adjust it to the spatial reality of the territory. the data in table 2 show how the general accessibility of the territory in 1850 was significantly higher than in 1700. on a national level, the increase was close to 200%. regional differences remain within around 25% of the national average and do not show a clear geographical pattern. the table shows that the two castiles, valencia and granada see little change in their relative accessibility with respect to the national average. some coastal regions such as catalonia, andalusia, biscay, or guipuzcoa experience notable improvements while others such as asturias, galicia, or murcia worsen. the rest of the inland regions (león, navarra, and la rioja) show diverse behavior. the evolution of the variability of regional accessibility, measured by the coefficient of variation, shows a slight increase during the period studied, going from 0.14 to 0.18. these results are highly influenced by what happens in the small regions of the north that appear marked in the table with an asterisk (navarra, guipuzcoa, vizcaya, and la rioja). their small size means that the construction of new roads has a very strong effect on them, as the new roads affect a very high percentage of their territory. if we group them into a single region, the coefficient of variation becomes much more stable, going from 0.16 in 1700 to 0.17 in 1850. our analysis leads us to conclude that the verification of hypothesis h2.a regarding whether 'the newly paved roads produced important changes on the mobility of the territory' is ambiguous. the roads significantly improved the accessibility of the whole country. they brought about a very important change in the mobility of the territory, but with a fairly homogeneous spatial distribution, not generating major alterations in the relative position of the different regions. there were no changes that turned well-communicated areas into badly communicated areas or vice versa. clearly, this must be understood in a relative sense since they all improve. instead, hypothesis h2.b, stating that the positive effects of the construction of the newly paved roads were concentrated in a few regions, mainly madrid and the coastal regions, introducing a comparative disadvantage for the inland regions, can be rejected for several reasons. firstly, madrid did not exist as a region until the end of the 20th century, and therefore, until the institution of the provinces in 1833, it can only be spoken of as a city within the region of new castile. this region, like old castile, hardly sees its relative position altered in terms of accessibility.
it is true that the city of madrid and some of the surrounding towns experienced a significant improvement in their communications, but other important cities in new castile saw their relative position worsen in economic terms, as we will see later. secondly, and more importantly, the alleged comparative disadvantages faced by the center with respect to the peripheral regions are not general (some regions improve while others worsen) and on average they are practically non-existent. to more clearly illustrate this point, fig. 13 summarizes table 2 by classifying the regions according to the growth of their accessibility. regions that show significantly below-average variations are represented in blue or pale green, while regions that exceed the average are in red or orange, and those that have grown around the national average are colored in yellow. the map shows how the peripheral regions do not improve their overall accessibility with respect to those in the center. although there are regions that improve, there are also regions that get worse. the markedly local origin of the variations in the accessibility of the territory can be noted by comparing the color of the regions with that of the towns that make them up. in many regions, some areas behave very differently from the regional average and are therefore not representative of it. let us now turn to the second indicator introduced above, betweenness centrality, which sheds much light on the impact of the design on cities and shows that the effects of infrastructure investments are not always positive. intermediation or betweenness centrality is the number of optimal routes linking pairs of cities and towns that pass through a given town. this is a predictor of the likely traffic of people and goods passing through a village from other places (barthelemy 2018). as it is not a weighted measure, it does not allow us to determine exactly how much traffic passes through the village, but it does allow us to know how many different origins the travelers have if they follow optimal trajectories in their journeys. these data are important from an economic point of view as they are correlated both with the arrival of travelers and with the variety of products traded in the town and, therefore, with the breadth of its market.
figure caption: map by the authors based on madrazo madrazo (1984a) and the authors' calculations of the centrality; the size of the circles represents the value of their centrality (see also table 4 in the appendix).
although every time a road is improved it increases, to a greater or lesser extent, the accessibility of all the nodes of the network, the induced change in terms of betweenness centrality is a zero-sum game that involves improvements for some nodes and worsening for others. changes in traffic flow caused by investments improve the situation of certain nuclei to the extent that they receive more travelers and goods because they are located on a greater number of optimal routes. on the contrary, the nuclei that reduce their betweenness centrality lose economic relevance with the fall in traffic flows. to study this issue, and to have a reference of which were the most important roads at the beginning of the eighteenth century, the information from anonymous and pontón (1705), de b. f. and de grieck (1704), and the map of fer et al. (1701) was integrated in a similar way as was done to rebuild the sixteenth- and seventeenth-century networks.
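before moving on, the intermediation (betweenness) measure introduced above can be illustrated with a minimal networkx sketch on a toy weighted graph; the towns and travel times below are invented, not data from the paper.

```python
# illustrative sketch only: betweenness centrality counts, for every town, the
# shortest (optimal) paths between other pairs of towns that pass through it.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("madrid", "zaragoza", 60), ("zaragoza", "barcelona", 56),
    ("madrid", "valencia", 64), ("valencia", "barcelona", 60),
    ("madrid", "toledo", 14),
], weight="hours")

# unnormalized counts, using travel time as the edge weight for the optimal paths
bc = nx.betweenness_centrality(G, weight="hours", normalized=False)
for town, value in sorted(bc.items(), key=lambda kv: -kv[1]):
    print(town, value)
```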
as expected, both maps show great similarities. only the better connectivity with portugal in the southwestern area and the consolidation of the north transversal axis santiago-barcelona stand out as differences. the analysis of the changes in betweenness centrality confirms that already in 1700 the structure of potential traffic in the network was radial and with its center in madrid (cf. fig. 14). the six main axes were complemented by eight others: two of them running north-south, the silver way and the one that follows the mediterranean coast from barcelona to alicante; another two, east-west, the duero axis and the león-zaragoza axis; and four transversal ones, burgos-cáceres, burgos-cuenca-el campillo, zaragoza-teruel-sagunto and albacete-ciudad real. these fourteen axes totaled around 7000 km, of which the six radials accounted for approximately half. the strong correlation between the classification of roads obtained from historical sources for 1700 and the betweenness centrality map built from an unweighted road network provides strong support for the above hypothesis that in the eighteenth century the differences between roads were not due to differences in quality but to differences in traffic. starting from this situation in 1700, the development of infrastructures in the period 1700-1850 meant greater commercial activity in the cities of the radial axes and, consequently, a decline in those located on the complementary axes, except those of the mediterranean, which maintained their potential traffic practically unchanged, and those of the santiago-barcelona axis, except in the central section león-logroño. the key point here is that these changes happen at the territorial level too, and nodes that were very important within a region also lose power or relevance versus other nodes that are becoming local hubs. table 3 shows the evolution of the concentration of betweenness centrality by regions. infrastructure investments meant that potential traffic was increasingly concentrated in the small number of cities through which the new roads ran. at the national level, the new roads led to an increase in the concentration of betweenness centrality of around 20%, from 0.67 to 0.81. the reduction in the coefficients of variation of the gini index indicates a tendency to homogenize the concentration of centrality among the regions, suggesting that changes were primarily at the city level rather than at the regional level. indeed, the distinct effects that the development of the road network had on different towns are depicted in fig. 15. the plots show the change of accessibility and betweenness centrality for some of the main cities in the period 1700-1850. the upper panel shows the evolution of accessibility and betweenness centrality for the main cities as improvements are made to the road network. it is interesting to see how accessibility always improves but betweenness centrality can be suddenly reduced as a result of changes in traffic flows. the central panel in fig. 15 summarizes the previous plot and demonstrates that, in relative terms, the city that benefited the most from the work on the roads in this period was barcelona, followed at a certain distance by madrid, granada, and cordoba. instead, for the rest of the cities, betweenness centrality remained basically constant or was even reduced, the improvements being restricted to their accessibility.
additional information is provided in the lower panel of fig. 15, where towns connected through radial roads are depicted as green symbols and the rest are shown in blue (madrid is highlighted in red), with symbol size indicating the distance to madrid. it can be seen from this plot that the towns that profited most from the road-building activity in the period are those that are far from the center, and of course those lying along radial roads. the benefit comes mostly from the increase in accessibility, which means that their distance to the towns in the network decreases in general, rather than from betweenness centrality, as being in the periphery makes it very difficult for them to be on many shortest paths. again, the case of barcelona is special in this respect, with a large increase in betweenness centrality due to the improvements along the mediterranean coast as well as in the connection with zaragoza, which made it more advantageous to travel from the east to madrid through barcelona and zaragoza.
caption of fig. 15: improvements to the road network and change in accessibility and betweenness centrality (1700-1850). upper panel, time evolution: circles filled with stripes represent the starting point in 1700, diamonds correspond to the situation after the first investments in 1750, triangles to 1780, squares to 1820, and circles to 1850; colors correspond to different towns as indicated in the plot. middle panel, relative changes in centralities in the period considered; the color code is the same as in the upper panel. lower panel, same as the previous panel but towns are shown in green if they are included in radial roads and in blue otherwise (madrid is shown in red as the center); the shape of the symbols indicates broad geographical location: triangles in the center of the peninsula, roughly castilla, squares in the east (crown of aragón and murcia), diamonds for the south (andalucía and extremadura), and circles for the north (galicia, asturias, país vasco and navarra); the size of the symbol represents the distance to madrid.
as noted above, a very important point to make in connection with this analysis is that variations in betweenness centrality are not always positive, as shown by the upper panel of fig. 15. indeed, they can even be initially of one sign and change when a new development of the network takes place. for instance, teruel saw how the new roads reduced the potential traffic that passed through it until 1820, when the initiation of the zaragoza-sagunto axis allowed it to recover a good part of the lost centrality. the opposite is the case of valladolid. until 1820, its position was slightly improved, but the completion of the radial connections with burgos and benavente meant a drastic reduction in its positional advantages. even more dramatic are the cases of toledo and cuenca, clear examples of how infrastructure development can lead cities to logistical irrelevance. toledo had already lost a good part of its positional advantages with the opening of the direct road to the south by aranjuez, but the development of the road to andalusia meant a further reduction of its centrality as it was built. the closure of the valencia road by the utiel-requena route in 1820 meant a serious deterioration of cuenca's centrality, which had already been worsening since the beginning of the development of the network. the construction in 1850 of a large part of the connection with the barcelona road only meant a reduced recovery of potential traffic, as it only managed to redirect the traffic generated around guadalajara.
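the regional concentration figures reported in table 3 above rest on a gini coefficient computed over town-level betweenness values; the following generic sketch (not the authors' code, with invented inputs) shows how such a concentration measure behaves.

```python
# illustrative sketch only: gini coefficient of betweenness centrality within a
# region, as a measure of how concentrated potential traffic is among its towns.
import numpy as np

def gini(values):
    # standard formulation for non-negative values sorted in ascending order
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0.0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return (2.0 * (ranks * x).sum()) / (n * x.sum()) - (n + 1.0) / n

# hypothetical betweenness values for the towns of one region, before and after
# the improvements concentrate potential traffic on a few of them
print(round(gini([0.5, 0.6, 0.7, 0.8, 0.9]), 2))   # fairly even distribution
print(round(gini([0.1, 0.1, 0.2, 0.3, 3.0]), 2))   # traffic concentrated in one town
```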
this sensitivity of betweenness centrality to new construction is an aspect that is scarcely considered by decision-makers and by citizens at large, not only in the eighteenth and nineteenth centuries but also today: while road improvement always leads to an increase in accessibility (travel times can only decrease from road improvements), its effect on betweenness centrality, and hence on the relevance of a town for many important aspects, e.g., as a trade hub, is highly non-trivial and can be positive or negative. the greater relevance of betweenness centrality versus accessibility when taking advantage of new roads is highlighted by ghani et al. (2016) for india. although the new roads improve accessibility in a fairly homogeneous way in the territory, most of the benefits are obtained by the towns located on the roads. this is further shown in fig. 16, which gives a general view of how the gradual development of the network altered the betweenness centrality of the villages as the different phases took place. these results are consistent with those recently obtained by other authors (berger and enflo 2017) in the sense that transport infrastructure development does not necessarily generate territorial convergence and that transitory shocks generate path dependence in the location of economic activity (bleakley and lin 2012; jedwab and moradi 2016; jedwab et al. 2017). finally, let us consider hypothesis h2.c, namely that 'the effects were mainly at the level of cities, not so much of the regions.' in light of our analysis of centralities, we have to conclude that hypothesis h2.c is confirmed. although the improvements in terms of accessibility were spread out territorially quite homogeneously, the improvements in the road network significantly modified the trade flows, which made the cities and towns along the roads gain importance to the detriment of other cities. this phenomenon is particularly intense at the beginning of the development of the network, when the number of improved routes is low, because this generates a more intense concentration of traffic. as the concentration increases, the advantages of these cities are reduced from a logistical point of view but can be maintained thanks to path dependence. temporary advantages can lead to cities developing permanent advantages such as population size, the establishment of fairs, or simply the development of travel habits that keep traffic on the same routes despite the existence of new alternatives. this means that it may be appropriate from an economic policy point of view to alter the optimum sequence in the construction of the network by prioritizing less efficient sections if a territorial rebalancing is to be achieved (pablo-martí and sánchez 2017). figure 17 summarizes the effects of road network improvements on inequalities at the regional level using a β-convergence analysis (barro and sala-i-martin 1992). in brief, β-convergence is a procedure that allows us to quantify the degree of convergence in accessibility. the horizontal axis shows the regional accessibility in 1750 as a deviation from the national average and the vertical axis shows the average growth of accessibility in the period 1750-1850, also as a deviation from the national average. the low value of the regression slope (0.08) and its scarce explanatory power (r2 = 0.0016) suggest that the construction of the network did not imply important variations in the interregional differences.
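the β-convergence exercise of fig. 17 can be sketched as a simple regression of accessibility growth on the initial accessibility level, both expressed as deviations from the national average; the numbers below are invented and the fit uses a generic least-squares routine, not the authors' estimation.

```python
# illustrative sketch only: beta-convergence regression of accessibility growth
# (1750-1850) on initial accessibility (1750), both as deviations from the mean.
# a clearly negative slope would indicate convergence across regions.
import numpy as np

initial = np.array([0.20, -0.10, 0.05, -0.15, 0.00, 0.12, -0.08])
growth  = np.array([0.010, 0.030, 0.020, 0.000, 0.020, 0.015, 0.025])

slope, intercept = np.polyfit(initial, growth, 1)
pred = slope * initial + intercept
r2 = 1.0 - ((growth - pred) ** 2).sum() / ((growth - growth.mean()) ** 2).sum()
print("slope:", round(slope, 3), "r2:", round(r2, 4))
```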
the potential balancing effect of lower accessibility growth in inland regions (in yellow in the graph) was compensated by very different behaviors among coastal regions. in addition, the size of the points indicates the variation in intraregional concentration, measured by the gini index of the intermediation centrality. last, but not least, we consider whether the radial network design that was chosen in the second half of the eighteenth century, and especially in the first half of the nineteenth century, was based on economic rationality or whether, on the contrary, it was only intended to achieve certain political and administrative objectives. to this end, the empirical network will be compared with the one that would have arisen through decentralized decision-making processes that sought greater efficiency in the transport system. it is not appropriate to speak of a unique criterion of optimality for transport networks, but of many, since the definition of optimality depends both on the objectives set and on the decision-makers. thus, the final assessment will be determined by the decision-making process and, in general, by the system of aggregation of preferences used. it is therefore not possible to determine whether the transport system with its center in madrid, which began in the eighteenth century, is the optimal one: the proper way to frame this question is to ask whether it is among the most reasonable choices for a wide range of assumptions. as the choice of the criteria to optimize and the decision-making system can be very different and influenced by many subjective judgments, we decided to develop a general model that makes it possible to evaluate different options for transport networks using a counterfactual analysis to obtain valid conclusions in the widest possible range of situations. this research aligns with the rapidly growing application of counterfactual analysis using computer simulations in the study of transport networks, even if as of today it is still incipient. such approaches have been the subject of two major criticisms in the past, but these criticisms are, to a large extent, no longer valid (casson 2009), as we discuss in what follows. it has often been argued that computer simulations cannot be detailed enough for the counterfactual to allow a reliable comparison with reality. however, the current availability of information sources and calculation capacity makes it fully possible to carry out this task. projects such as aimsun (barceló et al. 2005) or matsim (horni et al. 2016) show how current simulation systems allow a much closer approach to reality than traditional methods. on the other hand, it has been pointed out that simulation does not allow us to get close enough to reality since it ignores important physical or technological limitations that can seriously alter the analysis. this criticism does not take into account one of the most positive characteristics of this approach, namely that simulation, unlike other methodologies, allows for the incorporation of new restrictions as they become necessary without having to substantially alter the conceptual framework. basically, those two objections are largely superseded by models having enough granularity and versatility that can be run efficiently on the big computing infrastructures available today. this makes computer simulation one of the most appropriate ways to analyze the formation and evolution of transport networks, especially in environments with scarce information such as historical analysis.
8 an alternative approach could be the use of bio-inspired techniques, which have experienced notable development in recent years. thus, nakagaki et al. (2001) showed the computational capabilities of the slime mould physarum polycephalum for solving optimal-path problems. the results obtained showed a structure comparable to the real one in terms of efficiency, fault tolerance, and transport cost. this methodology has been widely applied to the analysis of transport networks and lately also to historical analysis (strano et al. 2012; adamatzky 2012; evangelidis et al. 2015). the evolution of the spanish road network has also been analyzed with slime mould (adamatzky and alonso-sanz 2011). however, this methodology shows serious problems for direct application to practical cases that go beyond mere conceptual frameworks, given the high sensitivity of the results to variations in the diffusion of the chemo-attractants that determine the development of the plasmodium. to avoid these limitations, biosimulation is being replaced by computer simulation using cellular automata (tsompanas et al. 2016).

once we chose to work with computer simulations, the next step was to establish reasonable criteria for the development of the road network and to observe whether the results arising from our approach under each of the criteria differ significantly from those observed in reality. for this purpose, we rely on the optimal transport network generation model developed by pablo-martí and sánchez (2017), which allows the selection of the sections to be improved from a preexisting network so as to achieve the maximum benefit from a global perspective, based on the individual preferences of each of the agents, the optimization criteria, and the voting system. the model starts from the state of the roads at a given time and builds the improved network by choosing one segment at a time according to the different criteria. for the application of interest here, namely the optimality of the improvements of the spanish road network during the eighteenth and nineteenth centuries, we use as the preexisting network the one built from the information in the map of valk (1704) and the itineraries of anonymous and pontón (1705) and de b. f. and de grieck (1704), as mentioned above, made up of 2343 independent road sections. although the model allows any type of decision-making agent to be established, in this case the agents are the 1815 towns considered, i.e., the nodes of the network. the weight of the influence of each node in the selection processes was based on transport costs and population (except in the unweighted model, where only transport costs were used). for the largest towns, 12.5% of the total, the demographic information provided by correas (1988) was used as a proxy for this variable. for all other cases, class markers were used according to the categorization used by valk (1704) in his map.

fig. 18 optimal networks depend on the city they refer to: calculation of the optimal networks for three cities (barcelona, seville, and valladolid) using the procedure described in the text, based on the dijkstra (1959) algorithm.

9 other approaches, such as those based on route optimization on cost maps (voigtlaender and voth 2019) or on network generation (faber 2014; prignano et al. 2019), have the disadvantage of only taking local optimization into account, so they have difficulty in assessing whether intermediate nodes should be incorporated even if they increase the length of the route, or in considering weightings for the nodes.
the calculation of the optimal routes is done by applying the dijkstra algorithm (dijkstra 1959). each node explores the different alternative routes that link it with the other nodes and selects the itineraries that imply the lowest transport cost. by way of example, fig. 18 presents the network that three cities would choose if the decision were made for their benefit alone. note how tree structures are generated in all cases, distributed radially and starting from each of the nodes. however, these optimal structures have few sections in common. this is why the construction of a reasonable network that aggregates the preferences of all the nodes is carried out below through a voting system using different viewpoints. in what follows, and for the sake of simplicity, the total cost of transport is expressed in the model in terms of time, although it would be possible to use other criteria, such as monetary ones. as for traffic, an estimate is made using gravity models that take into account both the sizes of the pair of nodes considered and the squared distance that separates them. based on this information, in every period each node identifies the section whose improvement would be most convenient for its accessibility to the rest of the nodes, and the section whose improvement would produce the largest increase in its traffic. to this end, each node calculates the variation in total transport time and in traffic that would result from the improvement of each of the not-yet-improved sections, choosing, in the first case, the one that reduces transport time the most and, in the second case, the one that increases traffic the most. thus, agents seek to maximize the benefits derived from the improvement of the network without taking the conduct of the other agents into account, which is why strategic or collusive behavior is not considered in the model.

the generation of the improved network is done through the following iterative process (a code sketch of one voting round is given after the list of voting systems below). first, each node distributes its vote among all the sections in proportion to the benefit each would bring, as discussed above. next, the stretch that has obtained the most support is selected and improved. improving a section in the model means that the travel time between the two towns on the section is reduced by 25 percent, as in sect. 4.2. once this has been done, the agents vote again among the remaining sections, successively selecting sections to improve until there are no more or the established budget is exhausted. this being a sequential process, the first investments made affect subsequent decisions, in a clear example of path dependence. we want to emphasize that this does not necessarily mean that the process is not optimal, because a one-shot design would hardly be carried out simultaneously, so different sections could remain disconnected and the advantages of the network would be diminished. even if the result were optimal, it would be greatly affected by variations in the budget or changes in the project: if the planned network were not completed for some reason or, on the contrary, had to be extended, a suboptimal network would be obtained. the sequential approach ensures that the design is optimal for the level of development and investment achieved. to assess the optimality of the network with various economic objectives in mind, three voting systems were established: a. equal voting power.
all towns have the same voting power, and their objective is simply to reduce the travel time to the rest of the towns. for this purpose, each town calculates the travel time to each of the other villages for each of the network improvement alternatives. once the advantages have been estimated, the towns distribute their votes among the different alternatives in proportion to the expected benefits. unlike other works (adamatzky and alonso-sanz 2011; equipo urbano 1972; vanoutrive et al. 2018), this procedure involves an intense calculation effort, as each town must estimate the optimal routes and the consequent transport costs or gravity models for each of the other 1814 towns, for each of the 2343 sections that can be improved, for each of the 300 investments, and for each of the three voting systems. this means a total of 1.1 × 10¹² optimal routes, transport cost calculations, and gravity-model evaluations. this eminently territorial selection criterion would lead to an optimal network for the transmission of news in a country without a telecommunications infrastructure. however, it would be inappropriate if the objective were to promote commercial activity between large population centers, as this criterion does not take into account the importance of traffic between nodes. it would be ideal, for example, for defining the post roads.

b. population-weighted power. in this case, towns and cities also try to reduce travel time to the rest of the localities. the voting process is similar to the previous case, but the voting power of each town is weighted by its population. this criterion has a more limited economic applicability than the other two: commercial activity is taken into account, but only partially, because the negative effect of the distance between towns is not included. it is therefore more relevant in areas such as cultural dissemination than in trade and transport. the distribution by place of publication of the books in a library, for example, depends more on the economic and cultural importance of the cities than on their distance.

c. traffic-oriented voting. towns try to maximize the traffic of goods by reducing the cost of transport on the most used road sections. it would be the most appropriate system to capture the economic aspects of the network. using gravity models, each town estimates, for each road section, the effect that its improvement would have on the town's trade with the rest of the towns.

it is clear, once again, that the structures that are rationally optimal from an economic point of view are multiple (kolars and malin 1970; rietveld and van nierop 1995) and depend on the objectives set by the intervening decision-makers and on the system chosen to reach consensus. figure 19 shows the results obtained using the three criteria indicated above for 300 improved sections, representing 12.8% of the initial network.

fig. 19 optimal transport networks for the eighteenth century according to the three voting procedures discussed in the text: initial preferences (above) and optimal sequential development (below). the maps at the top show the initial preferences of the populations according to the three criteria; the 300 sections with the highest preference are marked, in order of importance, in red, blue, and finally black. the final result is quite representative of the initial preference structure but shows a higher level of coordination.

for a video illustrating the process of network formation, cf. https://vimeo.com/362753400/d23a0dcabb.
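the following is a stylized sketch of one round of the sequential procedure described above (it is not the pablo-martí and sánchez (2017) implementation): each town evaluates every candidate section with dijkstra shortest paths, distributes its vote in proportion to the travel-time saving the improvement would give it, and the section with most votes is improved by cutting its travel time by 25 percent. the graph, town names, and populations are illustrative, and only the accessibility-based criteria (with and without population weighting) are sketched, not the gravity-based traffic criterion:

```python
import networkx as nx

# illustrative road graph: edges carry travel times; towns carry populations
roads = nx.Graph()
roads.add_weighted_edges_from([
    ("madrid", "toledo", 4.0), ("toledo", "sevilla", 20.0),
    ("madrid", "burgos", 10.0), ("burgos", "irun", 8.0),
    ("madrid", "zaragoza", 12.0), ("zaragoza", "barcelona", 14.0),
])
population = {"madrid": 150, "toledo": 25, "sevilla": 90,
              "burgos": 15, "irun": 10, "zaragoza": 40, "barcelona": 110}

def total_time(g, town):
    """sum of dijkstra shortest-path travel times from one town to all others."""
    dist = nx.single_source_dijkstra_path_length(g, town, weight="weight")
    return sum(t for other, t in dist.items() if other != town)

def one_voting_round(g, weight_by_population=False):
    """distribute each town's vote over candidate sections in proportion
    to the travel-time saving it would obtain, then pick the winner."""
    votes = {edge: 0.0 for edge in g.edges()}
    for town in g.nodes():
        base = total_time(g, town)
        savings = {}
        for u, v in g.edges():
            trial = g.copy()
            trial[u][v]["weight"] *= 0.75   # candidate improvement: -25% time
            savings[(u, v)] = max(base - total_time(trial, town), 0.0)
        total = sum(savings.values())
        if total == 0:
            continue
        power = population[town] if weight_by_population else 1.0
        for edge, s in savings.items():
            votes[edge] += power * s / total
    return max(votes, key=votes.get)

winner = one_voting_round(roads, weight_by_population=True)
roads[winner[0]][winner[1]]["weight"] *= 0.75   # improve the chosen section
print("section improved in this round:", winner)
```

in the full model this evaluation would be repeated over the 300 investment rounds and restricted to not-yet-improved sections, and the traffic-oriented variant would replace time savings by gravity-model traffic gains.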
the three simulations offer very similar results, indicating that the network obtained is fairly robust with respect to variations in decision criteria. it is therefore an adequate network from the point of view of people's mobility, news transmission, and freight traffic. the great similarity between the networks generated by simulation and the network at the beginning of the eighteenth century can be assessed by comparing them, as shown in fig. 20. this also allows us to conclude that both hypothesis h3.a ('whether the radial network designed by the bourbon monarchs of the second half of the eighteenth century can emerge from a distributed decision-making process in which all the populations of the nation are taken into account in a non-discriminatory manner') and hypothesis h3.b ('whether the military, political, or administrative criteria outlined by various authors are compatible with maximizing commercial traffic') are confirmed. these results provide evidence that the radial design followed by the spanish road system during the habsburg dynasty, and which was later used in the eighteenth century to establish the post roads and the bourbon paved roads, was considerably efficient, despite having emerged in an uncoordinated and spontaneous way. the main difference observed between the reality and the simulations is the lack of a direct connection with france via irun. this postal route, the mala, does not appear in the simulation because, as discussed above, the model applied only considers national connections. the port and border effects were not included in the weightings of the peripheral populations because reliable information on the magnitude of these effects was not available. this design of the transport network has remained largely valid, as evidenced by its subsequent use both in the deployment of the rail network in the nineteenth century and in the road and high-speed train networks of the twentieth century. concerning the questions raised in the introduction about the nature of the dynamic process, we can conclude that the development of the network coincides to a large extent with the country's territorial constraints. that is, it can largely be explained by the initial parameters. insofar as the construction criterion was eminently pragmatic, the road sections showing greater use were improved, in an endogenous and non-intervened manner. however, the development of the network in response to these determining factors was not the only alternative. another network design could have been chosen, or the construction could have followed a different rhythm or sequencing, which would have led to different results, especially for the relative importance of cities and transport flows; the process can therefore be considered a genuinely path-dependent one.

in this paper, we have presented a combined methodology incorporating gis techniques, network science, and agent-based modeling, exemplified through its application to understanding the evolution of the road network in spain in the period 1700-1850. as a first result, our thorough review of the available literature, taking advantage of the digitalized maps from the period, has led to a description of the roads of that period with unprecedented detail and accuracy, particularly of the situation at the beginning of the eighteenth century (fig. 6 above). building on that data collection and processing, we have critically assessed several claims that have been widely discussed in the literature without reaching a definitive conclusion.
in this section, we summarize our conclusions regarding those claims and close with a brief discussion of other possible applications of our methodology. firstly, it has been claimed by many historians and politicians that the accession of the bourbons to the spanish throne was accompanied by the implementation of a radial design of the road network superseding the less centralist one existing at the beginning of the eighteenth century. instead, our analysis does not support this claim, because as we have seen the successive bourbon kings and officials largely adopted the sixteenth-century design they found in 1700, and we have demonstrated that this design had already a marked radial character. secondly, another commonly held view assumes that the new monarchy introduced changes in the road network that altered the development of the different territories of spain, promoting the capital and forgetting peripheral regions. we find that the bourbon infrastructure policy neither generated significant changes in the transport network nor induced major alterations in territorial balance or privileged the region where the capital was located. during the eighteenth century improvements in the network were sparse and unconnected, so the effects were mainly local. in the first half of the nineteenth century, construction accelerated, almost simultaneously connecting many of the previously unconnected sections. therefore, the potential impulse to the capital due to the implementation of an improved radial structure was accompanied by the construction of relevant transverse connections, resulting in a more or less homogeneous improvement of conditions and commerce along the whole of spain. thirdly, building on a network description of the spanish roads as we have mapped them, and using agent-based simulations with different criteria to choose which parts of the network should be improved and when, we have also shown that the design followed both in the network of post roads and in the newly paved roads was carried out with reasonable economic criteria, both from the transport of mail and the mobility of people and the transport of goods. indeed, a similar structure for the roads arises from the three largely different criteria we have considered as possible counterfactuals, indicating that the actual road network existing by the end of the period considered is compatible with many objective economic optimization criteria. in closing, we want to stress that the increasing availability of historic material in digital form, and particularly of maps amenable to gis analysis, makes our approach widely applicable in the study of socioeconomic historic developments. as we have seen, having an in-depth knowledge of the evolution of the process of interest allows establishing to which degree it is path-dependent, as well as the main effects of the historical sequence of interventions. thus, rigorous, evidence-based approaches involving a detailed review of the new available primary sources, including the cartographical ones, can shed light on the origins of many critical infrastructures that cannot be understood without taking into account their path dependence evolution. a network science analysis based on centralities points to different contexts, in a multilevel structure, where the consequences of the decisions taken can even lead to opposite results, including towns getting better positioned for commerce in a national context while their regions see their options curtailed. 
of course, other network science tools may be applied to a well-established historical sequence to look for the effects of other types. the use of all those diagnostic tools, combined with agent-based modeling flexible enough to incorporate different decision-making procedures, may prove very useful in dealing with the future development of infrastructures and related systems. our methodology provides a guide to take many arguments out of the political arena and into socioeconomic reality in a manner that can be used to benefit society at large. the fact that we have considered the specific case of roads in the eighteenth century spain should not hide the applicability of the method; in particular, we envisage that in the context of cities (batty 2013) our approach can be very fruitful. the available information is incomplete and spatially biased (madrazo madrazo 1982) . specifically, the map used as a regional reference for all the periods is that of (sanson 1700) as it refers to the beginning of the period analyzed and is the map he used (martínez de grimaldo 1720) to establish his post routes. to assign the villages included in the analysis to the different regions, the map was rectified to adjust it to the spatial reality of the territory (figs. 21, 22, 23, table 4 ). the main changes in the regional division of spain during the period according to the cartography of the time: • sanson and tavernier (1641): extremadura as an autonomous region. • sanson (1700): the same division as that of 1641 but extremadura is now part of new castile except for the north of caceres which is included in old castile. • chatelain (1706) : extremadura is an autonomous region, although it is the only region without its coat of arms. the north still belongs to old castile. • homann (1750): extremadura integrates its northern zone. the world's colonization and trade routes formation as imitated by slime mould rebuilding iberian motorways with slime mould do rural roads create pathways out of poverty? evidence from india a (1710) weg-wyzer der legertogten in spanje en portugaal = theatrum martis in hispania et portugallia darinne verzeichnet seindt alle die wege, so gehen ausz 71. den vornembsten städten von teutschland, 17t. von niderlandt, 39. von frankreich, 29 von italia, und 31. von hispania el antiguo régimen: los borbones. alianza editorial historia económica y pensamiento social: estudios en homenaje a diego mateo del peral tanto por los caminos de ruedas, como por los de herradura. francisco brusòla, valencia anonymous, pontón p (1705) guía de caminos para ir, y venir por todas las provincias más afamadas de españa, francia, italia y alemania. añadida a la regla general para saber adonde se escrive los dias de correo modelling the geography of economic activities on a continuous space competing technologies, increasing returns, and lock-in by historical events los ferrocarriles en españa. 
1844-1943-ii market access and structural transformation: evidence from rural roads in india on the use of geographic information systems in economic history: the american transportation revolution revisited myths of the european network: construction of cohesion in infrastructure maps geographical structure and trade balance of european foreign trade from 1800 to 1970 microscopic traffic simulation: a tool for the design, analysis and evaluation of intelligent transport systems betweenness centrality the new science of cities revised-path dependence infrastructure and nation building: the regulation and financing of network transportation infrastructures in spain (1720-2010) locomotives of local growth: the short-and long-term impact of railroads in sweden agustín de betancourt_ año de 1803. revista de obras públicas portage and path dependence en que estan descritos todas las veguerias el mediterráneo y el mundo mediterráneo en la época de felipe ii. tomo 1. fondo de cultura económica file:carte _despa gne_et_de_portu gal,_compr enant _les_route s_de_poste _et_autre s_de_ces_ deux_roiau mes path dependence and the validation of agentbased spatial models of land use thinking through networks: a review of formal network methods in archaeology the effects of infrastructure development on growth and income distribution (english). policy, research working paper historia de la movilidad en la península ibérica: redes de transporte en sig the roman transport network: a precedent for the integration of the european mobility the efficiency of the victorian british railway network-a counterfactual analysis carte historique et geographique des royaumes d'espagne et de portugal divises selon leurs royaume et provinces. chez l'honoré & chátelain libraires novo itinerario delle poste per tutto il mondo di ottaviano codogno. luogortenente del corriero maggiore di milano 150 años de historia de los ferrocarriles españoles poblaciones españolas de más de 5000 habitantes entre los siglos xvii y xix. boletín de la asociación de demografía histórica infrastructure and regional growth in the european union statistics of spain's colonial trade, 1747-1820: new estimates and comparisons with great britain books ?id=1yrha aaaca aj&dq=camin os&hl=es&sourc e=gbs_ simil arboo ks clio and the economics of qwerty gonzález salmón mg (1829) mapa itinerario de los reinos de espagne et portugal : divisés en ses principales parties ouroyaumes copia del real decreto expedido para hacer caminos rectos, y sólidos en españa, que faciliten el comercio de unas provincias á otras ls?id=7shko 0biwr kc&rdid=book-7shko 0biwr kc&rdot=1 itinéraire descriptif de l'espagne, et tableau élémentaire des différentes branches de l'administration et de l'industrie de ce royaum. h. nicolle, paris de mayerne tt (1605) sommaire description de la france memorial õ abecedario de los mas principales caminos de españa. ordenado por alonso de meneses correo ji (1753) methodo general para todas la vias de españa, por cuyo medio logra el público de la mayor utilidad, consiguiendo por poco porte la conducción de todos los generos, y frutos de unas provincias a otras. domingo fernánde de arrojo network analysis to model and analyse roman transport and mobility. find limit limes comercio colonial y crecimiento económico en la españa del siglo xviiila crisis de un modelo interpretativo mapa itinerario militar de españa why did spanish regions not converge before the civil war? 
agglomeration economies and (regional) growth revisited a note on two problems in connexion with graphs dirección general de obras públicas (1856) memoria sobre el estado de las obras públicas en españa en 1856 maps in minds: reflections on cognitive mapping. harper and row series in geography comercio y tráfico marítimo en la galicia del antiguo régimen, 1750-1820 adding a new dimension to historical research with gis itinerario español o guía de caminos en donde estan exactamente observadas todas las rutas de postas y caxas de correo estienne c (1552) les voyages de plusieurs endroits de france & encores de la terre saincte, d'espaigne, d'italie & autres pays : les fleuves du royaume de france simulación de una red de transportes. el caso de los ferrocarriles españoles slime mould imitates development of roman roads in the balkans trade integration, market size, and industrialization: evidence from china's national trunk highway system tratado legal, y político de caminos públicos y possadas dividido en dos partes distribution of english textiles in the spanish market at the beginning of the 18th century puertos y comercio de cabotaje en españa problemática del río ebro en su tramo final: informe acerca de los efectos sobre el área jurisdiccional de la comunidad de regantes -sindicato agrícola del ebro raißbüchlin trading population for productivity: theory and evidence urban spatial structure, suburbanization and transportation in barcelona suburbanization and highways in spain when the romans and the bourbons still shape its cities path creation as a process of mindful deviation highway to success: the impact of the golden quadrilateral project for the location and performance of indian manufacturing ferrocarriles y cambio económico en españa (1855-1913): un enfoque de nueva historia económica distant tyranny: markets, power, and backwardness in spain big data' in economic history historical gis as a foundation for the analysis of regional economic growth: theoretical, methodological, and practical issues la reducción de los costes de transporte en españa (1800-1936) the spatial distribution of spanish transport infrastructure between 1860 and 1930 factors influencing the location of new motorways: large scale motorway building in spain market potential and firm-level productivity in spain highways and productivity in manufacturing firms hispaniae et portugalliae regna : felicissimo nuper adventu classico caroli iii. austriaci, hispaniarum and indiarum regis &c. fortunata exhibente io the multi-agent transport simulation matsim espagne divisée en tous ses royaumes et principautés où sont exactement recueillies et obervées [sic] toutes les routes des postes d'espagne, sur les memoires des courriers majors de madrid. jaillot the permanent effects of transportation revolutions in poor countries: evidence from africa history, path dependence and development: evidence from colonial railways, settlers and cities in kenya studying cartographic heritage: analysis and visualization of geometric distortions lópez jimeno c (ed) manual de túneles y obras subterráneas population and accessibility: an analysis of turkish railroad the augsburg travel guide of 1563 and the erlinger road map of 1524 the socio-economic and demographic characteristics of port cities: a typology for comparative analysis? cartografía y comunicaciones en los documentos de la biblioteca nacional. 
siglos xvi al xix la marina mercante decimonónica madrid: se hallará este con todas las obras del autor y las de su hijo en madrid plazuela del angel n o 19 qto lópéz de vargas y machuca t (1808) carte des royaumes d'espagne et de portugal óu l'on a marqué les routes de poste, et les limites des diverses provinces et gouvernements : pour servir à l'intelligence des opérations militaires. paris: e. collin, graveur editeur nueva guía de caminos para ir a desde madrid a todas las ciudades y villas más principales de españa y portugal, y también para ir de unas ciudades á otras. primera. madrid: gomez fuentenebro y compañçia ed) paisaje, cultura territorial y vivencia de la geografía. libro homenaje al profesor tres arbitristas camineros de mediados del siglo xviii portazgos y tráfico en la españa de finales del antiguo régimen el sistema de comunicaciones en españa el sistema de comunicaciones en españa reformas sin cambio. el mito de los caminos reales de carlos iii the regional dimension in european public policy pensando en el transporte path dependence and regional economic evolution reglamento general expedido por su magestad en 23 de abril de 1720 exposicion de las operaciones geometricas hechas por orden del rey n.s. phelipe v. en todas las audiencias reales situadas entre los limites de francia y de portugal para acertar a formar una mapa exacta y circonstanciada de toda la españa market integration and regional inequality in spain the determinants of industrial location in spain un estudio de nueva geografía económica e historia económica the long-term patterns of regional income inequality in spain an assessment of its territorial coherence 1579) itinerarium orbis christiani. https ://bilds uche.digit ale-samml ungen .de/ index .html?c=viewe r&bandn ummer =bsb00 01647 5&pimag e=3&v=150&nav=&l=en los orígenes de la política ferroviaria en españa (1844-1877) reportorio de todos los caminos de españa (hasta agora nunca visto) juan villuga 1546 1697) il buratino veridico. o vero inztruzzione generale per chi viaggia the 4d future of economic history: digitally-driven data design infrastructure and the shaping of american urban geography qué hacer con españa. del capitalismos castizo a la refundación de un país a new and exact map of spain & portugal : divided into its kingdoms and principalities &c with ye. principal roads and considerable improvements, the whole rectifyd according to ye. newest observations. i has hsr improved territorial cohesion in spain? an accessibility analysis of the first 25 years: 1990-2015 la importancia del mercado español en el comercio exterior portugués revista de historia contemporánea theatre de la guerre en espagne et en portugal path finding by tube morphogenesis in an amoeboid organism the political consequences of spatial policies: how interstate highways facilitated geographic polarization the road to inequality: how the federal highway program polarized america and undermined cities a new map of spain and portugal : exhibiting the chains of mountains with their passes, the principal and cross road, with other details requisite for the intelligence of military operations impacts on the social cohesion of mainland spain's future motorway and high-speed rail networks the historical roots of economic development el monopolio de indias en el siglo xvii y la economía andaluza: la oportunidad que nunca existió. 
lección inaugural del curso académico improving transportation networks: effects of population structure and decision making policies roads and cities of 18th century france old myths and new realities of transport infrastructure assessment: implications for eu interventions in central europe una serie anual del comercio exterior español (1821-1913) spain's historical national accounts: expenditure and output modelling terrestrial route networks to understand inter-polity interactions (southern etruria, 950-500 bc) cartographier et décrire la péninsule ibérique: l'héritage militaire français deliciae hispaniae et index viatorius, indicans itinera, ab vrbe toleto, ad omnes in hispani civitates et oppida. ursel: cornelij sutorij mapa de españa dividido en prefecturas y divisiones militares complexity and spatial networks: in search of simplicity. advances in spatial science urban growth and the development of transport networks: the case of the dutch railways in the nineteenth century los "apuntamientos" del padre martin sarmiento sobre la construcción de la red radial de caminos reales en españa a terrain ruggedness index that quantifies topographic heterogeneity los transportes y el estancamiento económico en españa (1750-1850) los viajes «a la ligera» : un medio tradicionalmente rápido de transporte, desbancado por el ferrocarril the upswing of regional income inequality in spain (1860-1930) the post for diuers partes of the world to trauaile from one notable citie unto an other, with a description of the antiquitie of diuers famous cities in europe exploring landscapes through modern roads: historic transport corridors in spain sanson n (1691) les monts pyrenees, ou sont remarqués les passages de france en espagne. chez h. laillot, joignant les grands augustins, aux deux globes sanson n (1700) l'espagne divisée en tous ses royaumes et principautés. jaillot, paris sanson n, tavernier m (1641) carte generale d'espagne et de tous les royaumes der älteste reiseatlas der welt the shape of the roman world: modelling imperial connectivity economic networks: the new challenges a correct map of spain and portugal monopolio naturale" di autori postali nella produzione di guide italiane d'europa, fonti storico-postali fra cinque e ottocento. archivio per la storia postale 14-15:19-80 el siglo de hacer caminos-spanish road reforms during the eighteenth century. a survey and assessment an inquiry into the nature and causes of the the wealth of nations. w, straham and t books ?id=ndlsa aaaca aj&print sec=front cover &hl=es&sourc e=gbs_ge_summa ry_r&cad=0#v=onepa ge&q&f=false vie physarale: roman roads with slime mould paris: chez mme. vve. turgis, rue s. jacques n o 16 et à nouvelle carte politique et itinéraire de l'espagne et du portugal : avec la nouvelle division des cartes en 51 provinces los orígenes del capitalismo en españa. banca, industria y ferrocarriles en el siglo xix apuntes para una historia del transporte en españa. los viajes por la posta en el siglo xviii y en los primeros años del siglo xix apuntes para una historia del transporte en españa. los últimos años de los transportes hipomóviles las calzadas romanas y los caminos del siglo xvi historia de los caminos de españa vol. i hasta el siglo xix 1704) regna hispaniarum atque portugalliae novissima et accuratissima tabula regnorum hispaniae et portugalliae. leiden zeer gedenkwaardige en naaukeurige historische reis-beschrijvinge, door vrankrijk nature's order? 
questioning causality in the modelling of transport networks el reportorio de todos los caminos de españa: hasta agora nunca visto en el ql allará qlquier viaje q quiera andar muy pvechoso pa todos los caminantes visscher n (1704) regnorum castellae veteris, legionis et gallaeciae principatuum q biscaiae et asturiarum accuratissima descriptio visscher n (1705) regnorum castellae novae andalusiae granadae valentiae et murciae accurata tabula highway to hitler economic evolution: an inquiry into the foundations of the new institutional economics en que se proponen varias providencias, dirigidas à promover los intereses de españa, con los medios y fondos de necesarios para su plantificación. segunda. madrid: joachin ibarra the evolving interstate highway system and the changing geography of the united states economic history goes digital: topic modeling the journal of economic history der erste europäische straßenatlas publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations the allocation of effects at the regional level is complicated by the changes in the territorial division that occur during the period. the territorial division of the early eighteenth century was chosen for the calculations. the use of the current territorial division does not seem appropriate as both local identities and internal borders affecting trade do not exactly correspond to this division. the explicit incorporation of toll gates would have been desirable but is not possible because key: cord-342579-kepbz245 authors: galaz, victor; österblom, henrik; bodin, örjan; crona, beatrice title: global networks and global change-induced tipping points date: 2014-05-01 journal: int environ agreem doi: 10.1007/s10784-014-9253-6 sha: doc_id: 342579 cord_uid: kepbz245 the existence of “tipping points” in human–environmental systems at multiple scales—such as abrupt negative changes in coral reef ecosystems, “runaway” climate change, and interacting nonlinear “planetary boundaries”—is often viewed as a substantial challenge for governance due to their inherent uncertainty, potential for rapid and large system change, and possible cascading effects on human well-being. despite an increased scholarly and policy interest in the dynamics of these perceived “tipping points,” institutional and governance scholars have yet to make progress on how to analyze in which ways state and non-state actors attempt to anticipate, respond, and prevent the transgression of “tipping points” at large scales. in this article, we use three cases of global network responses to what we denote as global change-induced “tipping points”—ocean acidification, fisheries collapse, and infectious disease outbreaks. based on the commonalities in several research streams, we develop four working propositions: information processing and early warning, multilevel and multinetwork responses, diversity in response capacity, and the balance between efficiency and legitimacy. we conclude by proposing a simple framework for the analysis of the interplay between perceived global change-induced “tipping points,” global networks, and international institutions. global environmental change can unfold rapidly and sometimes irreversibly if anthropogenic pressures exceed critical thresholds. such nonlinear change dynamics have been referred to as ''tipping points''. 
1 evidence indicates that many global environmental challenges such as coral reef degradation, ocean acidification, the productivity of agroecosystems, and critical earth system functions such as global climate regulation display nonlinear properties that could imply rapid and practically irreversible shifts in bio-geophysical and social-ecological systems critical for human well-being (steffen et al. 2011; rockström et al. 2009; lenton et al. 2008) . the potential existence of ''tipping points'' represents a substantial multilevel governance challenge for several reasons. first, it is difficult to a priori predict how much disturbances and change a system can absorb before reaching such a perceived ''tipping points'' (scheffer et al. 2009; scheffer and carpenter 2003) , a fact that seriously hampers, and even undermines preventive decision making (galaz et al. 2010a ; barrett and dannenberg 2012) . it has also been argued that even with certainty about the location of a catastrophic ''tipping point,'' present generations still have an incentive to avoid costly preventive measures, and instead are likely to pass on the costs to future generations (gardiner 2009 , see also brook et al. 2013; schlesinger 2009 ). second, the transgression of ''tipping points''-such as those proposed for coral reef ecosystems due to ocean acidification-can have social-ecological effects that occur over large geographical scales, creating difficult ''institutional mismatches'' as policy makers respond too late or at the wrong organizational level (walker et al. 2009 ). third, institutional fragmentation has been argued to seriously limit the ability of actors to effectively address perceived ''tipping point'' characteristics due to inherent system uncertainties, information integration difficulties, and poor incentives for collective action across different sectors and segments of society (biermann 2012; galaz et al. 2012) . research initiatives such as the earth system governance project have made important analytical advances the last few years ). this progress is particularly clear in research areas such as institutional fragmentation, segmentation, and interactions; the changing influence of non-state actors in international environmental governance; novel institutional mechanisms such as norm-setting and implementation; and changing power dynamics in complex actor settings (young 2008; biermann and pattberg 2012; oberthür and stokke 2011) . the mechanisms which allow institutions and state and nonstate actors to adapt to changing circumstances (by some denoted adaptiveness) are also gaining increased interest by scholars (biermann 2007:333; young 2010) . another stream of literature elaborates a suite of multilevel mechanisms that seem to be able to ''match'' institutions with the dynamic behavior of social-ecological systems (dietz et al. 2003; folke et al. 2005; cash et al. 2006; galaz et al. 2008; pahl-wostl 2009) . despite an increased interest, however, few empirical studies exist that explicitly explores the capacity of international actors, institutions, and global networks to deal with perceived ''tipping point'' dynamics in human-environmental systems. as an example, recent syntheses of critical global environmental governance challenges in the anthropocene identify important ''building blocks'' for institutional reform, yet do not elaborate governance mechanisms critical for responding to nonlinear environmental change at global scales kanie et al. 2012) . 
this is troublesome considering that the human enterprise now affects systems with proposed nonlinear properties at the global scale (steffen et al. 2011; rockström et al. 2009 ) and that the features of global environmental governance-i.e., institutions and patterns of collaboration-required to address ''tipping point'' changes are likely to be very different from those needed to harness incremental (linear) environmental stresses (folke et al. 2005; duit and galaz 2008) . while some of the challenges explored here have parallels to attempts to understand the features of international responses to international social nonlinear ''surprise'' phenomena such as financial shocks and other global risks (e.g., claessens et al. 2010; oecd 2011) , we are particularly interested in the ''tipping point'' dynamics created by coupled human-environmental change. there is an increasing need to empirically explore and theorize the way state and non-state actors perceive, respond to and try to prevent large-scale abrupt environmental changes. this is a challenging empirical task for several reasons. first, because the definitions of ''thresholds,'' their reversibility, and their scale remain contested issues. there is currently an intense scientific debate on the most appropriate way to define and delineate the correct spatial (local-regional-global) and temporal (slow-fast) progression of thresholds (for example, see brook et al. 2013 , as well as debates about the ''2-degree'' climate target, hulme 2012 in knopf et al. 2012) . second, because the number and types of potential ''tipping points'' relevant for the study of biophysical systems at global scales are too large to be quantifiable, this makes it difficult to draw general conclusions from a limited set of cases studies. third, while the governance challenges associated with ''tipping points'' have been studied extensively for local-and regional-scale human-environmental systems such as forest ecosystems, freshwater lakes, wetlands, and marine systems (see plummer et al. 2012 for a synthesis), governance challenges associated with global scale or global change-induced tipping point dynamics are seldom explored despite their identified urgency (young 2011: 6) . in this study, we analyze current attempts by three global networks to address perceived ''tipping points'' induced by global change, here exemplified by the combined impacts of loss of marine biodiversity and ocean acidification, pending fisheries collapse, and infectious disease outbreaks, respectively. while these ''tipping points'' are in many ways different, they have one important thing in common: they are all globally occurring phenomena where the interplay between technological change, increased human and infrastructural interconnectedness, and continuous biophysical resource overexploitation creates possible ''tipping point '' dynamics (see definition below) . this is what we here denote as ''global change-induced tipping points.'' it should be noted that the precise dynamics of the ''tipping points'' in each of the three cases differ and unfold at multiple scales ranging from local to regional and global (see table 1 for details). despite this diversity, the phenomena studied here all pose similar detection, prevention, and coordination challenges for social actors at multiple scales of social organization (as elaborated by, e.g., adger et al. 2009; pahl-wostl 2009; galaz et al. 2010a; young 2011) . 
hence, they provide interesting and preliminary insights into the sort of multilevel governance challenges associated with nonlinear global change-induced change. our ambition is twofold: firstly, we investigate how these global networks attempt to identify, respond to, and build capacity to address global change-induced ''tipping points.'' we do this by identifying and empirically illustrating four working propositions that theoretically seem to be essential for understanding adaptiveness in global networks in the face of ''tipping point'' changes. secondly, we explore the interplay between perceptions of human-environmental ''tipping points,'' international institutions, and collaboration patterns here operationalized as global networks. this perspective allows for a more cohesive view that combines both exogenous (i.e., perceived ''tipping point'' dynamics) and endogenous (i.e., information sharing mechanisms) factors (sensu young 2010). we will return to this point in the end of the article. note that, our ambition is not to ''test'' the proposed features in a statistical sense, but rather to combine in-depth analysis of the case studies, with tentative suggestions that bring to light a number of intriguing and poorly explored issues. while the selection of cases does not provide the generalizability of larger-n studies, it does allow us to examine how the theoretically derived propositions play out in the three different cases, and also provides an opportunity to explore around the potential interplay between variables and causal mechanisms. our ambition in the longer term is that the analysis presented here can underpin more systematic cross-case comparisons (cf. gerring 2004) . in some cases, this comparison could be done with ''tipping points'' which play out only in the social domain such as financial crises or responses to transnational security threats (e.g., world economic forum 2013). we define ''global networks'' as globally spanning information sharing and collaboration patterns between organizations, including governmental and/or non-governmental actors. each individual participating organization is not necessarily global, but the network as a whole is essentially international and aims to affect what is perceived as global-scale problems (c.f. monge and contractor 2003) . our analytical approach is relational as it focuses on broad patterns of collaborations among actors in global networks, and on how patterns of collaboration and modes of operation relate to these networks' abilities to address ''tipping points.'' it is particularly concerned with multiactor agency to understand changes in collaboration over time (emirbayer and goodwin 1994) . hence, the approach here differs and is complementary to other approaches such as ''epistemic communities'' and ''regime complexes.'' in the first case, the functions and membership of global networks are more diverse than those for epistemic communities as the networks of interest in this paper span beyond knowledgebased collaborations (haas 1992 , see however cross 2012 for a wider definition). the analysis here also has some similarities to the studies of ''regime complexes'' (orsini et al. 2013) ; however, our emphasis is not on the interplay between regimes or institutions (principles, norms, rules, decision-making procedures), but rather on the constellation, interplay, and functions that emerge between actors from a network perspective. 
while we do not quantitatively measure collaboration patterns, nor quantitatively assess governance outcomes or relationships between such outcomes and various characteristics of the studied global networks, our network perspective complements previous analyses of international environmental regimes (e.g., young 2011) as it examines the evolution and function of globally spanning network collaboration patterns and their embeddedness within more formal rules (cf. ansell 2006). various definitions for ''tipping points'' have been identified in the literature. its theoretical origins can be traced to dynamical systems theory in the 1960s and 1970s, and has influenced a wide set of disciplines the last decades. earth system scientists, for example, have explored nonlinear phenomena at different scales, with different degrees of reversibility and alternately defined these as ''tipping elements,'' ''switch and choke points,'' or ''planetary boundaries'' at a global scale (lenton et al. 2008; rockström et al. 2009b; steffen et al. 2011) . ecologists on the other hand find increasing evidence of ecosystem changes that are not smooth and gradual, but abrupt, exhibiting thresholds with different degrees of reversibility once crossed (scheffer et al. 2009; scheffer and carpenter 2003) . while ''tipping points'' thus refer to a variety of complex nonlinear phenomena (including the existence of positive feedbacks, bifurcations, and phase transitions with or without hysteresis effect), they also play out at different scales (from local to global) in very complex, poorly understood , and contested ways (brook et al. 2013) . lastly, we believe there is an irreducible social component in identifying, elaborating, and organizing around the existence of ''tipping points.'' the role of mental models, cognitive maps, belief systems, and collective meaning making in decision making has a long history in the study of agency in politics (benford and snow 2000; campbell 2002) and natural resource management (lynam and brown 2011) . these aspects related to perceptions clearly also play a role as scientists, governments, and other actors discuss and sometime disagree on their possible existence and appropriate responses (galaz 2014:16ff) . the important connection between mental models and goal-oriented action are causal beliefs-perceptions of the causes of change-and about the actions that can lead to a desired outcome (milkoreit 2013:34f) . it should be noted, however, that this study explores the processes of collaboration that occur after social actors implicitly have agreed upon causal beliefs associated with perceived ''tipping points.' ' we thus recognize the multifaceted and complex nature of the term. however, as our goal is to understand how global networks address a diversity of ''tipping points'' induced by global change, we opt for a definition that relates to phenomena exhibiting nonlinear and potentially irreversible change processes in human-environmental systems, which require global responses (see table 1 ). this definition is intended to capture global changeinduced phenomena, which due to nonlinear properties such as synergistic feedbacks (c.f. brook et al. 2013:2) have the potential to affect large parts of the world population (c.f. lenton et al. 2008 ). our definition is akin to the ''tipping elements'' identified by lenton et al. 
(2008) , but we chose to make a distinction for two reasons: (1) because we explicitly acknowledge that the dynamics that create the tipping points in focus comprise both natural and social processes; (2) the tipping points addressed here span beyond those of relevance for the climatic system in focus in lenton et al. (ibid) . the empirical analysis includes case studies of three global networks ( fig. 1 ) that with varying degrees of outputs and outcomes (sensu young 2011) explicitly attempt to respond to global change-induced ''tipping points'' (table 1) . by ''explicitly,'' we mean that they all acknowledge and mobilize their actions around the notion of potentially harmful global change-induced ''tipping points.'' while this selection does not capture cases where ''tipping points'' exist, but global networks fail to materialize (c.f. dimitrov et al. 2007) , the ambition has been to include cases that reflect a diversity of global change-induced ''tipping point'' dynamics, with global network responses as common features. as noted by mitchell (2002) , the complexity of human-environmental systems makes outcomes of international environmental collaboration hard to measure directly, since these often have indirect and non-immediate impacts (mitchell 2002:445) . to operationalize the analysis, we thus focus on the outputs and outcomes produced by the actors as they attempt to (a) anticipate, (b) prevent, and (c) respond to perceived ''tipping points.'' again, our ambition is not to ''test'' the propositions, but rather to use the case studies to empirically explore how global networks attempt to respond to ''tipping point'' challenges, pose novel important questions, and explore how to potentially pursue these questions systematically. ocean acidification is likely to exhibit critical tipping points associated with rapid loss of coral reefs, as well as complex ocean-climate interactions affecting the oceans' capacities to capture carbon dioxide. the global arena for addressing these interrelated aspects of marine governance is characterized by a lack of effective coordination among the policy areas of marine biodiversity, fisheries, climate change, and ocean acidification. this has triggered collaboration between a number of international organizations within the global partnership for climate, fisheries and aquaculture (henceforth pacfa). currently, this initiative includes representatives from the food and agriculture organization (fao), the united nations environment programme (unep), worldfish, the world bank, and 13 additional international organizations , see also fig. 1 ). the need for early and reliable warning of pending epidemic outbreaks has been a major concern for the international community since the mid-nineteenth century. early warning and coordinated responses are critical in order to avoid transgressing critical epidemic thresholds at multiple scales (heymann 2006) . a multitude of networks with different focus and functions (such as surveillance, laboratory analysis, and pure information sharing) have emerged the last two decades as a means to secure early warning and response capacities across national borders. these networks (henceforth global epidemic networks) also span across organizational levels and include among other the world health organization (who), the world organization for animal health (oie), the food and agricultural organization (fao), the red cross, doctors without borders. 
these networks facilitate responses to epidemic emergencies of international concern by providing early warning signals, rapid laboratory analysis, information dissemination, and coordination of epidemic emergency response activities on the ground (galaz 2011).

the data used in this article are drawn from previously published studies (see ''appendix 2''), but complemented with additional documents and restructured with a focus on the four working propositions presented below. all cases used the same methodology, combining document studies, simple network analysis, and semi-structured interviews with key international actors (a total of about 65 interviews for all cases). interviewees were selected strategically to reflect an expected diversity of interests, resources, and roles in the network collaboration of interest. the empirical material was restructured and complemented to capture (1) the perceived risk of ''tipping points'' with cross-boundary effects, including major uncertainties, threshold mechanisms, and time frames of relevance; (2) the global nature of the problem, including aspects of institutional fragmentation; (3) the emergence of the network and its evolution over time; (4) the current monitoring and coordination capacity of the network; and (5) the most important outputs and outcomes of the network. a detailed, fully referenced and structured compilation of the cases is, for space reasons, available as ''appendix 2.''

several attempts have been made to identify how governments, organizations, and international actors attempt to maintain not only institutional stability but also flexibility in the face of unexpected change (duit and galaz 2008). studies of crisis management (boin et al. 2005), network management and ''connective capacity'' in governance (edelenbos et al. 2013), the robustness of social and organizational networks (dodds et al. 2003; bodin and prell 2011), complexity leadership (balkundi and kilduff 2006), and the governance of complex social-ecological systems (ostrom 2005; folke et al. 2005; pahl-wostl 2009) all provide important insights into how social actors try to respond to unexpected changes. based on these streams of literature, we extract four working propositions to guide the analysis of the empirical material. these propositions should not be viewed as all-embracing principles for ''success,'' but rather as suggested and empirically assessable governance functions at play as international actors aim to detect, respond to, and prevent the implications of ''tipping point'' environmental change. the propositions also allow a concluding reflection on the interplay between international institutions, global change-induced ''tipping points,'' and global networks.

as a number of research fields have explored, ranging from studies of social-ecological systems (holling 1978; folke et al. 2005; galaz et al. 2010b), crisis management (boin et al. 2005), and global disaster risk reduction (van baalen and van fenema 2009) to high-reliability management of complex technical systems (pearson and clair 1998), the capacity to continuously monitor, analyze, and interpret information about changing circumstances seems to be a prerequisite for adaptive responses. these information processing capacities are identified as critical by, e.g., dietz et al.
( :1908 in their elaboration of the features of governance, which support adaptiveness to environmental change; van baalen's and van fenema's (2009) analysis of global network responses to epidemic surprise; and by boin et al. (2005:140ff) and their synthesis of information processing capacities, which allow state and non-state actors to respond to crises (see also comfort 1988; dodds et al. 2003) . we therefore propose that the ability of global networks to anticipate and reorganize in the face of new information and changing circumstances will depend on access to information about system dynamics and the ability to interpret these to facilitate timely coordinated response. global networks and global change-induced tipping points 197 as a number of studies in a diverse set of research fields suggest, responding to changing circumstances often requires drawing on the competences and resources of actors at multiple levels of social organization, often embedded in different organizational networks. for example, folke et al. (2005) explore the need to build linkages between a diversity of actors at multiple levels to be able to successfully deal with nonlinear social-ecological change (see also pahl-wostl 2009); pearson and clair's (1998:13) synthesis of organizational crises identify resource availability through external stakeholders as key for successful responses (see also boin et al. 2005 ; van baalen and van fenema 2009); galaz et al. (2010a:12) synthesis also indicate that cross-level and multiactor responses are key for overcoming institutional fragmentation in responses to abrupt human-environmental change; and edelenbos et al. (2013) explore the features of ''connective capacities,'' which allow coordinating actors to constructively connect to actors from different layers, domains, and sectors. therefore, we propose that global networks need to build a capacity to coordinate actors at multiple levels and from different networks as they attempt to respond to potential ''tipping points'' of concern. it also seems reasonable that responses could focus on producing either a set of outcomes (regulation, polices, or supporting infrastructure) or outputs (behavioral changes with direct impacts on the system of interest, including coordinated action) (young 2011). the predictability of the timing and location of rapid unexpected changes is often limited. hence, it is difficult to know beforehand exactly where and what kind of resources may be needed. as several scholars across disciplines have noted, developing and maintaining a diversity of resources is one way to help prepare for the unexpected. for example, moynihan's (2008:361) analysis of learning in organizational networks indicates that the maintenance of actor and resource diversity in networks is fundamental for reducing strategic and institutional uncertainty (see also koppenjan and klijn 2004) . bodin and crona (2009) explore the role of social networks in natural resource management and note that actor diversity is key for not only effective learning, but also provides a fertile ground for innovation. social-ecological scholars such as folke et al. (2005:449) , low et al. (2003) , and dietz et al. (2003) also make a strong case for institutional and actor diversity as a prerequisite for coping with environmental change and as a factor that allows actors to quickly ''bounce back'' after shocks. 
these proposals bear resemblances with ideas inspired by cybernetic principles about the need to understand the impacts of ''requisite variety'' in governance, that is, the informational, structural, and functional redundancies that emerge as the result of inter-organizational collaboration in governance networks (jessop 1998) . it also has parallels existing studies about complexity leadership in network settings (balkundi and kilduff 2006; hoppe and reinelt 2010) . to simplify the analysis, however, we focus less on the role of individual leaders, their cognition, and strategies, but rather on common structural properties and aggregated behavior across the cases. hence, we propose that global networks need to secure long-term access to a diversity of resources (human and economical), organizational forms (e.g., non-governmental to international organizations), and types of knowledge (e.g., scientific and context dependent) to be able to secure a capacity to monitor and respond in the longer term. these resources could comprise both physical and infrastructural investments (e.g., joint publications and monitoring systems), and immaterial resources, like information databases and access to expertise through relations to a diversity of actors. this third proposition does not suggests that the network necessarily maintains the resources itself (which may be difficult given its inherently distributed nature), but rather that it maintains the capacity to access a portfolio of resources of different kinds, and the collective ability to adaptively use these. a last feature that has been identified across disciplines as critical for multiactor responses seems to be legitimacy. ''legitimacy'' is a multifaceted concept with multiple proposed sources (downs 2000) . here, we refer to legitimacy as the ''generalized perception or assumption that the actions of an entity are desirable, proper, or appropriate'' (suchman 1995, 574) . as young (2011) summarizes it, the ''[m]aintenance of feelings of fairness and legitimacy is important to effectiveness, especially in cases where success requires active participation on the part of the members of the group over time.'' legitimacy relates not only to the ability of actors to follow appropriate rules or procedures (input legitimacy) but also to deliver expected results in the face of perceived urgent issues of common interest (output legitimacy) (van kersbergen and van waarden 2004) . moynihan (2008) also notes that trust based on perceived legitimacy plays a key role in the coordination of multiple networks by reducing strategic uncertainty (pp. 357). we therefore propose that the perceived legitimacy of the main coordinating actor(s) will be critical for the operation of global networks in this context. with these propositions to guide our analysis, we now turn to examine how the three cases of global networks attempt to anticipate, prevent, and respond to global changeinduced ''tipping points.'' 6 summary of results 2 6.1 information processing and early warnings interestingly, all cases feature the role of a few centrally placed actors responsible for continuous data gathering and exchange of information. this is both related to monitoring of specific aspects of a system (e.g., number and location of epidemic outbreaks, or reports of iuu fishing), as well as to the compilation and analysis of other types of knowledge exchange (e.g., policy documents, technical guidelines, and scientific information). 
the mobility of iuu fishing vessels and associated products requires network members to be able to instantly coordinate action in order to stop illegal activities. the ccamlr-iuu network has developed mechanisms for obtaining, processing, and sharing of information related to iuu fishing operations and trade flows, where the ccamlr secretariat serves as an important network hub (ö sterblom and bodin 2012). this network benefits from several well-established compliance mechanisms developed over time, including an electronic catch documentation scheme and information collected from satellite monitoring of vessel activities . non-state actors also contribute to monitoring by reporting suspected vessel sightings or trade flows (ö sterblom and bodin 2012). all information collected is reviewed annually by ccamlr, where consensus decisions are taken to blacklist suspected vessels or impose other sanctions. similar features can be identified for information processing in global epidemic networks. severe problems with national reporting of disease outbreaks spurred a new generation of internet-based monitoring systems in the mid-1990s, including the global public health intelligence network (gphin) hosted by health canada, combining internet datamining technologies and expert analysis, as well as global and moderated epidemic alert e-mail lists and platforms such as promed and healthmap.org (galaz 2011) . these systems have vastly increased the amount of epidemic early warnings processed by key international organizations such as the who and the food and agricultural organization (fao). these organizations are key actors through their capacities to continuously assess, verify, and disseminate incoming epidemic alerts. the rich flows of information and elaborate verification mechanisms in these two networks are very different from those identified for pacfa. information flows in this network are considerably less formalized and instead center on establishing dialogue between centrally placed individuals who act as points of contact in different international and regional organizations. the importance of trust-based communication between coordinating actors is important for pacfa as alliances are forged to link international negotiations, scientific knowledge and local knowledge, and field projects . here, therefore, information processing is not about monitoring a particular system variable (i.e., number of infected cases or reports of illegal fishing vessels), but rather on achieving coordination benefits for its members. hence, while information processing is important in all cases, it differs in content and function between the networks. the fact that the first two networks respond to well-defined problems (inception of illegal vessels in one region and the isolation of disease outbreak) and the last to more complex global challenges (i.e., global interlinked bio-geophysical dynamics) seems to make an important difference. while multilevel governance is a common feature of environmental polity in general (winter 2006) , governing ''tipping points'' requires a capacity among centrally placed a examples of detection, coordination, and apprehension of vessels or corporations suspected of iuu fishing. the vessel ''viarsa'' was detected in the southern ocean by the australian coast guard and suspected of illegally fishing of patagonian toothfish. 
after the longest maritime hot pursuit in maritime history (7,200 km) and substantial diplomatic coordination, the vessel was seized after the combined effort from australian, south african, and uk assets. the charges were eventually dropped. b examples of detection and response to emerging infectious diseases of international concern. the first signs of a new flu a/h1n1 (''swine flu'') could be found early in local news reports posted on healthmap. it was one report by the early warning system gphin that alerted the who of an outbreak of acute respiratory illness in the mexican state of veracruz. this induced several iterations of communication between mexico and the who (2), as well as sharing of virus samples to us and canadian medical laboratories (3). at this stage, the who also issued several recommendations to its member states (4). this initiated a chain of national responses (some clearly beyond who recommendations), including thermal screening at airports, travel restrictions, and trade embargoes (5). the suspected causative agent of a/h1n1 induced stronger collaboration between who, fao, and oie, and the expansion of associated expert groups to include swine flu expertise resulting in continuous recommendations to member states (6). v. galaz et al. actors, to rapidly pool resources from participating network members at multiple levels. figure 2 is based on two case analyses (see ''appendix 2'' for details) and illustrates these multilevel collaboration and information sharing processes. national delegations to ccamlr integrate several types of actors, including ngos and fishing industry representatives, who in turn are also members in asoc, the antarctic southern ocean coalition (a global ngo network) or colto-coalition of legal toothfish operators (an industry network). this network can coordinate actors within matters of days and involve rapidly mobilizing capacity to apprehend illegal vessels (fig. 2a) . global epidemic networks are also nested in a larger global network landscape, with a similar ability to rapidly react to epidemic early warnings, including the identification of the sars coronavirus, analysis and response to an unknown form of influenza in madagascar, and prevention of epidemics of yellow fever in côte d'ivoire and senegal. which national, regional, or international organization that becomes the central coordinator depends on the disease agent and location of interest. however, the who global outbreak and response network (goarn) and the fao emergency centre that facilitates coordination for transboundary animal diseases (ectad) (fig. 2b ) are well-known key coordinating players. while the pacfa network does not respond to rapidly unfolding system dynamics per se, it tries to identify political opportunities in international policy arenas as a way to prevent ''tipping points'' related to marine systems and biodiversity. hence, the coordination challenges across multiple levels are similar as those identified for the other two networks, but with a different focus. for example, while a overall network aim was initially to influence international climate negotiations in copenhagen 2009 to integrate marine issues, the path toward this goal consisted of multiple coordinated actions at multiple levels (from assessing potential adaptation needs locally, communicating these, and influencing national delegations at side events to cop15). 
this required a capacity among a core group of actors (fao) to tap into resources and knowledge across levels and networks, including international scientific institutions (such as ices), international organizations (world bank), and place-based research ngos (worldfish) ). maintaining response capacity over time requires maintaining access to diverse resources and competences. all three cases examined here have evolved through time by strategically expanding the membership of the network to increase their ''portfolio diversity.'' ccamlr hosts and funds strategic training workshops in regions where effective response capacities have been viewed as lacking (e.g., in southern africa and malaysia). several member countries cooperate extensively around offshore monitoring and training, in order to improve the joint enforcement capacity of the network (ö sterblom and bodin 2012). actors within the network-coordinated nationally, between organizations or between groups of countries-also continuously develop suggestions for new and revised policy measures to address iuu fishing. global epidemic networks have stretched over time as the result of an explicit strategy to expand international surveillance and response networks, particularly in epidemic hotspot regions where surveillance is weak (e.g., asia and africa) and for diseases perceived as critical for the international community (such as avian influenza). the expansion is both strategic and crisis-driven, and involves continuous capacity building through workshops, conferences, and guidelines (e.g., heymann 2006) . similarly, pacfa has expanded membership in parts of the world where representation is missing and where tangible field presence could prove useful for attempts to link local governance to global institutional processes. it has also included member organizations of various types-from ngos to scientific organizations-as an explicit strategy to increase skills and resource portfolios. this has also proven as a fruitful way to create a platform for exchange of knowledge, ideas, and information among members. this seems to improve the network's capacity to coordinate local responses (such as improved local marine governance in the face of ocean acidification) and integrate scientific advice into negotiation texts for international policy improvement ). the synthesis indicates that issues of legitimacy are constantly being debated in all networks, but that their responses differ. in general, however, centrally placed actors seem to build legitimacy by strategically enhancing the diversity and number of members, increasing the degree of formalization in what originally were informal collaboration mechanisms, and by encouraging the entrenchment of the networks in various un organizations. cooperation within ccamlr to address iuu fishing emerged in a context where there was a significant risk of fish stock collapse, but where governments were unable to effectively act due to political sensitivities and constraints posed by consensus mechanisms in the network (ö sterblom and bodin 2012). controversial and unorthodox methods for conveying the importance of addressing iuu fishing were instead developed by a small ngo-fishing industry coalition in the mid-1990s (ö sterblom and sumaila 2011). 
a few governments in parallel also began exerting diplomatic pressure on member states of the commission that were associated with illegal fishing, for example as flag, or port states or with their nationals working on board iuu vessels or as owners of associated companies. this has at times resulted in substantial controversy and heated debate between member states about responsibilities and the role of ngos in ccamlr. joint enforcement operations in the southern ocean have been described as pushing the edge of international law (gullett and schofield 2007) , and suspected offenders have stated that they do not recognize existing territorial claims in the ccamlr area as legitimate (baird 2004) . continuous improvements of conservation measures and decision-making processes in ccamlr have proven important for securing legitimacy of procedures. legitimacy issues are also addressed by making commission reports available online (except for background reports and reports containing diplomatically sensitive material). the pacfa has also struggled to balance legitimacy and efficiency. as the goal to influence the unfccc process emerged, the ambitions of pacfa also become more explicitly political. this created tensions in the network between actors wanting to achieve tangible outcomes (and thus output legitimacy), and those concerned with overstepping their respective organizations' mandate (thus maintaining input legitimacy). a clear fault line with respect to this was observed between central international organizations and those representing sciencebased organizations in the buildup to the climate negotiations in copenhagen 2009. while most of the activities initially evolved through the work of a few centrally placed actors with modest formal support from the fao and its member states , the network has become increasingly formalized and recently became a un oceans-taskforce, which is likely to increase the network's input legitimacy in the un system. addressing the risks of novel infectious disease outbreaks has been very high on the political agenda for the last decade, especially in the face of recurrent outbreaks of novel animal influenzas with the capacity to infect humans. information processing and global networks and global change-induced tipping points 203 coordination work is currently supported through the revised international health regulation (ihr). however, this cooperation model orchestrated by the who has also raised severe issues of both input and output legitimacy, especially issues such as the role of scientific advice and the influence of pharmaceutical companies; vaccination recommendations associated with the last pandemic outbreak (''swineflu'' a/h1n1); and unequal global access to treatments. this has led to repeated calls for governance reforms aiming to increase transparency, effectiveness, and benefit sharing. it should be noted that border protection issues, primarily associated with the fear of new terror attacks after september 11 2001 in the form of intentional releases of novel diseases, have triggered substantial investments in global epidemic monitoring and response networks for disease outbreaks. analogous national security concerns in australia after the national elections in the year of 2001, and the bali bombings in 2002, likely contributed to an increased political opportunity to invest substantially in monitoring technologies for the southern ocean, as border protection became ''securitized'' (ö sterblom et al. 2011 ). 
in addition, international cooperation and exchange of information between compliance officers has increased substantially after the terrorist attacks in new york city in 2001. the expansion of the studied two networks described here thus appears to be linked to the securitization of the issue areas, which may have important implications for their perceived legitimacy in the future (c.f. curley and herington 2011). here, we examined three globally spanning networks that all attempt to respond to a diverse set of perceived global change-induced ''tipping points.'' as the analysis shows, the working propositions have highlighted several interesting functions worth further critical elaboration, associated with the attempts to govern complex, contested, and ''tipping point'' dynamics of global concern (summarized in table 2 ). in short, we have illustrated how state and non-state actors (here operationalized as global networks) attempt to build early warning capacities and improve their information processing capabilities; how they strategically expand the networks, as well as diversify their membership; how they reconfigure in ways that secures a prompt response in the face of abrupt change (e.g., novel rapidly diffusing disease, illegal fishery) or opportunities (e.g., climate negotiations); and how they mobilize economical and intellectual resources fundamentally supported by advances in information and communication technologies (e.g., through satellite monitoring and internet data mining). but crises responses are only one aspect of these networks. between times of abrupt change, centrally placed actors in the networks examined are involved in strategic planning aiming to bridge perceived monitoring or response gaps, capacity building needs, and secure longer-term investments. maintaining legitimacy seems to be critical also empirically for the ability of global networks to operate over time. preventing the transgression of perceived critical ''tipping points,'' however, requires not only early warning and response capacities, but also an ability to address complex and underlying human-environmental drivers that contribute to the problem at hand (walker et al. 2009 ). it is important to note that none of the networks studied here have neither the ability nor the mandate to directly address key underlying drivers such as climate change (e.g., for ocean acidification), land-use changes (e.g., associated with changed zoonotic risks), or technological change (e.g., contributing to increased interconnectedness and the loss of marine biodiversity). hence, it remains an open question whether global networks as the ones studied here will ever be able to collaborate if stipulated goals become more conflictive and complex, due to interactions between global drivers such as technological, demographical, and environmental (galaz 2014). yet, it would be a mistake to discard global networks as mere ''symptom treatment.'' the issues elaborated here not only exemplify how interacting institutions affect humanenvironmental systems at global scales (gehring and oberthür 2009 ). the cases also display how global networks attempt to complement functional gaps in the complex institutional and actor settings in which they are embedded. 
the perceived ''sense of urgency'' (i.e., avoiding the next pandemic, coping with potentially rapid ecological shifts in marine systems, or avoiding large-scale fish stock collapses) seemingly triggers the developed mechanism for obtaining, verifying, and sharing of information about, e.g., illegal vessels through its secretariat. ngo's and licensed fishing industry play key roles in complementing information action involves using political opportunities to bring marine issues on top of international policy agendas, and secure support and funding action involves rapidly coordinating crossnational and multiexpert teams to analyze and respond to epidemic emergencies of international concern action involves rapidly mobilizing national and international agents and assets to apprehend illegal vessels. emergence of global networks created by concerned state and non-state actors. figure 3 illustrates this proposed interplay between international institutions, perceptions of ''tipping points,'' and global networks. by shaping state and non-state action, international institutions play a critical role in affecting the creation of potential global change ''tipping points'' (a in fig. 3 ) (young 2008 (young , 2011 . these perceived ''tipping points'' also create mixed incentives (b) for collective action. while coordination failure is likely due to actor, institutional, and biophysicial complexity (young 2008) , the perceived urgency of the issue can also create incentives for action among international state and non-state actors, and spur the emergence of global networks based on common causal beliefs (b). these networks can support the enforcement (c) of existing international institutions through their ability to process information and coordinate multinetwork collaboration, as well as create the endogenous and exogenous pressure needed to induce changes in international institutions. as young notes, these sort of self-generating mechanisms can help build adaptability (d) and combat ''institutional arthritis'' (young 2010, 382) . for example, the emergence of novel zoonotic diseases (such as avian influenza) is intrinsically linked to the effectiveness of a suite of institutional rules at multiple levels, e.g., though urbanization, land-use change, and technological development (interplay). the potential of these diseases to rapidly transgress dangerous epidemic thresholds creates incentives for joint action, in this case through the emergence of global early warning and response networks, despite malfunctioning formal institutions (e.g., ihrs before the year 2005). as nation states agreed to reform the ihrs in 2005, the revisions built on technical standards, organizational operation procedures, and norms developed by who-coordinated networks years in advance (adaptability) (heymann 2006) . a similar mechanism seems to be at play for illegal, iuu fisheries. as the regional mandate of the existing international governance institution for iuu fishing proved insufficient (interplay), and the perceived threat of potential detrimental ''tipping points'' was perceived as valid (incentives), state and non-state actors increasingly developed their networks to operate at the global level, thereby drastically improving the enforcement capacities of existing international rules (ö sterblom and sumaila 2011). international actors trying to prepare for the possibly harmful human well-being implications of ocean acidification and rapid loss of marine biodiversity, also illustrate this triad. 
as these actors perceive the possible transgression of human-environmental ''tipping points'' (incentives), they coordinate their actions in global networks to increase their opportunities to bring additional issues to existing policy arenas created by international institutions (adaptability). at the same time, these institutions fundamentally affect the biophysical, technological, and social drivers that affect the ''tipping points'' at hand (interplay, e.g., the convention on biological diversity, climate change agreements under the unfccc, and the united nations convention on the law of the sea). the analysis is only tentative of course, especially considering the small number of cases, the contested nature of ''tipping points,'' and the need to explore additional working propositions. for example, we have not elaborated the cognitive and leadership processes leading up to a joint problem definition among the collaborating actors, nor a number of associated issues such as transparency and accountability in complex actor settings. however, the analysis brings together a number of theoretically and empirically founded propositions worth further attention. more precisely, how state and non-state actors perceive and frame global change-induced ''tipping points,'' the unfolding global network dynamics, and how these are shaped by international institutions (fig. 3) remain an interesting issue to explore further by scholars interested in the governance of a complex earth system. acknowledgments this research was supported by mistra through a core grant to the stockholm resilience centre, a cross-faculty research centre at stockholm university, and through grants from the futura foundation. h. ö . was supported by baltic ecosystem adaptive management (beam) and the nippon foundation. ö . b. was supported by the strategic research program ekoklim at stockholm university. b. c. was supported by the erling-persson family foundation. we are grateful to colleagues at the stockholm resilience centre, and to oran young, ruben zondervan and sarah cornell for detailed comments on early drafts of the article. pacfa is based on official web page (http://www.climatefish.org/index_en.htm) and galaz et al. 2011 . the network around iuu fishing in the ccamlr area is based on ö sterblom and sumaila (2011). note that not all members of ccamlr http://ccamlr.org/pu/e/ms/ contacts.htm are actively engaged in reducing iuu fishing as this is an organization tasked with multiple issues related to natural resources in the southern ocean. appendix 2: summary of case studies and template for data collection here, we briefly summarize the case studies and the protocol for data collection. data have been collected through semi-structured interviews, literature reviews, and surveys (see individual articles for details, i.e., galaz 2011 , galaz et al. 2010b , ö sterblom and sumaila 2011 , ö sterblom and bodin 2012 . all cases used the same methodology, combining document studies, simple network analysis, and semi-structured interviews with key international actors (a total of about 65 interviews for all cases). interviewees have been selected strategically to reflect an expected diversity of interests, resources, and role in the network collaboration of interest. the material has been structured and complemented with additional published and ''gray'' literature to elaborate the five overarching subjects below. 
these five subjects were identified during a series of author workshops and aimed to provide a structured overview of the perception of the problem to be addressed, the emergence of the network studied, as well as its function, effectiveness, and perceived legitimacy. original data sources and detailed methods are available in the literature cited. (1) what is the risk of a global change-induced ''tipping point''? a. what is the ''tipping point'' of interest? b. over which time frame does it operate and what is the response capacity required? c. what social and ecological uncertainties exist? d. what underlying social and ecological mechanisms increase the risk of ''tipping point'' behavior? (2) why is global coordination needed? (3) how did the global network emerge and evolve? a. how did the network emerge? b. which key actors/organizations were responsible for this development? c. what existing networks/governance features did these actors/organizations build on? d. how did the network develop over time and what is the current trajectory (how is capacity maintained and developed)? e. in what political context did the network emerge and develop? f. to what extent were they supported, or counteracted by state actors and/or other institutions? g. how is it coordinated (and what are the pros and cons of coordination)? h. what framework (legal or otherwise) is regulating network activities (and what are the pros and cons with the framework/lack of framework)? i. how are transparency issues addressed? j. what is known about the perceived legitimacy, fairness, and biases of the network? k. what are the primary tools of action for the network? (4) how is monitoring, sense-making, and coordinated responses enabled? a. what are the monitoring capacities of the network (is both ecological and social monitoring conducted)? b. how does the network achieve sense-making around ''tipping points''? c. how does the network enable rapid and coordinated responses? d. what role does information and communication technologies play in monitoring, sense-making, and response? e. what are major barriers for a continued evolution of the network? (5) what outputs and outcomes can be attributed to the network? a. what are the major outputs from the network? b. what are the most important outcomes from the network? case 1. pacfa: global partnership on climate, fisheries, and aquaculture oceans capture approximately one-third of anthropogenic emissions of greenhouse gases. this process is changing ocean chemistry, making oceans more acidic, with potentially enormous negative consequences for a wide range of marine species and societies as a result of losses of ecosystem services. recent research suggests that ocean acidification is likely to exhibit critical ''tipping points'' associated with rapid loss of coral reefs, as well as complex ocean-climate interactions affecting the oceans' capacities to capture carbon dioxide. the global arena for addressing these interrelated aspects of marine governance is characterized by a lack of effective coordination among the policy areas of marine biodiversity, fisheries, climate change, and ocean acidification. this has stimulated an attempt to better bridge these policy domains to increase coordination aimed at addressing potential critical tipping points. a number of international organizations primarily involved in fisheries initiated the global partnership on climate, fisheries, and aquaculture (from hereon pacfa) in 2008. 
currently, this initiative includes representatives from fao, unep, worldfish, the world bank, and 13 additional international organizations (fig. 1) . joint outputs include synthesizing information as a way to monitor the status of unfolding nonlinear dynamics, dissemination of information in science-policy workshops, and lobbying international arenas. 1a. pacfa is working with understanding and communicating the risk of tipping points related to ocean acidification, loss of carbon mitigating capacity of the oceans, and fish stock collapses. these tipping points are closely related to human activities (eutrophication, loss of marine biodiversity, degradation of coastal and marine habitats, and overfishing) with substantial impacts on livelihoods galaz et al. (2011) . 1b. tipping points related to these variables are possible within decades, but there is currently limited institutional capacity or development to address these integrated challenges galaz et al. (2011) . 1c. important uncertainties are related to the speed at which ocean acidification is likely to spread, where tipping point is located and what the social-ecological implications of such tipping points are? no agency or institutions is responsible, and the potential institutional development around this issue is also unclear dimitrov et al. 2007 ). 1d. (i) anthropogenic co 2 dissolves in the water and produces carbonic acid that strongly reduces the rate of calcification of marine organisms. (ii) there is a risk that the capacity of the oceans to act as a ''carbon sink'' is undermined through changed carbon cycle feedbacks. (iii) there are also possibilities for rapid collapse of fisheries as a result of multiple anthropogenic stresses (hoegh-guldberg et al. 2007; cox et al. 2000; hutchings and reynolds 2004) . 2. the problem domain is defined by multiple global environmental stresses (especially climate change and ocean acidification) and cannot be addressed at the sub-global level. the policy domain is characterized by a wide variety of global actors and initiatives galaz et al. (2011) . 3a. the network has evolved incrementally through informal personal contacts between centrally placed actors in international organizations. one of the networks first meetings of the was held in 2008, in rome, see galaz et al. (2011) . 3b. the fao, world bank, unep, and worldfish were key for developing the network galaz et al. (2011) . global networks and global change-induced tipping points 209 3c. the network emerged after repeated discussions between individuals centrally placed at international organizations. there was a perceived need to coordinate marine activities and connect them to the global climate agenda galaz et al. (2011) . 3d. the network started with a few individuals and their respective organizations, and has grown over time. pacfa has evolved from a loose communication network, to a formal partnership and recently (2010) became a un oceans-taskforce. 3 it is currently unclear how it will develop over time, but it will likely remain (as a minimum) a communication-based learning and coordinating network. evolution toward tangible joint field projects depends heavily on funding, see galaz et al. (2011) . 3e. in a context where climate change is high on the political agenda, but where there is a limited understanding of the associated challenges and limited political connection between those challenges. 3f. the evolution of the network has gained modest support from governments through the fao and its member states. 
the network successfully coordinated their activities with the indonesian government during the climate negotiations leading up to the meeting in copenhagen 2010 (cop 15). we have not identified any explicit opposition to the network. 3g. coordination is assumed by a small number of key organizations, centered around the fao. this institutional affiliation provides the network with legitimacy, connections, and necessary competence. however, it is also vulnerable as much of the coordination is centered around a small number of individuals, see galaz et al. (2011) . 3h. key coordinator of network has clear mandate by the fao, and its work is embedded in un law of the seas, and the convention on biological diversity galaz et al. (2011) . it should be noted however that ocean acidification and protection of marine ecosystems such as coral reef ecosystems have been denoted as ''non-regimes' ' dimitrov et al. (2007) . the lack of formal regimes or institutions devoted to ocean acidification has however not been identified as a problem by key actors in network. key barriers perceived by the network to further develop their work is rather a lack of funding and political support for the implementation of ecosystem-based approaches in marine systems, see galaz et al. (2011) . 3i. transparency is limited due to the informal structure of this network, see galaz et al. (2011) . 3j. a number of issues related to internal legitimacy have been raised by network members especially related to the use of scientific organizations as leverage to influence international policy processes . coordinating actors try to resolve external legitimacy challenges through actively recruiting members that secure a representation of diverse interests in the network, e.g., ranging from international organizations, to scientific organizations and ngo's in the global south. 3k. political lobbying aimed at influencing policy makers and policy processes through providing science-based synthesis describing the risk of transgressing possible future ''tipping points'' ). 4a. network members monitor different aspects of problems associated with climate changeocean acidification-marine biodiversity, however, only in a fragmented way. marine ecological parameters are the main focus of monitoring activities ). 4b. there is an explicit focus on the need for action before critical ''tipping points'' have been transgressed. the network describes this situation as ''self-regulating and buffering processes [could] break down, leading to irreversible consequences.'' 4 this message is part of the information material and technical reports produced by the network. 4c. the network primarily coordinates its activities around political opportunities. the network deals primarily with knowledge diffusion among its members, attempts to coordinate local coastal marine management or climate adaptation projects, and highlevel lobbying. scientific syntheses coordinated by the network play a key role for this last activity. for example, actors in the network were trying to gain international momentum for their cause and focused all their activities on trying to influence the international climate negotiations in copenhagen 2009, with an ambition to integrate marine issues in those discussions. 
this required the capacity of all core actors in the network to tap into resources and knowledge, including international scientific institutions (e.g., ices), international organizations (fao, world bank), and placebased research ngos (worldfish, with over a hundred partners, including universities and non-government organizations in the south). these syntheses not only provide an opportunity to bring together insights, but also feed into lobbying at high-level policy arenas ). 4d. e-mail and web pages are the primary tools for information dissemination and communication in the network 4e. funding over the long term and a lack of political support for implementing ecosystem-based management 5a. scientific reports and workshops where members are updated about scientific advances related to marine systems, and climate change and ocean acidification are the major outputs of the network. members also engage in continuous information sharing about relevant meetings. 5b. the outcomes have hitherto bee limited. the network does currently not have the necessary infrastructure or economic and human resources to engage in large-scale projects related to marine climate adaptation. case 2. global epidemic early warning and response networks the need for early and reliable warning of pending epidemics has been a major concern for the global community since the mid-nineteenth century, when cholera epidemics overran europe. the fact that there are strong disincentives for individual states to report disease outbreaks (due to associated losses of income from export and tourism) has created severe problems with detecting early warning signals and response problems. this is particularly critical, as coordinated responses are needed in order to avoid transgressing critical epidemic thresholds. a range of networks, from surveillance, laboratory, and expert networks, have emerged to secure information on early warning signals, mechanisms for rapid dissemination of information and coordinating responses. these networks are both global and regional, and include the who, the world organization for animal health (oie), the food and agriculture organization of the united nations (fao), the red cross, doctors without borders, and additional international and national organization. the networks facilitate responses to epidemic emergencies of international concern by providing early warning signals, rapid laboratory analysis, information dissemination, and coordination of activities in the field. 1a. there is a risk of uncontrolled spread of infectious disease through global transportation networks. complex interacting social (e.g., connectivity through international travel combined with limited infrastructure for public health in parts of the world) and ecological (land-use change, urbanization, habitat loss) drivers interact and may result in the spread of disease (galaz 2009 ). 1b. tipping points where disease becomes a matter of international concern can develop within the time frame of weeks. a global response capacity may be required within days or weeks, depending on disease characteristics (wallinga and teunis 2004) . 1c. key uncertainties include: what disease will emerge, when, where will it emerge, how fast can it spread, and how can it be addressed? 1d. the spread of disease will be increasingly difficult to control if the spread of the disease exceed the basic reproduction number r 0 [ 1. the basis reproduction number is defined by virulence, transmissibility, and severity (wallinga and teunis 2004) . 2. 
spread of novel infectious disease can become a global pandemic and require coordination among a diverse set of actors, including international organizations, laboratories, health ministries, and ngo's (in locations where international organizations have weak representation). 3a. international collaboration on infectious diseases originated in 1947 (the global influenza network coordinated by the who). there has been a rapid evolution of health-related networks (regional and/or disease specific), especially since the 1990s and 2000s. a coordination mechanism at the who specifically designed for rapid response to infectious disease emerged around 2000 and became formalized in 2005 (fidler 2004; galaz 2009 ). 3b. the who is the main coordinator of relevant networks; operation, and collaborating partners depends on disease and location. 3c. a number of serious disease outbreaks of international concern in the early 2000s (rift valley fever, ebola, anthrax, avian influenza, sars) led to a change in cooperation and the development of a new international legal framework (the ihrs) in 2005. an increased awareness of the potential global nature of pandemics thus enabled better coordination of networks that had developed from the 1990s (fidler 2004; galaz 2009 ). 3d. the networks have evolved incrementally over several decades, but with a rapid increase from 2003 and onwards. this development was stimulated by an increasing international concern and improved funding opportunities from international organizations and the usa (fidler 2004; galaz 2009 ). 3e. increasing international concerns of emerging infectious diseases from the 1990s and onward, such as outbreaks of avian influenza (years 1997, 2003-2004,) and us border protection security concerns related to terrorist attacks (deliberate introduction of lethal infectious disease) after the september 11 attacks in 2001, triggered rapid investments in global monitoring and response networks. these events also contributed to changes in the ihr in 2005 (fidler 2004 ). 3f. there is strong international support. however, some countries have also raised concerns. indonesia, for example, supplied h5n1 virus to the who global influenza surveillance network for analysis and preparation of vaccines for the world. the model where commercial companies produce the resulting vaccines was questioned by indonesia arguing that the produced vaccines would be unavailable to the country. in 2007, indonesia decided to suspend its sharing of viruses with the who, a decision that created serious tensions with the international community (fidler 2008) . more recently, there has also been concerns voiced by a wide set of groups about the connection between the who and pharmaceutical companies as the result of the last pandemic outbreak (''swineflu'' a/h1n1) (zarocostas 2011) . 3g. coordination is facilitated by international organizations such as the who, fao, and oie. the ability to rapidly receive, analyze, and disseminate alerts and responses is substantially facilitated by this coordination. rapid evaluation is key and is supported by crisis management structures such as early warning and confirmation mechanisms (galaz 2011 ). 3h. the activities are regulated by the ihrs and related technical plans [e.g., who's pandemic response plans (who 2005) ] provide formal guidance for coordination and action. however, these formal mechanisms are complemented with informal response procedures to maximize flexibility. 
the informal procedures create flexibility to emerging surprise events, but within a legal framework (who 2005) . however, responses to emerging infectious diseases such as avian influenza tend to be contested and may be conceptualized differently by different actors depending on national political agendas and cultural contexts (dry and leach 2010). 3i. who and associated partners redefine key policies based on criticism and changing circumstances. the importance of such activities has become apparent after controversial events, such as recent discussions about the role of who in issuing a pandemic alert for the 2009 outbreak of a/h1n1 (''swineflu''). 3j. the coordinating actors in the network (mainly who, fao) are generally viewed as legitimate, but concerns have also arisen. events that generated criticism include the association between the network and corporate interests (zarocostas 2011 ) and controversy about the best effective means to respond to outbreaks, and the distribution of costs related to these responses (dry and leach 2010). 3k. the networks take action through the coordination of information (laboratory analysis, early warning, confirmation) and collaboration around response options on the ground. 4a. a wide variety of sophisticated monitoring systems are in place and are enabled depending on the disease. the who is collaborating extensively with the canadian gphin, which has created an effective system for early detection of the spread of epidemic disease through web-monitoring. the use of satellite technology is important in some cases and is complemented with information platforms such as healthmap and the moderated e-mail list promed (galaz 2011) . monitoring centers on reports on disease outbreaks, but underlying drivers (e.g., land-use change and urbanization trends) are not included in such monitoring (galaz 2009 ). 4b. actions and investments are coordinated around pandemic phases, where action at the national and international level is dependent on the current assessment of the current status and spread of the disease (the pandemic phase), but also on the need for actions that maintain spread within certain levels (who 2005). 4c. through well-developed formal and informal mechanisms for cooperation, elaborate monitoring systems and large geographical spread and capacity (galaz 2011 ). 4d. information and communication technologies are crucial in all aspects of coordinated network responses, ranging from early warnings to on-the-ground responses. who regularly sets up videoconferences and secure web pages to facilitate rapid scientific collaboration on urgent epidemic issues (galaz 2011) . 4e. there are concerns that reduced future funding will hamper current monitoring and response infrastructure. there are also concerns that additional funding and political interest for investigating in general (preventive) health infrastructure is lacking in many developing countries (iom and nrc 2008) . critics have raised the issue that the focus is perhaps too much on disease of international concern, thereby obscuring other and more urgent national epidemic needs (dry and leach 2010) . 5a. the network has generated a large number of reports and technical documents (e.g., 12 and the who weekly epidemiological record available online: http://www.who. int/wer/en/). continuous national, regional, and national meetings are being conducted for information sharing and capacity building (both disease specific and with regional focus). 5b. 
outcomes are difficult to measure as evaluation criteria are contested. there are several successful case studies where unprecedented rapid international responses have been reported, especially for severe acute respiratory syndrome (sars) in 2002. a synthesis by chan et al. (2010) shows that international support in epidemic contingencies has over time become more timely due to changes in the ihrs and improved surveillance.
case 3. illegal, unreported, and unregulated fishing and the commission for the conservation of antarctic marine living resources. when illegal, unreported, and unregulated (iuu) fishing for patagonian toothfish emerged in the southern ocean in the mid-1990s, it was perceived that, if unchecked, it would lead to the collapse of valuable fish stocks and endanger seabird populations (table 1). illegal overfishing around south african sub-antarctic islands consequently resulted in a collapse of these stocks, in turn resulting in estimated losses on the order of hundreds of millions of us dollars. the toothfish market is essentially global: products are caught around the antarctic, landed in the southern hemisphere, and primarily consumed in the usa, japan, and europe. the fact that actors involved in iuu fishing are also dispersed globally necessitates coordination between states on all continents. the regional mandate of the existing international governance institution, the commission for the conservation of antarctic marine living resources (ccamlr), initially proved insufficient to address this global challenge. in response, state and non-state actors increasingly developed their networks and are currently operating at the global level (fig. 1, si). continuous coordinated international responses around adjacent islands have resulted in a dramatic reduction in iuu fishing. states now have a well-developed mechanism for coordinated monitoring and response when new actors, markets, or iuu fishing emerge in the southern ocean.
1a. the scientific committee of ccamlr has repeatedly warned of the imminent risk of collapsing valuable fish stocks and the extinction of globally threatened seabirds throughout the southern ocean (sc-camlr 1997, sc-camlr 2002). this has caused serious concern in the commission and threatened perceived critical commercial and environmental interests. passing regional tipping points for these resources would represent a failure of ccamlr to live up to the convention on the conservation of antarctic marine living resources, which explicitly states that these resources should be protected for the benefit of mankind.
1b. tipping points related to fish stocks and seabirds were expected within years, and a coordinated response was required before such tipping points were passed (österblom and sumaila 2011).
1c. key uncertainties include where iuu fishing (in the southern ocean) will emerge next, where the products will be landed, and which actors and states would be involved in such activities (österblom et al. 2010).
1d. a failure to regulate coastal fisheries and fishing capacity has resulted in collapsing fish stocks combined with an overcapacity of existing fishing fleets. such fleets move elsewhere (further south, further offshore, and fishing at larger depths) (pauly et al. 2003; swartz et al. 2010), which may result in iuu fishing (agnew et al. 2009). such iuu fishing dramatically reduced naturally long-lived and late-maturing fish and seabird populations to low levels in the southern ocean (sc-camlr 1997; österblom et al. 2010; miller et al. 2010).
2. the problem is global, as iuu vessels are highly mobile across large geographical areas. these vessels may change their flag state, call sign, and other identifying markers to avoid detection and apprehension (österblom et al. 2010). iuu actors use loopholes in international legal frameworks to their advantage and adapt faster than formal institutions develop (österblom and sumaila 2011; österblom et al. 2010).
3a. the network started to emerge from 1997 as a result of the scientific information provided to the meeting of ccamlr that year (sc-camlr 1997).
3b. the australian government, together with ngos from australia and norway, collaborated with licensed fishing industries to collect initial information (fallon and kriwoken 2004).
3c. the global network evolved from ccamlr, an organization with a regional mandate. the rationale for the network to start cooperating informally was a perceived inability of governments to take action before tipping points would be passed (österblom and bodin 2012).
3d. the network used ccamlr as a basis for action and developed starting with a small number of countries, which invested substantially in monitoring and enforcement, policy making, and diplomatic demarches (österblom and sumaila 2011; österblom and bodin 2012). a small number of ngos also cooperated to collect and spread information, in collaboration with licensed fishing companies (fallon and kriwoken 2004). over time, an increasing number of states and non-state actors became involved in collecting and sharing information and formalizing cooperation (österblom and sumaila 2011). members of the network have carried out training workshops in strategic locations. strategic capacity building has been directed toward regions with limited capacity but where ports have been used for offloading iuu catches.
3e. depletion of coastal fish resources shifted commercial fishing enterprises, but also political interests, toward the high seas (beyond the 200 nautical mile zone). this led to increasing global policy concern about iuu fishing, which could also be politically connected to border protection security. this was evident in australia and the usa, particularly after the september 11 attacks in new york city in 2001, the tampa affair, and the bali bombings in 2002 (österblom and sumaila 2011; constable et al. 2010).
3f. addressing iuu fishing has been a key challenge for ccamlr, as some member states have been associated with iuu fishing, e.g., as flag or port states or with their nationals working on board iuu vessels or as owners of iuu companies. this has at times resulted in substantial controversy and heated debate (österblom and sumaila 2011).
3g. ccamlr has a well-developed mechanism for obtaining, processing, and sharing information (from state and non-state actors) through its secretariat. non-state actors are improving the effectiveness of monitoring and enforcement, for example by reporting suspected vessel sightings and by creating peer pressure directed at states and corporations associated with iuu fishing (österblom and bodin 2012).
3h. an international convention (camlr) regulates the activities within ccamlr, and governments are also bound by global legal agreements (un-fsa 1995) and codes of conduct (fao 2005). information sharing has been facilitated by ccamlr protocols specifically designed to address iuu fishing. ccamlr provides an operational and transparent platform for cooperation, but new measures can be hampered by the need to arrive at consensus.
3i. all commission reports are published online, but background reports, including reports on diplomatically sensitive issues, are only published on a password-protected web site for delegates.
3j. disputes about some sub-antarctic islands (notably between argentina and the uk) represent a long-standing conflict. arrested iuu operators have expressed the opinion that exclusive economic zones in the southern ocean are illegitimate. several nations have clear benefits from fishing and also conduct expensive monitoring and enforcement. there are no benefit- or burden-sharing mechanisms in place (österblom and sumaila 2011).
3k. operational management, coordinated response, and policy making are the primary tools for action. countries collaborate to monitor and enforce compliance and form alliances to develop new policies (österblom and bodin 2012).
4a. cooperative international scientific surveys document the dynamics of fish stocks (constable et al. 2010). the network around ccamlr that addresses iuu fishing monitors fishing vessel activities and trade flows (österblom and bodin 2012).
4b. the scientific committee reports annually on the estimated levels of iuu catches and estimates the risk of tipping points, both in relation to fish stocks and seabird populations. a corresponding committee for evaluating compliance-related information reviews, on an annual basis, related social information (trade flows, vessel activities, etc.) (österblom and bodin 2012).
4c. the secretariat serves as an important hub in the network by coordinating information flows. responses are coordinated through either rapid engagement of partners to secure the seizure of a catch or vessel (between states), cooperation to perform alternative investigations (the fishing industry) or information campaigns (ngo community), or political lobbying for new policy tools (österblom and bodin 2012).
4d. satellite technology monitors vessel activities, and an electronic catch documentation scheme provides information on trade flows (österblom and bodin 2012). forensic accounting is an important tool for criminal investigations; it has been pioneered by us agencies and applied in cases leading to convictions associated with iuu operations in the ccamlr area.
4e. financial costs (monitoring is very expensive) depend on a small number of actors, governance capacity is limited in several countries used as port or flag states, and political will may influence the capacity of ccamlr to adapt to new challenges.
5a. ccamlr has developed a number of compliance mechanisms over time and is increasingly developing formal and informal governance mechanisms in order to respond effectively to iuu fishing in the southern ocean (österblom and sumaila 2011).
5b. continuous coordinated international responses around sub-antarctic islands and beyond have resulted in a dramatic reduction in iuu fishing (österblom and bodin 2012).
nested and teleconnected vulnerabilities to environmental change estimating the world-wide extent of illegal fishing network institutionalism illegal, unreported and unregulated fishing: an analysis of the legal, economic and historical factors relevant to its development and persistence the ties that lead: a social network approach to leadership climate negotiations under scientific uncertainty framing processes and social movements: an overview and assessment earth system governance' as a crosscutting theme of global change research planetary boundaries and earth system governance: exploring the links navigating the anthropocene: improving earth system governance global environmental governance reconsidered the role of social networks in natural resource governance: what relational patterns make a difference? social networks and natural resource management the politics of crisis management-public leadership under pressure does the terrestrial biosphere have planetary tipping points? ideas, politics, and public policy scale and cross-scale dynamics: governance and information in a multilevel world global capacity for emerging infectious disease detection cross-country experiences and policy implications from the global financial crisis designing policy for action: the emergency management system managing fisheries to conserve the antarctic marine ecosystem: practical implementation of the convention on the conservation of antarctic marine living resources (ccamlr) acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model global networks and global change-induced tipping points 217 rethinking epistemic communities twenty years later the securitisation of avian influenza: international discourses and domestic politics in asia the struggle to govern the commons international nonregimes: a research agenda information exchange and the robustness of organizational networks constructing effective environmental regimes governance and complexity-emerging issues for governance theory connective capacities of network managers network analysis, culture, and the problem of agency international influence of an australian nongovernment organization in the protection of patagonian toothfish report of the technical consultation to review progress and promote the full implementation of the international plan of action to prevent, deter and eliminate illegal, unreported and unregulated fishing and the international plan of action for the management of fishing capacity sars, governance and the globalization of disease influenza virus samples, international law, and global health diplomacy adaptive governance of social-ecological systems pandemic 2.0: can information technology really help us save the planet? double complexity-information technology and reconfigurations in adaptive governance global environmental governance, technology and politics: the anthropocene gap planetary boundaries-exploring the challenges for global environmental governance can web crawlers revolutionize ecological monitoring? polycentric systems and interacting planetary boundaries: emerging governance of climate change-ocean acidification-marine biodiversity the institutional dimensions of global environmental change: principal findings and future directions institutional and political leadership dimensions of cascading ecological crises saved by disaster? 
abrupt climate change, political inertia, and the possibility of an intergenerational arms race the causal mechanisms of interaction between international institutions what is a case study and what is it good for? pushing the limits of the law of the sea convention: australian and french cooperative surveillance and enforcement in the southern ocean banning chlorofluorocarbons: epistemic community efforts to protect stratospheric ozone sars and emerging infectious diseases: a challenge to place global solidarity above national sovereignty coral reefs under rapid climate change and ocean acidification adaptive environmental assessment and management social network analysis and the evaluation of leadership networks climate change, justice and sustainability: linking climate and development policy marine fish population collapses: consequences for recovery and extinction risk achieving sustainable global capacity for surveillance and response to emerging diseases of zoonotic origin: workshop report the rise of governance and the risks of failure: the case of economic development a charter moment: restructuring governance for sustainability resilience and regime shifts: assessing cascading effects the 2c target reconsidered managing uncertainties in networks: a network approach to problem solving and decision making tipping elements in the earth's climate system navigating social-ecological systems-building resilience for complexity and change mental models in human-environment interactions: theory, policy implications, and methodological explorations presented to the university of waterloo in fulfillment of the thesis requirement for the degree of doctor of philosophy in global governance iuu fishing in antarctic waters: ccamlr actions and regulations a quantitative approach to evaluating international environmental regimes theories of communication networks learning under uncertainty: networks in crisis management managing institutional complexity-regime interplay and global environmental change future global shocks-improving risk governance. oecd reviews of risk management policies regime complexes: a buzz, a boom, or a boost for global governance? 
global cooperation among diverse organizations to reduce illegal fishing in the southern ocean illegal fishing and the organized crime analogy global networks and global change-induced tipping points 219 toothfish crises, actor diversity and the emergence of compliance mechanisms in the southern ocean adapting to regional enforcement: fishing down the governance index a conceptual framework for analyzing adaptive capacity and multi-level learning processes in resource governance regimes the future for fisheries reframing crisis management adaptive comanagement: a systematic review and analysis planetary boundaries: exploring the safe operating space for humanity a safe operating space for humanity scientific committee for the conservation of antarctic marine living resources) scientific committee for the conservation of antarctic marine living resources) early-warning signals for critical transitions catastrophic regime shifts in ecosystems: linking theory to observation planetary boundaries: thresholds risk prolonged degradation the anthropocene: from global change to planetary stewardship managing legitimacy: strategic and institutional approaches the united nations agreement for the implementation of the provisions of the united nations convention on the law of the sea of 10 december 1982 relating to the conservation and management of straddling fish stocks and highly migratory fish stocks instantiating global crisis networks: the case of sars governance'' as a bridge between disciplines: crossdisciplinary inspiration regarding shifts in governance and problems of governability, accountability and legitimacy looming global-scale failures and missing institutions a handful of heuristics and some propositions for understanding resilience in social-ecological systems different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures who global influenza preparedness plan-the role of who and recommendations for national measures before and during pandemics multilevel governance of global environmental change-perspectives from science, sociology and the law the institutional dimensions of global environmental change: principal findings and future directions institutional dynamics: resilience, vulnerability and adaptation in environmental and resource regimes effectiveness of international environmental regimes: existing knowledge, cuttingedge themes, and research strategies who processes on dealing with a pandemic need to be overhauled and made more transparent key: cord-104001-5clslvqb authors: wang, xiaoqi; yang, yaning; liao, xiangke; li, lenli; li, fei; peng, shaoliang title: selfrl: two-level self-supervised transformer representation learning for link prediction of heterogeneous biomedical networks date: 2020-10-21 journal: biorxiv doi: 10.1101/2020.10.20.347153 sha: doc_id: 104001 cord_uid: 5clslvqb predicting potential links in heterogeneous biomedical networks (hbns) can greatly benefit various important biomedical problem. however, the self-supervised representation learning for link prediction in hbns has been slightly explored in previous researches. therefore, this study proposes a two-level self-supervised representation learning, namely selfrl, for link prediction in heterogeneous biomedical networks. the meta path detection-based self-supervised learning task is proposed to learn representation vectors that can capture the global-level structure and semantic feature in hbns. 
the vertex entity mask-based self-supervised learning mechanism is designed to enhance local association of vertices. finally, the representations from the two tasks are concatenated to generate high-quality representation vectors. the results of link prediction on six datasets show selfrl outperforms 25 state-of-the-art methods. in particular, selfrl reveals great performance with results close to 1 in terms of auc and aupr on the neodti-net dataset. in addition, the pubmed publications demonstrate that nine out of ten drugs screened by selfrl can inhibit the cytokine storm in covid-19 patients. in summary, selfrl provides a general framework that develops self-supervised learning tasks with unlabeled data to obtain promising representations for improving link prediction. in recent decades, networks have been widely used to represent biomedical entities (as nodes) and their relations (as edges). predicting potential links in heterogeneous biomedical networks (hbns) can be beneficial to various significant biology and medicine problems, such as target identification, drug repositioning, and adverse drug reaction prediction. for example, network-based drug repositioning methods have already offered promising insights to boost the effective treatment of covid-19 disease (zeng et al. 2020; xiaoqi et al. 2020) since its outbreak in december of 2019. many network-based learning approaches have been developed to facilitate link prediction in hbns. in particular, network representation learning methods, which aim at converting high-dimensional networks into a low-dimensional space while maximally preserving structural properties (cui et al. 2019), have provided effective and potential paradigms for link prediction (li et al. 2017). nevertheless, most of the network representation learning-based link prediction approaches heavily depend on a large amount of labeled data. the requirement of large-scale labeled data may not be met in many real link prediction settings for biomedical networks (su et al. 2020). to address this issue, many studies have focused on developing unsupervised representation learning algorithms that use the network structure and vertex attributes to learn low-dimensional vectors of nodes in networks (yuxiao et al. 2020), such as grarep (cao, lu, and xu 2015), tadw (cheng et al. 2015), line (tang et al. 2015), and struc2vec (ribeiro, saverese, and figueiredo 2017). however, these network representation learning approaches are aimed at homogeneous networks and cannot be applied directly to hbns. therefore, a growing number of studies have integrated meta paths, which are able to capture topological structure features and relevant semantics, to develop representation learning approaches for heterogeneous information networks. dong et al. used meta path based random walks and then leveraged a skip-gram model to learn node representations (dong, chawla, and swami 2017). shi et al. proposed a fusion approach to integrate different representations based on different meta paths into a single representation (shi et al. 2019). ji et al. developed attention-based meta path fusion for heterogeneous information network embedding (ji, shi, and wang 2018). wang et al. proposed a meta path-driven deep representation learning for a heterogeneous drug network (xiaoqi et al. 2020).
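the meta path guided random walks used by approaches such as metapath2vec can be sketched in a few lines. the snippet below is only an illustration, not code from any of the cited papers: the adjacency structure, the node types, and the toy drug-protein schema are invented for the example, and a real implementation would add weighted sampling and batching before feeding the walks to a skip-gram model.

```python
import random
from collections import defaultdict

def metapath_walk(adj, node_types, start, metapath, walk_length):
    """One meta-path-guided walk, e.g. metapath = ['drug', 'protein'].

    adj: dict mapping node -> list of out-neighbour nodes (directed edges).
    node_types: dict mapping node -> its type label.
    The walk repeatedly cycles through the meta path, at each step moving to
    a uniformly chosen out-neighbour whose type matches the next slot.
    """
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        wanted = metapath[len(walk) % len(metapath)]
        candidates = [v for v in adj[current] if node_types[v] == wanted]
        if not candidates:          # dead end for this meta path
            break
        walk.append(random.choice(candidates))
    return walk

# toy network: d* = drugs, p* = proteins (hypothetical names)
adj = defaultdict(list)
for u, v in [('d1', 'p1'), ('p1', 'd2'), ('d2', 'p2'), ('p2', 'd1')]:
    adj[u].append(v)
node_types = {'d1': 'drug', 'd2': 'drug', 'p1': 'protein', 'p2': 'protein'}

walks = [metapath_walk(adj, node_types, 'd1', ['drug', 'protein'], 6)
         for _ in range(3)]
print(walks)   # node sequences that could be fed to a skip-gram model
```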
unfortunately, most meta path-based network representation approaches have focused on preserving vertex-level information by formalizing meta paths and then leveraging a word embedding model to learn node representations. therefore, the global-level structure and semantic information among vertices in heterogeneous networks is hard to fully model. in addition, these representation approaches are not specifically designed for link prediction, and thus learn representations that are not explicitly suited to link prediction. on the other hand, self-supervised learning, which is a form of unsupervised learning, has been receiving more and more attention. self-supervised representation learning formulates pretext tasks using only unlabeled data to learn representation vectors without any manual annotations (xiao et al. 2020). self-supervised representation learning techniques have been widely used in various domains, such as natural language processing, computer vision, and image processing. however, very few approaches have been generalized to hbns, because the structure and semantic information of heterogeneous networks differs significantly between domains, and a model trained on a pretext task may be unsuitable for link prediction tasks. based on the above analysis, there are two main problems in link prediction based on network representation learning. the first is how to design a self-supervised representation learning approach that uses a large amount of unlabeled data to learn low-dimensional vectors integrating the different-view structure and semantic information of hbns. the second is how to ensure that the pretext tasks in self-supervised representation learning are beneficial for link prediction in hbns. to overcome these issues, this study proposes a two-level self-supervised representation learning method (selfrl) for link prediction in heterogeneous biomedical networks. first, a meta path detection self-supervised learning mechanism is developed to train a deep transformer encoder for learning low-dimensional representations that capture the path-level information of hbns. meanwhile, selfrl integrates the vertex entity mask task to learn local associations of vertices in hbns. finally, the representations from the entity mask and meta path detection tasks are concatenated to generate the embedding vectors of nodes in hbns. the results of link prediction on six datasets show that the proposed selfrl is superior to 25 state-of-the-art methods. in summary, the contributions of the paper are listed below:
• we propose a two-level self-supervised representation learning method for hbns that integrates the meta path detection and vertex entity mask self-supervised learning tasks, based on a large amount of unlabeled data, to learn high-quality representation vectors of vertices.
• the meta path detection self-supervised learning task is developed to capture the global-level structure and semantic features of hbns. meanwhile, the vertex entity mask model is designed to learn local associations of nodes. therefore, the representation vectors of selfrl integrate two-level structure and semantic features of hbns.
• the meta path detection task is specifically designed for link prediction. the experimental results indicate that selfrl outperforms 25 state-of-the-art methods on six datasets. in particular, selfrl reveals great performance with results close to 1 in terms of auc and aupr on the neodti-net dataset.
heterogeneous biomedical network. a heterogeneous biomedical network is defined as g = (v, e), where v denotes a biomedical entity set and e represents a biomedical link set. in a heterogeneous biomedical network, a mapping function of vertex type φ(v): v → a and a mapping function of relation type ψ(e): e → r are used to associate each vertex v and each edge e with a type, respectively. a and r denote the sets of entity and relation types, where |a| + |r| > 2. for a given heterogeneous network g = (v, e), the network schema t_g can be defined as a directed graph defined over the object types a and link types r, that is, t_g = (a, r). the schema of a heterogeneous biomedical network expresses all allowable relation types between different types of vertices, as shown in figure 1.
figure 1: schema of the heterogeneous biomedical network that includes four types of vertices (i.e., drug, protein, disease, and side-effect).
network representation learning plays a significant role in various network analysis tasks, such as community detection, link prediction, and node classification. therefore, network representation learning has been receiving more and more attention during recent decades. network representation learning aims at learning low-dimensional representations of network vertices such that proximities between them in the original space are preserved (cui et al. 2019). the network representation learning approaches can be roughly categorized into three groups: matrix factorization-based approaches, random walk-based approaches, and neural network-based approaches (yue et al. 2019). the matrix factorization-based methods extract an adjacency matrix and factorize it to obtain the representation vectors of vertices; examples include laplacian eigenmaps (belkin and niyogi 2002) and the locally linear embedding method (roweis and saul 2000). the traditional matrix factorization has many variants that often focus on factorizing a high-order data matrix, such as grarep (cao, lu, and xu 2015) and hope (ou et al. 2016). inspired by word2vec (mikolov et al. 2013), random walk-based methods such as deepwalk (perozzi, alrfou, and skiena 2014), node2vec (grover and leskovec 2016), and metapath2vec/metapath2vec++ (dong, chawla, and swami 2017) transform a network into node sequences. these models were later extended by struc2vec (ribeiro, saverese, and figueiredo 2017) for the purpose of better modeling structural identity. over the past years, neural network models have been widely used in various domains, and they have also been applied to network representation learning. in neural network-based network representation learning, different methods adopt different learning architectures and use various network information as input. for example, line (tang et al. 2015) aims at embedding by preserving both local and global network structure properties. sdne (wang, cui, and zhu 2016) and dngr (cao 2016) were developed using a deep autoencoder architecture. graphgan (wang et al. 2017) adopts generative adversarial networks to model the connectivity of nodes. predicting potential links in hbns can greatly benefit various important biomedical problems. this study proposes selfrl, a two-level self-supervised representation learning algorithm, to improve the quality of link prediction. the flowchart of the proposed selfrl is shown in figure 2.
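as a concrete and purely illustrative reading of the definitions above, a heterogeneous biomedical network with the schema of figure 1 can be stored as a directed graph whose nodes carry a type attribute and whose edges are validated against the allowed relation types. the schema set, the helper function, and the example entities (e.g., 'aspirin', 'PTGS2') below are assumptions made for the sketch, not details taken from the paper.

```python
import networkx as nx

# Allowed relation types, loosely following the drug/protein/disease/side-effect
# schema of figure 1 (the exact relation set here is an assumption).
SCHEMA = {('drug', 'drug'), ('drug', 'protein'), ('drug', 'disease'),
          ('drug', 'side_effect'), ('protein', 'protein'), ('protein', 'disease')}

def add_typed_edge(g, u, u_type, v, v_type, weight=1.0):
    """Insert a directed, weighted edge only if the schema allows it."""
    if (u_type, v_type) not in SCHEMA:
        raise ValueError(f'relation {u_type}->{v_type} not in the schema')
    g.add_node(u, ntype=u_type)
    g.add_node(v, ntype=v_type)
    g.add_edge(u, v, weight=weight)

g = nx.DiGraph()
add_typed_edge(g, 'aspirin', 'drug', 'PTGS2', 'protein', weight=1.0)
add_typed_edge(g, 'aspirin', 'drug', 'headache', 'disease', weight=1.0)
print(g.nodes(data=True))
print(g.edges(data=True))
```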
considering that meta paths reflect heterogeneous characteristics and rich semantics, selfrl first uses a random walk strategy guided by meta paths to generate node sequences that are treated as the true paths of hbns. meanwhile, an equal number of false paths is produced by randomly replacing some of the nodes in each true path. then, based on the true paths, this work proposes a vertex entity mask self-supervised learning task to train a deep transformer encoder for learning entity-level representations. in addition, a meta path detection-based self-supervised learning task based on all true and false paths is designed to train a deep transformer encoder for learning path-level representation vectors. finally, the representations obtained from the two-level self-supervised learning tasks are concatenated to generate the embedding vectors of vertices in hbns, which are then used for link prediction.
true path generation. a meta path is a composite relation denoting a sequence of adjacent links between nodes a_1 and a_i in a heterogeneous network, and can be expressed in the form a_1 -r_1-> a_2 -r_2-> ... -r_{i-1}-> a_i, where r_i represents a relation in the schema between two object types. different adjacent links indicate distinct semantics. in this study, all the meta paths are reversible and no longer than four nodes. this is based on the results of previous studies showing that meta paths longer than four nodes may be too long to contribute informative features (fu et al. 2016). in addition, sun et al. have suggested that short meta paths are good enough, and that long meta paths may even reduce the quality of semantic meanings (sun et al. 2011). in this work, each network vertex and each meta path are regarded as a word and a sentence, respectively. indeed, a large percentage of meta paths are biased toward highly visible objects. therefore, three key steps are defined to keep a balance between different semantic types of meta paths, as follows: (1) generate all sequences according to meta paths whose forward and reverse directional sampling probabilities are the same and equal to 0.5; (2) count the total number of meta paths of each type, and calculate their median value (n); (3) randomly select n paths if the total number of meta paths of a given type is larger than n; otherwise, select all sequences. the selected paths reflect topological structures and interaction mechanisms between vertices in hbns, and will be used to design self-supervised learning tasks that learn low-dimensional representations of network vertices.
false path generation. the paths selected using the above procedure are treated as the true paths in hbns. an equal number of false paths is produced by randomly replacing some nodes in each of the true paths. in other words, each true path corresponds to a false path. there is no relation between the replacement nodes and their context in false paths, and the number of replaced nodes is less than the length of the current path. for instance, a true path (i.e., d3 p8 d4 e9) is shown in figure 2(b). during the generation procedure of false paths, the 1st and 3rd tokens, which correspond to d3 and d4, respectively, are randomly chosen, and two nodes from the hbn, which correspond to d2 and d1, respectively, are also randomly chosen. if there is no relationship between d2 and p8, node d3 is replaced with d2. if there is a relationship between d2 and p8, another node from the network is chosen until the mentioned condition is satisfied.
similarly, node d4 is replaced with d1, because there are no relations between d1 and e9 (or p8). finally, the path (i.e., d2 p8 d1 e9) is treated as a false path.
meta path detection. in the general language understanding evaluation benchmark, the corpus of linguistic acceptability (cola) is a binary classification task, where the goal is to predict whether a sentence is linguistically acceptable or not. in addition, perozzi et al. have suggested that paths generated by short random walks can be regarded as short sentences (perozzi, alrfou, and skiena 2014). inspired by their work, this study assumes that true paths can be treated as linguistically acceptable sentences, while false paths can be regarded as linguistically unacceptable sentences. based on this hypothesis, we propose the meta path detection task, where the goal is to predict whether a path is acceptable or not. in the proposed selfrl, a set of true and false paths is fed into the deep transformer encoder for learning path-level representation vectors. selfrl maps a path of symbols to an output vector of continuous representations, which is fed into the softmax function to predict whether a path is a true or false path. clearly, the only distinction between true and false paths is whether there is an association between the nodes of a path sequence. therefore, the meta path detection task is, to a certain extent, an extension of link prediction. in particular, when a path includes only two nodes, meta path detection is equal to link prediction. for instance, judging whether a path such as d1 s5 in figure 2(b) is a true or false path is the same as predicting whether there is a relation between d1 and s5. however, the meta path detection task is generally more difficult than link prediction, because it requires the understanding of long-range composite relationships between vertices of hbns. therefore, the meta path detection-based self-supervised learning task encourages the model to capture high-level structure and semantic information in hbns, thus facilitating the performance of link prediction.
in order to capture the local information of hbns, this study develops the vertex entity mask-based self-supervised learning task, where nodes in true paths are randomly masked and the model then predicts those masked nodes. the vertex entity mask task has been widely applied in natural language processing. however, using the vertex entity mask task to drive a heterogeneous biomedical network representation model is a less explored research direction. in this work, the vertex entity mask task follows the implementation described in bert, and the implementation is almost identical to the original (devlin et al. 2018). in brief, 15% of the vertex entities in true paths are randomly chosen for prediction. for each selected vertex entity, there are three different operations for improving the model generalization performance: the selected vertex entity is replaced with the [mask] token 80% of the time, replaced with a random node 10% of the time, and kept unchanged the remaining 10% of the time. finally, the masked path is used for training a deep transformer encoder model according to the vertex entity mask task, where the last hidden vectors corresponding to the masked vertex entities are fed into the softmax function to predict their original vertices with a cross entropy loss. the vertex entity mask task preserves a local contextual representation of every vertex.
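a minimal sketch of the false-path replacement and the vertex masking described above is given below, assuming the network is available as an adjacency dictionary. the function names, the bounded retry, and the plain '[mask]' string are implementation choices of this sketch rather than details from the paper; the 15%/80%/10%/10% masking rule follows the bert recipe quoted in the text.

```python
import random

def has_relation(adjacency, a, b):
    """True if there is an edge a -> b or b -> a in the adjacency dict of sets."""
    return b in adjacency.get(a, set()) or a in adjacency.get(b, set())

def make_false_path(true_path, adjacency, all_nodes, rng=random):
    """Replace a random subset of nodes (fewer than the path length) with
    nodes that have no relation to their neighbours in the path."""
    false_path = list(true_path)
    n_replace = rng.randint(1, len(true_path) - 1)
    for idx in rng.sample(range(len(true_path)), n_replace):
        left = false_path[idx - 1] if idx > 0 else None
        right = false_path[idx + 1] if idx < len(true_path) - 1 else None
        for _ in range(100):                       # bounded retry (sketch choice)
            candidate = rng.choice(all_nodes)
            if candidate == false_path[idx]:
                continue
            if left is not None and has_relation(adjacency, left, candidate):
                continue
            if right is not None and has_relation(adjacency, candidate, right):
                continue
            false_path[idx] = candidate
            break
    return false_path

def mask_path(true_path, all_nodes, mask_token='[mask]', rate=0.15, rng=random):
    """BERT-style masking: ~15% of positions get a label; of those,
    80% become the mask token, 10% a random node, 10% stay unchanged."""
    masked, labels = list(true_path), [None] * len(true_path)
    for i, node in enumerate(true_path):
        if rng.random() < rate:
            labels[i] = node                       # the model must recover this
            r = rng.random()
            if r < 0.8:
                masked[i] = mask_token
            elif r < 0.9:
                masked[i] = rng.choice(all_nodes)
    return masked, labels
```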
the vertex entity mask-based self-supervised learning task captures the local association of the vertex in hbns. the meta path detection-based self-supervised learning task enhances the global-level structure and semantic features of the hbns. therefore, the two-level representations are concatenated as the final embedding vectors that integrate structure and semantics information on hbns from different level, as shown in figure 2 (f). layer normalization the model of selfrl is a deep transformer encoder, and the implementation is almost identical to the original (vaswani et al. 2017) . the selfrl follows the overall architecture that includes the stacked self-attention and point-wise, fully connected layers, and softmax function, as shown in figure 3 . multi-head attention an attention function can be described as mapping a query vectors and a set of key-value pairs to an output vectors. the multi-head attention leads table 1 : the node and edge statistics of the datasets. here, ddi, dti, dsa, dda, pda, ppi represent the drug-drug interaction, drug-target interaction, drug-side-effect association, and drug-disease association, protein-disease association and protein-protein interaction, respectively. where w o is a parameter matrices, and h i is the attention function of i-th subspace, and is given as follows: respectively denotes the query, key, and value representations of the i-th subspace, and w is parameter matrices which represent that q, k, and v are transformed into h i subspaces, and d and d k hi represent the dimensionality of the model and h i submodel. position-wise feed-forward network in addition to multi-head attention layers, the proposed selfrl model include a fully connected feed-forward network, which includes two linear transformations with a relu activation function, is given as follows: there are the same the linear transformations for various positions, while these linear transformations use various parameters from layer to layer. residual connection for each sub-layer, a residual connection and normalization mechanism are employed. that is, the output of each sub-layer is given as follows: where x and f (x) stand for input and the transformational function of each sub-layer, respectively. in this work, the performance of selfrl is evaluated comprehensively by link prediction on six datasets. the results of selfrl is also compared with the results of 25 methods. for neodti-net datasets, the performance of selfrl is compared with those of seven state-of-the-art methods, including mscmf ( . the details on how to set the hyperparameters in above baseline approaches can be found in neodti (wan et al. 2018) . for deepdr-net datasets, the link prediction results generated by selfrl are compared with that of seven baseline algorithms, including deepdr (zeng et al. 2019) , dtinet (luo et al. 2017) , kernelized bayesian matrix factorization (kbmf) (gonen and kaski 2014) , support vector machine (svm) (cortes and vapnik 1995) , random forest (rf) (l 2001), random walk with restart (rwr) (cao et al. 2014) , and katz (singhblom et al. 2013) . the details of the baseline approaches and hyperparameters selection can be seen in deepdr (zeng et al. 2019) . for single network datasets, selfrl is compared with 11 network representation methods, that is laplacian (belkin and niyogi 2003) , singular value decomposition (svd), graph factorization (gf) (ahmed et al. 2013) , hope (ou et al. 
2016) , grarep (cao, lu, and xu 2015) , deepwalk (perozzi, alrfou, and skiena 2014) , node2vec (grover and leskovec 2016) , struc2vec (ribeiro, saverese, and figueiredo 2017) , line (tang et al. 2015) , sdne (wang, cui, and zhu 2016) , and gae (kipf and welling 2016) . more implementation details can be found in bionev (yue et al. 2019) . the hyperparameters selection of baseline methods were set to default values, and the original data of neodti (wan et al. 2018) , deepdr (zeng et al. 2019) , and bionev (yue et al. 2019) were used in the experiments. the parameters of the proposed selfrl follows those of the bert (devlin et al. 2018 ) which the number of transformer blocks (l), the number of self-attention heads (a), and the hidden size (h) is set to 12, 12, and 768, respectively. for the neodti-net dataset, the embedding vectors are fed into the inductive matrix completion model (imc) (jain and dhillon 2013) to predict dti. the number of negative samples that are randomly chosen from negative pairs, is ten times that of positive samples according to the guidelines in neodti (wan et al. 2018) . then, to reduce the data bias, the ten-fold cross-validation is performed repeatedly ten times, and the average value is calculated. for the deepdr-net dataset, a collective variational autoencoder (cvae) is used to predict dda. all positive samples and the same number of negative samples that is randomly selected from unknown pairs are used to train and test the model according to the guidelines in deepdr (zeng et al. 2019) . then, five-fold crossvalidation is performed repeatedly 10 times. for neodti-net and deepdr-net datasets, the area under precision recall (aupr) curve and the area under receiver operating characteristic (auc) curve are adopted to evaluate the link prediction performance generated by all approaches. for other datasets, the representation vectors are fed into the logistic regression binary classifier for link prediction, the training set (80%) and the testing set (20%) consisted of the equal number of positive samples and negative samples that is randomly selected from all the unknown interactions according to the guidelines in bionev. the performance of different methods is evaluated by accuracy (acc), auc, and f1 score. the overall performances of all methods for dti prediction on the neodti-net dataset are presented in figure 4 . selfrl shows great results with the auc and aupr value close to 1, and significantly outperformed the baseline methods. in particular, neodti and dtinet were specially developed for the neodti-net dataset. however, selfrl is still superior to both neodti and dtinet, improving the aupr by approximately 10% and 15%, respectively. the results of dda prediction of selfrl and baseline methods are represented in figure 5 . these experimental results demonstrate that selfrl generates better results of the dda prediction on the deepdr-net dataset than the baseline methods. however, selfrl achieves the improvements in term of auc and aupr less than 2%. a major reason for such a poor superiority of the selfrl to the other methods is that selfrl considers only four types of objects and edges. however, deepdr included 12 types of vertices and 11 types of edges of drug-related data. in addition, deepdr specially integrated multi-modal deep autoencoder (mda) and cvae model to improve the dda prediction on the deepdr-net dataset. unfortunately, the selfrl+cvae combination maybe reduce the original balance between the mda and cvae. 
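the evaluation protocol described earlier in this section (randomly sampled negatives, repeated cross-validation, auc and aupr) can be approximated with the sketch below. it is a simplification: logistic regression on element-wise products of node embeddings stands in for the imc and cvae models actually used in the paper, and only the 10:1 negative ratio and the fold count are taken from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import StratifiedKFold

def edge_features(emb, pairs):
    # element-wise product of the two node embeddings (one common choice)
    return np.array([emb[u] * emb[v] for u, v in pairs])

def evaluate(emb, positive_pairs, candidate_negatives, neg_ratio=10,
             folds=10, seed=0):
    """emb: dict node -> np.ndarray; pairs are lists of (u, v) tuples."""
    rng = np.random.default_rng(seed)
    n_neg = min(len(candidate_negatives), neg_ratio * len(positive_pairs))
    neg_idx = rng.choice(len(candidate_negatives), size=n_neg, replace=False)
    pairs = list(positive_pairs) + [candidate_negatives[i] for i in neg_idx]
    y = np.array([1] * len(positive_pairs) + [0] * n_neg)
    X = edge_features(emb, pairs)

    aucs, auprs = [], []
    cv = StratifiedKFold(folds, shuffle=True, random_state=seed)
    for train, test in cv.split(X, y):
        clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        score = clf.predict_proba(X[test])[:, 1]
        aucs.append(roc_auc_score(y[test], score))
        auprs.append(average_precision_score(y[test], score))
    return float(np.mean(aucs)), float(np.mean(auprs))
```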
the above results and analysis indicate that the proposed selfrl is a powerful network representation approach for complex heterogeneous networks, and that can achieve very promising results in link prediction. such a good performance of the proposed selfrl is due to the following facts: (1) selfrl designs a two-level self-supervised learning task to integrate the local association of a node and the global level information of hbns. (2) meta path detection selfsupervised learning task that is an extension of link prediction, is specially designed for link prediction. in particular, path detection of two nodes is equal to link prediction. therefore, the representation generated by meta path detection is able to facilitate the link prediction performance. (3) selfrl uses meta paths to integrate the structural and semantic features of hbns. in this section, the link prediction results on four single network datasets are presented to further verify the representable 2 , and the best results are marked in boldface. selfrl shows higher accuracy in link prediction on four single networks compared to the other 11 baseline approaches. especially, the proposed selfrl can achieves an approximately 2% improvement in terms of auc and acc over the second best method on the string-ppi dataset. the auc value of link prediction on the ndfrt-dda dataset is improved from 0.963 to 0.971 when selfrl is compared with grarep. however, grarep only achieves an enhancement of 0.001 compared to line that is the third best method on the string-ppi dataset. therefore, the improvement of selfrl is significant in comparison to the enhancement of grarep compared to line. meanwhile, we also notice that selfrl have poor superiority to the second best method on the ctd-dda and drugbank-ddi datasets. one possible reason for this result can be that the structure and semantic of the ctd-dda and drugbank-ddi datasets are simple and monotonous, so most of the network representation approaches are able to achieve good performance on them. consequently, the proposed selfrl is a potential representation method for the single network datasets, and can contribute to link prediction by introducing a two-level self-supervised learning task. in the neodti and deepdr, low-dimensional representations of nodes in hbns are first learned by network representation approaches, and then are fed into classifier models for predicting potential link among vertices. to further examine the contribution of the network representation approaches, the low-dimensional representation vector is fed into svm that is a traditional and popular classifier for link prediction. the experimental results of these combinations are shown in table 3 . selfrl achieves the best per-formance in link prediction for complex heterogeneous networks, providing a great improvement of over 10% with regard to auc and aupr compared to the neodti and deepdr. with the change of classifiers, the result of sel-frl in link prediction reduced from 0.988 to 0.962 on the neodti-net dataset, while the auc value of neodti approximately reduce by 9%. interestingly, the results on the deepdr-net dataset are similar. therefore, the experimental results indicate that the network representation performance of selfrl is more robust and better than those of the other embedding approaches. this is mainly because selfrl integrates a two-level self-supervised learning model to fuse the rich structure and semantic information from different views. 
meanwhile, path detection is an extension of link prediction, yielding to better representation in link prediction. the emergence and rapid expansion of covid-19 have posed a global health threat. recent studies have demonstrated that the cytokine storm, namely the excessive inflammatory response, is a key factor leading to death in patients with covid-19. therefore, it is urgent and important to discover potential drugs that prevent the cytokine storm in covid-19 patients. meanwhile, it has been proven that interleukin(il)-6 is a potential target of antiinflammatory response, and drugs targeting il-6 are promising agents blocking cytokine storm for severe covid-19 patients (mehta et al. 2020 ). in the experiments, selfrl is used for drug repositioning for covid-19 disease which aim to discovery agents binding to il-6 for blocking cytokine storm in patients. the low-dimensional representation vectors generated by selfrl are fed into the imc algorithm for predicting the confidence scores between il-6 and each drug in neodti-net dataset. then, the top-10 agents with the highest confidence scores are selected as potential therapeutic agents for covid-19 patients. the 10 candidate drugs and their anti-inflammatory mechanisms of action in silico is shown in table 4 . the knowledge from pubmed publications demonstrates that nine out of ten drugs are able to reduce the release and express of il-6 for exerting anti-inflammatory effects in silico. meanwhile, there are three drugs (i.e., dasatinib, carvedilol, and indomethacin) that inhibit the release of il-6 by reducing the mrna levels of il-6. however, imatinib inhibits the function of human monocytes to prevent the expression of il-6. in addition, although the anti-inflammatory mechanisms of action of five agents (i.e., arsenic trioxide, irbesartan, amiloride, propranolol, sorafenib) are uncertain, these agents can still reduce the release or expression of il-6 for preforming anti-inflammatory effects. therefore, the top ten agents predicted by selfrl-based drug repositioning is able to be used for inhibiting cytokine storm in patients with covid-19, and should be taken into consideration in clinical studies on covid-19. these results further indicate that the proposed selfrl is a powerful network representation learning approach, and can facilitate the link prediction in hbns. in this study, selfrl uses transformer encoders to learn representation vectors by the proposed vertex entity mask and meta path detection tasks. meanwhile, the entity-and pathtable 5 : the dti and dda prediction result of selfrl and baseline methods on the neodti-net and deepdr-net datasets. the mlth and clth stand for the mean and concatenation values of representation from the last two hidden layers, respectively. atlre denotes the mean value of the two-level representation from the last hidden layer. table 5 . selfrl achieves the best performance. meanwhile, the results show that the two-level representation are superior to the single level representation. interestingly, the concatenation of vectors from the lth layers is beneficial to improving the link prediction performance compared to the mean value of the vectors from the lth layers for each level representation model. this is intuitive since two-level representation can fuse the structural and semantic information from different views in hbns. meanwhile, larger number of dimensions can provide more and richer information. 
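for the drug repositioning experiment described above, the ranking step amounts to scoring every drug against the il-6 embedding and keeping the top ten candidates. the sketch below assumes an imc-style bilinear scoring matrix z; the embeddings, z, and the drug names are random placeholders, so the printed ranking only illustrates the mechanics.

```python
import numpy as np

def rank_drugs_for_target(drug_emb, target_vec, Z, top_k=10):
    """Score drugs against one target with a bilinear (IMC-style) model.

    drug_emb: dict drug_name -> embedding vector
    target_vec: embedding vector of the target (e.g., IL-6)
    Z: learned interaction matrix; score(d, t) = d^T Z t
    """
    scores = {name: float(vec @ Z @ target_vec) for name, vec in drug_emb.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# toy example with random placeholders standing in for learned quantities
rng = np.random.default_rng(0)
dim = 8
drug_emb = {f'drug_{i}': rng.normal(size=dim) for i in range(50)}
target_vec = rng.normal(size=dim)          # would be the IL-6 embedding
Z = rng.normal(size=(dim, dim))            # would be the trained IMC matrix
print(rank_drugs_for_target(drug_emb, target_vec, Z, top_k=10))
```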
this study proposes a two-level self-supervised representation learning, termed selfrl, for link prediction in heterogeneous biomedical networks. the proposed selfrl designs a meta path detection-based self-supervised learning task, and integrates vertices entity-level mask tasks to capture the rich structure and semantics from two-level views of hbns. the results of link prediction indicate that selfrl is superior to 25 state-of-the-art approaches on six datasets. in the future, we will design more self-supervised learning tasks with unable data to improve the representation performance of the model. in addition, we will also developed the effective multi-task learning framework in the proposed model. distributed large-scale natural graph factorization drug-target interaction prediction through domain-tuned network-based inference laplacian eigenmaps and spectral techniques for embedding and clustering laplacian eigenmaps for dimensionality reduction and data representation new directions for diffusion-based network prediction of protein function: incorporating pathways with confidence deep neural network for learning graph representations grarep: learning graph representations with global structural information network representation learning with rich text information support-vector networks a survey on network embedding bert: pre-training of deep bidirectional transformers for language understanding predicting drug target interactions using meta-pathbased semantic network analysis kernelized bayesian matrix factorization node2vec: scalable feature learning for networks provable inductive matrix completion attention based meta path fusion forheterogeneous information network embedding variational graph auto-encoders. arxiv:machine learning random forests deepcas: an end-to-end predictor of information cascades predicting drug-target interaction using a novel graph neural network with 3d structure-embedded graph representation a network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information covid-19: consider cytokine storm syndromes and immunosuppression. the lancet drug?target interaction prediction by learning from local information and neighbors distributed representations of words and phrases and their compositionality asymmetric transitivity preserving graph embedding deepwalk: online learning of social representations struc2vec: learning node representations from structural identity. in knowledge discovery and data mining nonlinear dimensionality reduction by locally linear embedding heterogeneous information network embedding for recommendation prediction and validation of gene-disease associations using methods inspired by social network analyses network embedding in biomedical data science pathsim: meta path-based top-k similarity search in heterogeneous information networks line: large-scale information network embedding neodti: neural integration of neighbor information from a heterogeneous network for discovering new drug-target interactions glue: a multi-task benchmark and analysis platform for natural language understanding structural deep network embedding graphgan: graph representation learning with generative adversarial nets shine:signed heterogeneous information network embedding for sentiment link prediction semisupervised drug-protein interaction prediction from heterogeneous biological spaces self-supervised learning: generative or contrastive. 
arxiv doi network representation learning-based drug mechanism discovery and anti-inflammatory response against a novel approach for drug response prediction in cancer cell lines via network representation learning graph embedding on biomedical networks: methods, applications and evaluations heterogeneous network representation learning using deep learning deepdr: a network-based deep learning approach to in silico drug repositioning key: cord-007415-d57zqixs authors: da fontoura costa, luciano; sporns, olaf; antiqueira, lucas; das graças volpe nunes, maria; oliveira, osvaldo n. title: correlations between structure and random walk dynamics in directed complex networks date: 2007-07-30 journal: appl phys lett doi: 10.1063/1.2766683 sha: doc_id: 7415 cord_uid: d57zqixs in this letter the authors discuss the relationship between structure and random walk dynamics in directed complex networks, with an emphasis on identifying whether a topological hub is also a dynamical hub. they establish the necessary conditions for networks to be topologically and dynamically fully correlated (e.g., word adjacency and airport networks), and show that in this case zipf's law is a consequence of the match between structure and dynamics. they also show that real-world neuronal networks and the world wide web are not fully correlated, implying that their more intensely connected nodes are not necessarily highly active. © 2007 american institute of physics. [doi: 10.1063/1.2766683] we address the relationship between structure and dynamics in complex networks by taking the steady-state distribution of the frequency of visits to nodes-a dynamical feature-obtained by performing random walks [1] along the networks. a complex network [2-5] is taken as a graph with directed edges and associated weights, which are represented in terms of the weight matrix w. the n nodes in the network are numbered as i = 1, 2, ..., n, and a directed edge with weight m, extending from node j to node i, is represented as w(i, j) = m. no self-connections (loops) are considered. the in and out strengths of a node i, abbreviated as is(i) and os(i), correspond to the sum of the weights of its in- and outbound connections, respectively. the stochastic matrix s for such a network is the matrix with elements s(i, j) = w(i, j)/os(j). s is assumed to be irreducible; i.e., any of its nodes is accessible from any other node, which allows the definition of a unique and stable steady state. an agent, placed at any initial node j, chooses among the adjacent outbound edges of node j with probability equal to s(i, j). this step is repeated a large number of times t, and the frequency of visits to each node i is calculated as v(i) = (number of visits during the walk)/t.
in the steady state (i.e., after a long time period t), v = sv, and the frequency of visits to each node along the random walk may be calculated in terms of the eigenvector associated with the unit eigenvalue (e.g., ref. 6). for proper statistical normalization we set sum_p v(p) = 1. the dominant eigenvector of the stochastic matrix has theoretically and experimentally been verified to be remarkably similar to the corresponding eigenvector of the weight matrix, implying that the adopted random walk model shares several features with other types of dynamics, including linear and nonlinear summations of activations and flow in networks. in addition to providing a modeling approach intrinsically compatible with dynamics involving successive visits to nodes by a single or multiple agents, such as is the case with world wide web (www) navigation, text writing, and transportation systems, random walks are directly related to diffusion. more specifically, as time progresses, the frequency of visits to each network node approaches the activity values which would be obtained by the traditional diffusion equation. a full congruence between such frequencies and activity diffusion is obtained at the equilibrium state of the random walk process. therefore, random walks are also directly related to the important phenomenon of diffusion, which plays an important role in a large number of linear and nonlinear dynamic systems including disease spreading and pattern formation. random walks are also intrinsically connected to markov chains, electrical circuits, and flows in networks, and even dynamical models such as ising. for such reasons, random walks have become one of the most important and general models of dynamics in physics and other areas, constituting a primary choice for investigating dynamics in complex networks. the correlations between activity (the frequency of visits to nodes v) and topology (out strength os or in strength is) can be quantified in terms of the pearson correlation coefficient r. for full activity-topology correlation in directed networks, i.e., |r| = 1 between v and os or between v and is, it is enough that (i) the network must be strongly connected, i.e., s is irreducible, and (ii) for any node, the in strength must be equal to the out strength. the proof of the statement above is as follows. because the network is strongly connected, its stochastic matrix s has a unit eigenvector in the steady state, i.e., v = sv. since s(i, j) = w(i, j)/os(j), the ith element of the vector s os is given as (s os)(i) = sum_j s(i, j) os(j) = sum_j w(i, j) = is(i). by hypothesis, is(i) = os(i) for any i and, therefore, both os and is are eigenvectors of s associated with the unit eigenvalue. then os = is = v, implying full correlation between the frequency of visits and both the in and out strengths. an implication of this derivation is that for perfectly correlated networks, the frequency of symbols produced by random walks will be equal to the out strength or in strength distributions. therefore, an out strength scale-free [3] network must produce sequences obeying zipf's law [7], and vice versa. if, on the other hand, the node distribution is gaussian, the frequency of visits to nodes will also be a gaussian function; that is to say, the distribution of nodes is replicated in the node activation.
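the quantities discussed above are easy to reproduce numerically. the sketch below builds s from a weight matrix w (with the convention w[i, j] = weight of the edge j -> i), iterates the walk to its steady state, and compares the frequency of visits with the in and out strengths; the lazy update is an implementation detail added here so that the iteration also converges on periodic networks, and the three-node example is invented to satisfy condition (ii).

```python
import numpy as np

def visit_frequencies(W, n_iter=100000, tol=1e-12):
    """Steady-state frequency of visits of a random walk on a directed,
    weighted network. Assumes W is strongly connected (S irreducible)."""
    out_strength = W.sum(axis=0)            # OS(j) = sum_i W(i, j)
    S = W / out_strength                    # S(i, j) = W(i, j) / OS(j)
    n = W.shape[0]
    v = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        # "lazy" update: same fixed point as v = S v, but converges
        # even when the walk is periodic
        v_new = 0.5 * (v + S @ v)
        v_new /= v_new.sum()
        if np.abs(v_new - v).max() < tol:
            break
        v = v_new
    return v

# small example where IS(i) = OS(i) for every node: two cycles sharing node 0
W = np.array([[0.0, 1.0, 1.0],     # edges 1 -> 0 and 2 -> 0
              [1.0, 0.0, 0.0],     # edge 0 -> 1
              [1.0, 0.0, 0.0]])    # edge 0 -> 2
v = visit_frequencies(W)
in_strength = W.sum(axis=1)
out_strength = W.sum(axis=0)
print(v)                                            # approx [0.5, 0.25, 0.25]
print(np.corrcoef(v, in_strength)[0, 1],            # r = 1 with IS
      np.corrcoef(v, out_strength)[0, 1])           # r = 1 with OS
```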
although the correlation between node strength and random walk dynamics in undirected networks has been established before [8] (including full correlation [9,10]), the findings reported here are more general since they are related to any directed weighted network, such as the www and the airport network. indeed, the correlation conditions for undirected networks can be understood as a particular case of the conditions above. a fully correlated network will have |r| = 1. we obtained r = 1 for texts by darwin [11] and wodehouse [12] and for the network of airports in the usa [13]. the word association network was obtained by representing each distinct word as a node, while the edges were established by the sequence of immediately adjacent words in the text after the removal of stopwords [14] and lemmatization [15]. more specifically, the fact that word u has been followed by word v, m times during the text, is represented as w(v, u) = m. zipf's law is known to apply to this type of network [16]. the airport network presents a link between two airports if there exists at least one flight between them. the number of flights performed in one month was used as the strength of the edges. we obtained r for various real networks (table i), including the fully correlated networks mentioned above. to interpret these data, we recall that a small r means that a hub (large in or out strength) in topology is not necessarily a center of activity. notably, in all cases considered r is greater for the in strength than for the out strength. this may be understood with a trivial example of a node from which a high number of links emerge (implying large out strength) but which has only very few inbound links. this node, in a random walk model, will be rarely occupied and thus cannot be a center of activity, though it will strongly affect the rest of the network by sending activation to many other targets. understanding why a hub in terms of in strength may fail to be very active is more subtle. consider a central node receiving links from many other nodes arranged in a circle, i.e., the central node has a large in strength but with the surrounding nodes possessing small in strength. in other words, if a node i receives several links from nodes with low activity, this node i will likewise be fairly inactive. in order to further analyze the latter case, we may examine the correlations between the frequency of visits to each node i and the cumulative hierarchical in and out strengths of that node. the hierarchical degree [17-19] of a network node provides a natural extension of the traditional concept of node degree.
table i. number of nodes (no. nodes), number of edges (no. edges), means and standard deviations of the clustering coefficient (cc), cumulative hierarchical in strengths for levels 1-4 (is1-is4), cumulative hierarchical out strengths for levels 1-4 (os1-os4), and the pearson correlation coefficients between the activation and all cumulative hierarchical in strengths and out strengths (r_is1 to r_os4) for the complex networks considered in the present work.
for the least correlated network analyzed, viz., that of the largest strongly connected cluster in the network of www links in the domain of ref. 21 (massey university, new zealand) (refs. 22 and 23), activity could not be related to in strength at any hierarchical level.
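a sketch of a cumulative hierarchical in strength follows. the exact definitions are those of refs. 17-19; the version below is a simplified reading used only for illustration: ring h collects the nodes whose shortest directed path into the target node has length h, and the cumulative level-h in strength sums the weights of edges crossing from ring h into ring h-1, for h = 1, ..., level. with this reading, level 1 reduces to the ordinary in strength.

```python
import numpy as np
from collections import deque

def in_rings(W, node, max_level):
    """Group nodes by the length of the shortest directed path INTO `node`:
    ring 0 is the node itself, ring 1 its direct predecessors, and so on."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        if dist[u] == max_level:
            continue
        for p in map(int, np.nonzero(W[u])[0]):   # W[u, p] > 0 means edge p -> u
            if p not in dist:
                dist[p] = dist[u] + 1
                queue.append(p)
    rings = [[] for _ in range(max_level + 1)]
    for v, d in dist.items():
        rings[d].append(v)
    return rings

def cumulative_hierarchical_in_strength(W, node, level):
    """Sum, over h = 1..level, the weights of edges from ring h into ring h-1."""
    rings = in_rings(W, node, level)
    total = 0.0
    for h in range(1, level + 1):
        total += sum(W[i, j] for i in rings[h - 1] for j in rings[h])
    return total

# with level = 1 this is just the ordinary in strength of the node
W = np.array([[0.0, 2.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
print(cumulative_hierarchical_in_strength(W, 0, 1), W[0].sum())
```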
because the pearson coefficient corresponds to a single real value, it cannot adequately express the coexistence of the many relationships between activity and degrees present in this specific network as well as possibly heterogeneous topologies. very similar results were obtained for other www networks, which indicate that the reasons why topological hubs have not been highly active cannot be identified at the present moment (see, however, the discussion for higher correlated networks below). however, for the two neuronal structures of table i that are not fully correlated (the network defined by the interconnectivity between cortical regions of the cat [24] and the network of synaptic connections in c. elegans [25]), activity was shown to increase with the cumulative first and second hierarchical in strengths. in the cat cortical network, each cortical region is represented as a node, and the interconnections are reflected by the network edges. significantly, in a previous paper [26], it was shown that when connections between cortex and thalamus were included, the correlation between activity and outdegree increased significantly. this could be interpreted as a result of increased efficiency, with the topological hubs becoming highly active. furthermore, for the fully correlated networks, such as word associations obtained for texts by darwin and wodehouse, activity increased basically with the square of the cumulative second hierarchical in strength (see supplementary fig. 2 in ref. 20). in addition, the correlations obtained for these two authors are markedly distinct, as the work of wodehouse is characterized by a substantially steeper increase of the frequency of visits for large in strength values (see supplementary fig. 3 in ref. 20). therefore, the results considering higher cumulative hierarchical degrees may serve as a feature for authorship identification. in conclusion, we have established (i) a set of conditions for full correlation between topological and dynamical features of directed complex networks and demonstrated that (ii) zipf's law can be naturally derived for fully correlated networks. result (i) is of fundamental importance for studies relating the dynamics and connectivity in networks, with critical practical implications. for instance, it not only demonstrates that hubs of connectivity may not correspond to hubs of activity but also provides a sufficient condition for achieving full correlation. result (ii) is also of fundamental importance as it relates two of the most important concepts in complex systems, namely, zipf's law and scale-free networks. even though sharing the feature of power law, these two key concepts had been extensively studied on their own. the result reported in this work paves the way for important additional investigations, especially by showing that zipf's law may be a consequence of dynamics taking place in scale-free systems. in the cases where the network is not fully correlated, the pearson coefficient may be used as a characterizing parameter. for a network with very small correlation, such as the www links between the pages in a new zealand domain analyzed here, the reasons for hubs failing to be active could not be identified, probably because of the substantially higher complexity and heterogeneity of this network, including varying levels of clustering coefficients, as compared to the neuronal networks. this work was financially supported by fapesp and cnpq (brazil). luciano da f. costa thanks grants 05/00587-5 (fapesp) and 308231/03-1 (cnpq).
references: markov chains: gibbs fields, monte carlo simulation, and queues (springer); the formation of vegetable mould through the action of worms, with observations on their habits (murray); the pothunters (a & c black); bureau of transportation statistics: airline on-time performance data; modern information retrieval (addison-wesley); the oxford handbook of computational linguistics (oxford); human behaviour and the principle of least effort (addison-wesley). key: cord-010751-fgk05n3z authors: holme, petter title: objective measures for sentinel surveillance in network epidemiology date: 2018-08-15 journal: nan doi: 10.1103/physreve.98.022313 sha: doc_id: 10751 cord_uid: fgk05n3z assume one has the capability of determining whether a node in a network is infectious or not by probing it. then the problem of optimizing sentinel surveillance in networks is to identify the nodes to probe such that an emerging disease outbreak can be discovered early or reliably. whether the emphasis should be on early or reliable detection depends on the scenario in question. we investigate three objective measures from the literature quantifying the performance of nodes in sentinel surveillance: the time to detection or extinction, the time to detection, and the frequency of detection. as a basis for the comparison, we use the susceptible-infectious-recovered model on static and temporal networks of human contacts. we show that, for some regions of parameter space, the three objective measures can rank the nodes very differently. this means sentinel surveillance is a class of problems, and solutions need to choose an objective measure for the particular scenario in question. as opposed to other problems in network epidemiology, we draw similar conclusions from the static and temporal networks. furthermore, we do not find one type of network structure that predicts the objective measures; i.e., which structure predicts best depends both on the data set and the sir parameter values. infectious diseases are a big burden to public health. their epidemiology is a topic wherein the gap between the medical and theoretical sciences is not so large. several concepts of mathematical epidemiology-like the basic reproductive number or core groups [1] [2] [3] -have entered the vocabulary of medical scientists. traditionally, authors have modeled disease outbreaks in society by assuming any person to have the same chance of meeting anyone else at any time. this is of course not realistic, and improving this point is the motivation for network epidemiology: epidemic simulations between people connected by a network [4] . one can continue increasing the realism in the contact patterns by observing that the timing of contacts can also have structures capable of affecting the disease. studying epidemics on time-varying contact structures is the basis of the emerging field of temporal network epidemiology [5] [6] [7] [8] . one of the most important questions in infectious disease epidemiology is to identify people, or in more general terms, units, that would get infected early and with high likelihood in an infectious outbreak. this is the sentinel surveillance problem [9, 10] . it is the aspect of node importance that is most actively used in public health practice. typically, it works by selecting some hospitals (clinics, cattle farms, etc.) to screen, or more frequently test, for a specific infection [11] . defining an objective measure-a quantity to be maximized or minimized-for sentinel surveillance is not trivial.
it depends on the particular scenario one considers and the means of interventions at hand. if the goal for society is to detect as many outbreaks as possible, it makes sense to choose sentinels to * holme@cns.pi.titech.ac.jp maximize the fraction of detected outbreaks [9] . if the objective rather is to discover outbreaks early, then one could choose sentinels that, if infected, are infected early [10, 12] . finally, if the objective is to stop the disease as early as possible, it makes sense to measure the time to extinction or detection (infection of a sentinel) [13] . see fig. 1 for an illustration. to restrict ourselves, we will focus on the case of one sentinel. if one has more than one sentinel, the optimal set will most likely not be the top nodes of a ranking according to the three measures above. their relative positions in the network also matter (they should not be too close to each other) [13] . in this paper, we study and characterize our three objective measures. we base our analysis on 38 empirical data sets of contacts between people. we analyze them both in temporal and static networks. the reason we use empirical contact data, rather than generative models, as the basis of this study is twofold. first, there are so many possible structures and correlations in temporal networks that one cannot tune them all in models [8] . it is also hard to identify the most important structures for a specific spreading phenomenon [8] . second, studying empirical networks makes this paper-in addition to elucidating the objective measures of sentinel surveillance-a study of human interaction. we can classify data sets with respect how the epidemic dynamics propagate on them. as mentioned above, in practical sentinel surveillance, the network in question is rather one of hospitals, clinics or farms. one can, however, also think of sentinel surveillance of individuals, where high-risk individuals would be tested extra often for some diseases. in the remainder of the paper, we will describe the objective measures, the structural measures we use for the analysis, and the data sets, and we will present the analysis itself. we will primarily focus on the relation between the measures, secondarily on the structural explanations of our observations. assume that the objective of society is to end outbreaks as soon as possible. if an outbreak dies by itself, that is fine. otherwise, one would like to detect it so it could be mitigated by interventions. in this scenario, a sensible objective measure would be the time for a disease to either go extinct or be detected by a sentinel: the time to detection or extinction t x [13] . suppose that, in contrast to the situation above, the priority is not to save society from the epidemics as soon as possible, but just to detect outbreaks fast. this could be the case if one would want to get a chance to isolate a pathogen, or start producing a vaccine, as early as possible, maybe to prevent future outbreaks of the same pathogen at the earliest possibility. then one would seek to minimize the time for the outbreak to be detected conditioned on the fact that it is detected: the time to detection t d . for the time to detection, it does not matter how likely it is for an outbreak to reach a sentinel. if the objective is to detect as many outbreaks as possible, the corresponding measure should be the expected frequency of outbreaks to reach a node: the frequency of detection f d . 
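a minimal sketch of how these three quantities, the time to detection or extinction t x, the time to detection t d, and the frequency of detection f d, could be tabulated for one candidate sentinel from repeated outbreak simulations. the per-run record format (a detection time, or none, plus an extinction time) is an assumption made for illustration and is not the code used in the paper.

import math

def sentinel_scores(runs):
    """Each element of `runs` describes one simulated outbreak as a dict with
    't_detect' (time the candidate sentinel was infected, or None if never)
    and 't_extinct' (time the outbreak died out).  Returns the three
    objective measures discussed in the text."""
    detected = [r["t_detect"] for r in runs if r["t_detect"] is not None]
    # time to detection or extinction: whichever happens first in each run
    t_x = sum(min(r["t_detect"], r["t_extinct"]) if r["t_detect"] is not None
              else r["t_extinct"] for r in runs) / len(runs)
    # time to detection, conditioned on detection happening at all
    t_d = sum(detected) / len(detected) if detected else math.inf
    # frequency of detection: fraction of outbreaks that reach the sentinel
    f_d = len(detected) / len(runs)
    return t_x, t_d, f_d

runs = [{"t_detect": 3.0, "t_extinct": 9.0},
        {"t_detect": None, "t_extinct": 1.5},
        {"t_detect": 6.0, "t_extinct": 12.0}]
print(sentinel_scores(runs))   # (3.5, 4.5, 0.666...)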
note that for this measure a large value means the node is a good sentinel, whereas for t x and t d a good sentinel has a low value. this means that when we correlate the measures, a similar ranking between t x and f d or t d and f d yields a negative correlation coefficient. instead of considering the inverse times, or similar, we keep this feature and urge the reader to keep this in mind. there are many possible ways to reduce our empirical temporal networks to static networks. the simplest method would be to just include a link between any pair of nodes that has at least one contact during the course of the data set. this would however make some of the networks so dense that the static network structure of the node-pairs most actively in contact would be obscured. for our purpose, we primarily want our network to span many types of network structures that can impact epidemics. without any additional knowledge about the epidemics, the best option is to threshold the weighted graph where an edge (i, j ) means that i and j had more than θ contacts in the data set. in this work, we assume that we do not know what the per-contact transmission probability β is (this would anyway depend on both the disease and precise details of the interaction). rather we scan through a very large range of β values. since we anyway to that, there is no need either to base the choice of θ on some epidemiological argument, or to rescale β after the thresholding. note that the rescaled β would be a non-linear function of the number of contacts between i and j . (assuming no recovery, for an isolated link with ν contacts, the transmission probability is 1 − (1 − β ) ν .) for our purpose the only thing we need is that the rescaled β is a monotonous function of β for the temporal network (which is true). to follow a simple principle, we omit all links with a weight less than the median weight θ . we simulate disease spreading by the sir dynamics, the canonical model for diseases that gives immunity upon recovery [2, 14] . for static networks, we use the standard markovian version of the sir model [15] . that is, we assume that diseases spread over links between susceptible and infectious nodes the infinitesimal time interval dt with a probability β dt. then, an infectious node recovers after a time that is exponentially distributed with average 1/ν. the parameters β and ν are called infection rate and recovery rate, respectively. we can, without loss of generality, put ν = 1/t (where t is the duration of the sampling). for other ν values, the ranking of the nodes would be the same (but the values of the t x and t d would be rescaled by a factor ν). we will scan an exponentially increasing progression of 200 values of β, from 10 −3 to 10. the code for the disease simulations can be downloaded [16] . for the temporal networks, we use a definition as close as possible to the one above. we assume an exponentially distributed duration of the infectious state with mean 1/ν. we assume a contact between an infectious and susceptible node results in a new infection with probability β. in the case of temporal networks, one cannot reduce the problem to one parameter. like for static networks, we sample the parameter values in exponential sequences in the intervals 0.01 β 1 and 0.01 ν/t 1 respectively. for temporal networks, with our interpretation of a contact, β > 1 makes no sense, which explains the upper limit. 
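the temporal-network version of the sir model described above can be sketched directly on a time-stamped contact list. the code below, in python, is a minimal illustration that assumes contacts are given as (time, node, node) triples and that the seed node and starting time are supplied by the caller; it is not the code used in the paper.

import random

def temporal_sir(contacts, n_nodes, beta, nu, seed_node, t_start):
    """SIR on a time-stamped contact list (t, i, j), following the rules in
    the text: each contact between an infectious and a susceptible node
    transmits with probability beta; infectious periods are exponentially
    distributed with mean 1/nu.  Returns the infection time of each node."""
    state = {v: "S" for v in range(n_nodes)}
    recovery_time = {}
    infection_time = {seed_node: t_start}
    state[seed_node] = "I"
    recovery_time[seed_node] = t_start + random.expovariate(nu)
    for t, i, j in sorted(contacts):
        if t < t_start:
            continue
        # let infectious nodes recover before the contact is evaluated
        for v in (i, j):
            if state[v] == "I" and t > recovery_time[v]:
                state[v] = "R"
        # attempt transmission in both directions across the contact
        for u, v in ((i, j), (j, i)):
            if state[u] == "I" and state[v] == "S" and random.random() < beta:
                state[v] = "I"
                infection_time[v] = t
                recovery_time[v] = t + random.expovariate(nu)
    return infection_time

contacts = [(0, 0, 1), (1, 1, 2), (2, 2, 3), (3, 0, 3), (4, 3, 4)]
print(temporal_sir(contacts, n_nodes=5, beta=0.8, nu=0.05, seed_node=0, t_start=0))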
furthermore, since temporal networks usually are effectively sparser (in terms of the number of possible infection events per time), the smallest β values will give similar results, which is the reason for the higher cutoff in this case. for both temporal and static networks, we assume the outbreak starts at one randomly chosen node. analogously, in the temporal case we assume the disease is introduced with equal probability at any time throughout the sampling period. for every data set and set of parameter values, we sample 10 7 runs of epidemic simulations. as motivated in the introduction, we base our study on empirical temporal networks. all networks that we study record contacts between people and falls into two classes: human proximity networks and communication networks. proximity networks are, of course, most relevant for epidemic studies, but communication networks can serve as a reference (and it is interesting to see how general results are over the two classes). the data sets consist of anonymized lists of two identification numbers in contact and the time since the beginning of the contact. many of the proximity data sets we use come from the sociopatterns project [17] . these data sets were gathered by people wearing radio-frequency identification (rfid) sensors that detect proximity between 1 and 1.5 m. one such datasets comes from a conference, hypertext 2009, (conference 1) [18] , another two from a primary school (primary school) [19] and five from a high school (high school) [20] , a third from a hospital (hospital) [21] , a fourth set of five data sets from an art gallery (gallery) [22] , a fifth from a workplace (office) [23] , and a sixth from members of five families in rural kenya [24] . the gallery data sets consist of several days where we use the first five. in addition to data gathered by rfid sensors, we also use data from the longer-range (around 10m) bluetooth channel. the cambridge 1 [25] and 2 [26] datasets were measured by the bluetooth channel of sensors (imotes) worn by people in and around cambridge, uk. st andrews [27] , conference 2 [25] , and intel [25] are similar data sets tracing contacts at, respectively, the university of st. andrews, the conference infocom 2006, and the intel research laboratory in cambridge, uk. the reality [28] and copenhagen bluetooth [29] data sets also come from bluetooth data, but from smartphones carried by university students. in the romania data, the wifi channel of smartphones was used to log the proximity between university students [30] , whereas the wifi dataset links students of a chinese university that are logged onto the same wifi router. for the diary data set, a group of colleagues and their family members were self-recording their contacts [31] . our final proximity data, the prostitution network, comes from from self-reported sexual contacts between female sex workers and their male sex buyers [32] . this is a special form of proximity network since contacts represent more than just proximity. among the data sets from electronic communication, facebook comes from the wall posts at the social media platform facebook [33] . college is based on communication at a facebook-like service [34] . dating shows interactions at an early internet dating website [35] . messages and forum are similar records of interaction at a film community [36] . copenhagen calls and copenhagen sms consist of phone calls and text messages gathered in the same experiment as copenhagen bluetooth [29] . 
finally, we use four data sets of e-mail communication. one, e-mail 1, recorded all e-mails to and from a group of accounts [37] . the other three, e-mail 2 [38] , 3 [39] , and 4 [40] recorded e-mails within a set of accounts. we list basic statistics-sizes, sampling durations, etc.-of all the data sets in table i. (table i. basic statistics of the empirical temporal networks: n is the number of nodes, c is the number of contacts, t is the total sampling time, δt is the time resolution of the data set, m is the number of links in the projected and thresholded static networks, and θ is the threshold.) to gain further insight into the network structures promoting the objective measures, we correlate the objective measures with quantities describing the position of a node in the static networks. since many of our networks are fragmented into components, we restrict ourselves to measures that are well defined for disconnected networks. otherwise, in our selection, we strive to cover as many different aspects of node importance as we can. degree is simply the number of neighbors of a node. it is usually presented as the simplest measure of centrality and one of the most discussed structural predictors of importance with respect to disease spreading [42] . (centrality is a class of measures of a node's position in a network that try to capture what a "central" node is; i.e., ultimately centrality is not more well-defined than the vernacular word.) it is also a local measure in the sense that a node is able to estimate its degree, which could be practical when evaluating sentinel surveillance in real networks. subgraph centrality is based on the number of closed walks a node is a member of. (a walk is a path that could be overlapping itself.) the number of closed walks of length λ from node i to itself is given by (a λ ) ii , where a is the adjacency matrix. reference [43] argues that the best way to weigh walks of different lengths together is through the formula c s (i) = Σ λ (a λ ) ii /λ!, with the sum running over all walk lengths λ. as mentioned, several of the data sets are fragmented (even though the largest connected component dominates components of other sizes). in the limit of high transmission probabilities, all nodes in the component of the infection seed will be infected. in such a case it would make sense to place a sentinel in the largest component (where the disease most likely starts). closeness centrality builds on the assumption that a node that has, on average, short distances to other nodes is central [44] . here, the distance d(i, j ) between nodes i and j is the number of links in the shortest paths between the nodes. the classical measure of closeness centrality of a node i is the reciprocal average distance between i and all other nodes. in a fragmented network, for all nodes, there will be some other node that it does not have a path to, meaning that the closeness centrality is ill defined. (assigning the distance infinity to disconnected pairs would give the closeness centrality zero for all nodes.) a remedy for this is, instead of measuring the reciprocal average of distances, measuring the average reciprocal distance [45] , where d −1 (i, j ) = 0 if i and j are disconnected. we call this the harmonic closeness by analogy to the harmonic mean. vitality measures are a class of network descriptors that capture the impact of deleting a node on the structure of the entire network [46, 47] . specifically, we measure the harmonic closeness vitality, or harmonic vitality, for short.
this is the change of the sum of reciprocal distances of the graph (thus, by analogy to the harmonic closeness, well defined even for disconnected graphs): it is computed as the ratio between the sum of reciprocal distances of the full graph g and the corresponding sum for the graph with the node i deleted, which forms the denominator. if deleting i breaks many shortest paths, then the denominator decreases, and thus c v (i) increases. a node whose removal disrupts many shortest paths would thus score high in harmonic vitality. our sixth structural descriptor is coreness. this measure comes out of a procedure called k-core decomposition. first, remove all nodes with degree k = 1. if this would create new nodes with degree one, delete them too. repeat this until there are no nodes of degree 1. then, repeat the above steps for larger k values. the coreness of a node is the last level when it is present in the network during this process [48] . like for the static networks, in the temporal networks we measure the degree of the nodes. to be precise, we define the degree as the number of distinct other nodes a node is in contact with within the data set. strength is the total number of contacts a node has participated in throughout the data set. unlike degree, it takes the number of encounters into account. temporal networks, in general, tend to be more disconnected than static networks. for node i to be connected to j in a temporal network there has to be a time-respecting path from i to j , i.e., a sequence of contacts increasing in time that (if time is projected out) is a path from i to j [7, 8] . thus two interesting quantities-corresponding to the component sizes of static networks-are the fraction of nodes reachable from a node by time-respecting paths forward (downstream component size) and backward in time (upstream component size) [49] . if a node only exists in the very early stage of the data, the sentinel will likely not be active by the time the outbreak happens. if a node is active only at the end of the data set, it would also be too late to discover an outbreak early. for these reasons, we measure statistics of the times of the contacts of a node. we measure the average time of all contacts a node participates in; the first time of a contact (i.e., when the node enters the data set); and the duration of the presence of a node in the data (the time between the first and last contact it participates in). we use a version of the kendall τ coefficient [50] to elucidate both the correlations between the three objective measures, and between the objective measures and network structural descriptors. in its basic form, the kendall τ measures the difference between the number of concordant (with a positive slope between them) and discordant pairs relative to all pairs. there are a few different versions that handle ties in different ways. we count a pair of points whose error bars overlap as a tie and calculate τ = (n c − n d )/(n c + n d + n t ), where n c is the number of concordant pairs, n d is the number of discordant pairs, and n t is the number of ties. we start investigating the correlation between the three objective measures throughout the parameter space of the sir model for all our data sets. we use the time to detection and extinction as our baseline and compare the other two objective measures with that. in fig. 2 , we plot the τ coefficient between t x and t d and between t x and f d . we find that for low enough values of β, the τ for all objective measures coincide. for very low β the disease just dies out immediately, so the measures are trivially equal: all nodes would be as good sentinels in all three aspects.
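the structural descriptors and the tie-adjusted kendall τ defined above can be combined in a short sketch. the code below, in python with networkx, collects degree, subgraph centrality, coreness, harmonic closeness (as an average reciprocal distance) and component size, and correlates each with a toy node score; harmonic vitality is omitted because it requires recomputing distances with each node removed, and the error-bar rule for ties is replaced by a simple numerical tolerance. none of this is the code used in the paper.

import networkx as nx

def descriptors(g):
    """Static descriptors from the text: degree, subgraph centrality,
    coreness, harmonic closeness (average reciprocal distance), and the
    size of the node's connected component."""
    n = g.number_of_nodes()
    harm = {v: h / (n - 1) for v, h in nx.harmonic_centrality(g).items()}
    comp = {v: len(c) for c in nx.connected_components(g) for v in c}
    return {"degree": dict(g.degree()),
            "subgraph": nx.subgraph_centrality(g),   # sum_k (A^k)_ii / k!
            "coreness": nx.core_number(g),
            "harmonic": harm,
            "component": comp}

def kendall_tau_ties(x, y, err=0.0):
    """tau = (n_c - n_d) / (n_c + n_d + n_t); pairs closer than `err` in
    either coordinate are counted as ties, standing in for the overlapping
    error bars mentioned in the text."""
    n_c = n_d = n_t = 0
    idx = list(x.keys())
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            dx = x[idx[a]] - x[idx[b]]
            dy = y[idx[a]] - y[idx[b]]
            if abs(dx) <= err or abs(dy) <= err:
                n_t += 1
            elif dx * dy > 0:
                n_c += 1
            else:
                n_d += 1
    return (n_c - n_d) / (n_c + n_d + n_t)

g = nx.krackhardt_kite_graph()
d = descriptors(g)
# toy node scores standing in for an objective measure such as f_d
score = {v: 1.0 / (1.0 + nx.eccentricity(g, v)) for v in g}
for name, values in d.items():
    print(name, round(kendall_tau_ties(values, score), 2))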
for slightly larger β-for most data sets 0.01 < β < 0.1-both τ (t x , t d ) and τ (t x , f d ) are negative. this is a region where outbreaks typically die out early. for a node to have low t x , it needs to be where outbreaks are likely to survive, at least for a while. this translates to a large f d , while for t d , it would be beneficial to be as central as possible. if there are no extinction events at all, t x and t d are the same. for this reason, it is no surprise that, for most of the data sets, τ (t x , t d ) becomes strongly positively correlated for large β values. the τ (t x , f d ) correlation is negative (of a similar magnitude), meaning that for most data sets the different methods would rank the possible sentinels in the same order. for some of the data sets, however, the correlation never becomes positive even for large β values (like copenhagen calls and copenhagen sms). these networks are the most fragmented ones, meaning that one sentinel would be unlikely to detect the outbreak (since it probably happens in another component). this makes t x rank the important nodes in a way similar to f d , but since diseases that do reach a sentinel do it faster in a small component than a large one, t x and t d become anticorrelated. in fig. 3 , we perform the same analysis as in the previous section but for temporal networks. the picture is to some extent similar, but also much richer. just as for the case of static networks, τ (t x , f d ) is always nonpositive, meaning the time to detection or extinction ranks the nodes in a way positively correlated with the frequency of detection. furthermore, like the static networks, τ (t x , t d ) can be both positively and negatively correlated. this means that there are regions where t d ranks the nodes in the opposite way to t x . these regions of negative τ (t x , t d ) occur for low β and ν. for some data sets-for example the gallery data sets, dating, copenhagen calls, and copenhagen sms-the correlations are negative throughout the parameter space. among the data sets with a qualitative difference between the static and temporal representations, we find prostitution and e-mail 1 both have strongly positive values of τ (t x , t d ) for large β values in the static networks but moderately negative values for temporal networks. in this section, we take a look at how network structures affect our objective measures. in fig. 4 , we show the correlation between our three objective measures and the structural descriptors as a function of β for the office data set. panel (a) shows the results for the time to detection or extinction. there is a negative correlation between this measure and traditional centrality measures like degree or subgraph centrality. this is because t x is a quantity one wants to minimize to find the optimal sentinel, whereas for all the structural descriptors a large value means that a node is a candidate sentinel node. we see that degree and subgraph centrality are the two quantities that best predict the optimal sentinel location, while coreness is also close (at around −0.65). this is in line with research showing that certain biological problems are better determined by degree than more elaborate centrality measures [51] . overall, the τ curves are rather flat. this is partly explained by τ being a rank correlation. for t d [ fig. 4(b) ], most curves change behavior around β = 0.2. this is the region where larger outbreaks can happen, so one can understand there is a transition to a situation similar to t x [ fig.
4(a) ]. f d [fig. 4(c) ] shows a behavior similar to t d in that the curves start changing order, and what was a correlation at low β becomes an anticorrelation at high β. this anticorrelation is a special feature of this particular data set, perhaps due to its pronounced community structure. nodes of degree 0, 1, and 2 have a strictly increasing values of f d , but for some of the high degree nodes (that all have f d close to one) the ordering gets anticorrelated with degree which makes kendall's τ negative. since rank-based correlations are more principled for skew-distributed quantities common in networks, we keep them. we currently investigate what creates these unintuitive anticorrelations among the high degree nodes in this data set. next, we proceed with an analysis of all data sets. we summarize plots like fig. 4 by the structural descriptor with the largest magnitude of the correlation |τ |. see fig. 2 . we can see, that there is not one structural quantity that uniquely determines the ranking of nodes, there is not even one that dominates over (1) degree is the strongest structural determinant of all objective measures at low β values. this is consistent with ref. [13] . (2) component size only occurs for large β. in the limit of large β, f d is only determined by component size (if we would extend the analysis to even larger β, subgraph centrality would have the strongest correlation for the frequency of detection). (3) harmonic vitality is relatively better as a structural descriptor for t d , less so for t x and f d . t x and f d capture the ability of detecting an outbreak before it dies, so for these quantities one can imagine more fundamental quantities like degree and the component size are more important. (4) subgraph centrality often shows the strongest correlation for intermediate values of β. this is interesting, but difficult to explain since the rationale of subgraph centrality builds on cycle counts and there is no direct process involving cycles in the sir model. (5) harmonic closeness rarely gives the strongest correlation. if it does, it is usually succeeded by coreness and the data set is typically rather large. (6) datasets from the same category can give different results. perhaps college and facebook is the most conspicuous example. in general, however, similar data sets give similar results. the final observation could be extended. we see that, as β increases, one color tends to follow another. this is summarized in fig. 6 , where we show transition graphs of the different structural descriptors such that the size corresponds to their frequency in fig. 7 , and the size of the arrows show how often one structural descriptor is succeeded by another as β is increased. for t x , the degree and subgraph centrality are the most important structural descriptors, and the former is usually succeeded by the latter. for t d , there is a common peculiar sequence of degree, subgraph centrality, coreness component size, and harmonic vitality that is manifested as the peripheral, clockwise path of fig. 6(b) . finally, f d is similar to t x except that there is a rather common transition from degree to coreness, and harmonic vitality is, relatively speaking, a more important descriptor. in fig. 7 , we show the figure for temporal networks corresponding to fig. 5 . just like the static case, even though every data set and objective measure is unique, we can make some interesting observations. (1) strength is most important for small ν and β. 
this is analogous to degree dominating the static network at small parameter values. (2) upstream component size dominates at large ν and β. this is analogous to the component size of static networks. since temporal networks tend to be more fragmented than static ones [49] , this dominance at large outbreak sizes should be even more pronounced for temporal networks. (3) most of the variation happens in the direction of larger ν and β. in this direction, strength is succeeded by degree which is succeeded by upstream component size. (4) like the static case, and the analysis of figs. 5 and 7 , t x and f d are qualitatively similar compared to t d . (5) temporal quantities, such as the average and first times of a node's contacts, are commonly the strongest predictors of t d . (6) when a temporal quantity is the strongest predictor of t x and f d it is usually the duration. it is understandable that this has little influence on t d , since the ability to be infected at all matters for these measures; a long duration is beneficial since it covers many starting times of the outbreak. (7) similar to the static case, most categories of data sets give consistent results, but some differ greatly (facebook and college is yet again a good example). the bigger picture these observations paint is that, for our problem, the temporal and static networks behave rather similarly, meaning that the structures in time do not matter so much for our objective measures. at the same time, there is not only one dominant measure for all the data sets. rather are there several structural descriptors that correlate most strongly with the objective measures depending on ν and β. in this paper, we have investigated three different objective measures for optimizing sentinel surveillance: the time to detection or extinction, the time to detection (given that the detection happens), and the frequency of detection. each of these measures corresponds to a public health scenario: the time to detection or extinction is most interesting to minimize if one wants to halt the outbreak as quickly as possible, and the frequency of detection is most interesting if one wants to monitor the epidemic status as accurately as possible. the time to detection is interesting if one wants to detect the outbreak early (or else it is not important), which could be the case if manufacturing new vaccine is relatively time consuming. we investigate these cases for 38 temporal network data sets and static networks derived from the temporal networks. our most important finding is that, for some regions of parameter space, our three objective measures can rank nodes very differently. this comes from the fact that sir outbreaks have a large chance of dying out in the very early phase [52] , but once they get going they follow a deterministic path. for this reason, it is thus important to be aware of what scenario one is investigating when addressing the sentinel surveillance problem. another conclusion is that, for this problem, static and temporal networks behave reasonably similarly (meaning that the temporal effects do not matter so much). naturally, some of the temporal networks respond differently than the static ones, but compared to, e.g., the outbreak sizes or time to extinction [53] [54] [55] , differences are small. among the structural descriptors of network position, there is no particular one that dominates throughout the parameter space. 
rather, local quantities like degree or strength (for the temporal networks) have a higher predictive power at low parameter values (small outbreaks). for larger parameter values, descriptors capturing the number of nodes reachable from a specific node correlate most with the objective measures' rankings. also in this sense, the static network quantities dominate the temporal ones, which is in contrast to previous observations (e.g., refs. [53] [54] [55] ). for the future, we anticipate more work on the problem of optimizing sentinel surveillance. an obvious continuation of this work would be to establish the differences between the objective metrics in static network models. to do the same in temporal networks would also be interesting, although more challenging given the large number of imaginable structures. yet another open problem is how to distribute sentinels if there are more than one. it is known that they should be relatively far away from each other [13] , but more precisely where should they be located? references: modern infectious disease epidemiology; infectious diseases in humans; temporal network epidemiology; a guide to temporal networks; principles and practices of public health surveillance; stochastic epidemic models and their statistical analysis; pretty quick code for regular (continuous time, markovian) sir on networks, github.com/pholme/sir; proceedings, acm sigcomm 2006 workshop on challenged networks (chants); crawdad dataset st_andrews/sassy; third international conference on emerging intelligent data and web technologies; proc. natl. acad. sci. usa; proceedings of the 2nd acm workshop on online social networks, wosn '09; proceedings of the tenth acm international conference on web search and data mining, wsdm '17; proceedings of the 14th international conference; networks: an introduction; network analysis: methodological foundations; distance in graphs. we thank sune lehmann for providing the copenhagen data sets. this work was supported by jsps kakenhi grant no. jp 18h01655. key: cord-048461-397hp1yt authors: coelho, flávio c; cruz, oswaldo g; codeço, cláudia t title: epigrass: a tool to study disease spread in complex networks date: 2008-02-26 journal: source code biol med doi: 10.1186/1751-0473-3-3 sha: doc_id: 48461 cord_uid: 48461 background: the construction of complex spatial simulation models, such as those used in network epidemiology, is a daunting task due to the large amount of data involved in their parameterization. such data, which frequently resides on large geo-referenced databases, has to be processed and assigned to the various components of the model. all this just to construct the model; then it still has to be simulated and analyzed under different epidemiological scenarios. this workflow can only be achieved efficiently by computational tools that can automate most, if not all, of these time-consuming tasks. in this paper, we present a simulation software, epigrass, aimed at helping to design and simulate network-epidemic models with any kind of node behavior. results: we present a network epidemiological model representing the spread of a directly transmitted disease through a bus-transportation network connecting mid-size cities in brazil. results show that the topological context of the starting point of the epidemic is of great importance from both control and preventive perspectives. conclusion: epigrass is shown to facilitate greatly the construction, simulation and analysis of complex network models.
the output of model results in standard gis file formats facilitate the post-processing and analysis of results by means of sophisticated gis software. epidemic models describe the spread of infectious diseases in populations. more and more, these models are being used for predicting, understanding and developing control strategies. to be used in specific contexts, modeling approaches have shifted from "strategic models" (where a caricature of real processes is modeled in order to emphasize first principles) to "tactical models" (detailed representations of real situations). tactical models are useful for cost-benefit and scenario analyses. good examples are the foot-and-mouth epidemic models for uk, triggered by the need of a response to the 2001 epidemic [1, 2] and the simulation of pandemic flu in differ-ent scenarios helping authorities to choose among alternative intervention strategies [3, 4] . in realistic epidemic models, a key issue to consider is the representation of the contact process through which a disease is spread, and network models have arisen as good candidates [5] . this has led to the development of "network epidemic models". network is a flexible concept that can be used to describe, for example, a collection of individuals linked by sexual partnerships [6] , a collection of families linked by sharing workplaces/schools [7] , a collection of cities linked by air routes [8] . any of these scales may be relevant to the study and control of disease spread [9] . networks are made of nodes and their connections. one may classify network epidemic models according to node behavior. one example would be a classification based on the states assumed by the nodes: networks with discretestate nodes have nodes characterized by a discrete variable representing its epidemiological status (for example, susceptible, infected, recovered). the state of a node changes in response to the state of neighbor nodes, as defined by the network topology and a set of transmission rules. networks with continuous-state nodes, on the other hand, have node' state described by a quantitative variable (number of susceptibles, density of infected individuals, for example), modelled as a function of the history of the node and its neighbors. the importance of the concept of neighborhood on any kind of network epidemic model stems from its large overlap with the concept of transmission. in network epidemic models, transmission either defines or is defined/constrained by the neighborhood structure. in the latter case, a neighborhood structure is given a priori which will influence transmissibility between nodes. the construction of complex simulation models such as those used in network epidemic models, is a daunting task due to the large amount of data involved in their parameterization. such data frequently resides on large geo-referenced databases. this data has to be processed and assigned to the various components of the model. all this just to construct the model, then it still has to be simulated, analyzed under different epidemiological scenarios. this workflow can only be achieved efficiently by computational tools that can automate most if not all of these time-consuming tasks. in this paper, we present a simulation software, epigrass, aimed to help designing and simulating network-epidemic models with any kind of node behavior. without such a tool, implementing network epidemic models is not a simple task, requiring a reasonably good knowledge of programming. 
we expect that this software will stimulate the use and development of networks models for epidemiological purposes. the paper is organized as following: first we describe the software and how it is organized with a brief overview of its functionality. then we demonstrate its use with an example. the example simulates the spread of a directly transmitted infectious disease in brazil through its transportation network. the velocity of spread of new diseases in a network of susceptible populations depends on their spatial distribution, size, susceptibility and patterns of contact. in a spatial scale, climate and environment may also impact the dynamics of geographical spread as it introduces temporal and spatial heterogeneity. understanding and predicting the direction and velocity of an invasion wave is key for emergency preparedness. epigrass is a platform for network epidemiological simulation and analysis. it enables researchers to perform comprehensive spatio-temporal simulations incorporating epidemiological data and models for disease transmission and control in order to create complex scenario analyses. epigrass is designed towards facilitating the construction and simulation of large scale metapopulational models. each component population of such a metapopulational model is assumed to be connected through a contact network which determines migration flows between populations. this connectivity model can be easily adapted to represent any type of adjacency structure. epigrass is entirely written in the python language, which contributes greatly to the flexibility of the whole system due to the dynamical nature of the language. the geo-referenced networks over which epidemiological processes take place can be very straightforwardly represented in a object-oriented framework. consequently, the nodes and edges of the geographical networks are objects with their own attributes and methods (figure 1). once the archetypal node and edge objects are defined with appropriate attributes and methods, then a code representation of the real system can be constructed, where nodes (representing people or localities) and contact routes are instances of node and edge objects, respectively. the whole network is also an object with its own set of attributes and methods. in fact, epigrass also allows for multiple edge sets in order to represent multiple contact networks in a single model. figure 1 architecture of an epigrass simulation model. a simulation object contains the whole model and all other objects representing the graph, sites and edges. site object contaim model objects, which can be one of the built-in epidemiological models or a custom model written by the user. these features leads to a compact and hierarchical computational model consisting of a network object containing a variable number of node and edge objects. it also does not pose limitations to encapsulation, potentially allowing for networks within networks, if desirable. this representation can also be easily distributed over a computational grid or cluster, if the dependency structure of the whole model does not prevent it (this feature is currently being implemented and will be available on a future release of epigrass). for the end-user, this hierarchical, object-oriented representation is not an obstacle since it reflects the natural structure of the real system. 
even after the model is converted into a code object, all of its component objects remain accessible to one another, facilitating the exchange of information between all levels of the model, a feature the user can easily include in his/her custom models. nodes and edges are dynamical objects in the sense that they can be modified at runtime, altering their behavior in response to user-defined events. in epigrass it is very easy to simulate any dynamical system embedded in a network. however, it was designed with epidemiological models in mind. this goal led to the inclusion of a collection of built-in epidemic models which can be readily used for the intra-node dynamics (sir model family). epigrass users are not limited to basing their simulations on the built-in models. user-defined models can be developed in just a few lines of python code. all simulations in epigrass are done in discrete time. however, custom models may implement finer dynamics within each time step, by implementing ode models at the nodes, for instance. the epigrass system is driven by a graphical user interface (gui), which handles several input files required for model definition and manages the simulation and output generation (figure 2). at the core of the system lies the simulator. it parses the model specification, contained in a text file (the .epg file), and builds the network from site and edge description files (comma-separated values text files, csv). the simulator then builds a code representation of the entire model, simulates it, and stores the results in the database or in a couple of csv files. this output will contain the full time series of the variables in the model. additionally, a map layer (in shapefile and kml format) is also generated with summary statistics for the model (figure 3). the results of an epigrass simulation can be visualized in different ways. a map with an animation of the resulting time series is available directly through the gui (figure 4). other types of static visualizations can be generated through gis software from the shapefiles generated. the kml file can also be viewed in google earth™ or google maps™ (figure 5). epigrass also includes a report generator module which is controlled through a parameter in the ".epg" file. epigrass is capable of generating pdf reports with summary statistics from the simulation. this module requires a latex installation to work. reports are most useful for general verification of expected model behavior and network structure. however, the latex source files generated by the module may serve as templates that the user can edit to generate a more complete document. (figure 2: epigrass graphical user interface. figure 3: workflow for a typical epigrass simulation; this diagram shows all inputs and outputs typical of an epigrass simulation session.) building a model in epigrass is very simple, especially if the user chooses to use one of the built-in models. epigrass includes 20 different epidemic models ready to be used (see the manual for a description of the built-in models). to run a network epidemic model in epigrass, the user is required to provide three separate text files (optionally, also a shapefile with the map layer): 1. node-specification file: this file can be edited on a spreadsheet and saved as a csv file. each row is a node and the columns are variables describing the node. 2. edge-specification file: this is also a spreadsheet-like file with an edge per row.
columns contain flow variables. 3. model-specification file: also referred to as the ".epg" file. this file specifies the epidemiological model to be run at the nodes, its parameters, the flow model for the edges, and general parameters of the simulation. the ".epg" file is normally modified from templates included with epigrass. the nodes and edges files, on the other hand, have to be built from scratch for every new network. details of how to construct these files, as well as examples, can be found in the documentation accompanying the software, which is available at the project's website [10] . in the example application, the spread of a respiratory disease through a network of cities connected by bus transportation routes is analyzed. the epidemiological scenario is one of the invasion of a new influenza-like virus. one may want to simulate the spread of this disease through the country by the transportation network to evaluate alternative intervention strategies (e.g. different vaccination strategies). in this problem, a network can be defined as a set of nodes and links where nodes represent cities and links represent transportation routes. some examples of this kind of model are available in the literature [8, 11] . one possible objective of this model is to understand how the spread of such a disease may be affected by the point of entry of the disease in the network. to that end, we may look at variables such as the speed of the epidemic, the number of cases after a fixed amount of time, the distribution of cases in time and the path taken by the spread. the example network was built from 76 of the largest cities of brazil (>= 100,000 inhabitants). the bus routes between those cities formed the connections between the nodes of the network. the number of edges in the network, derived from the bus routes, is 850. (figure 4: epigrass animation output; sites are color coded, from red to blue, according to infection times, with bright red marking the seed site in the northeast. figure 5: epigrass output visualized on google earth.) these bus routes are registered with the national agency of terrestrial transportation (antt), which provided the data used to parameterize the edges of the network. the epidemiological model used consisted of a metapopulation system with a discrete-time seir model (eq. 1). for each city, s t is the number of susceptibles in the city at time t, e t is the number of infected but not yet infectious individuals, i t is the number of infectious individuals resident in the locality, n is the population residing in the locality (assumed constant throughout the simulation), n t is the number of individuals visiting the locality, and θ t is the number of visitors who are infectious. the parameters used were taken from lipsitch et al. (2003) [12] to represent a disease like sars with an estimated basic reproduction number (r 0 ) of 2.2 to 3.6 (table 1). to simulate the spread of infection between cities, we used the concept of a "forest fire" model [13] . an infected individual, traveling to another city, acts as a spark that may trigger an epidemic in the new locality. this approach is based on the assumption that individuals commute between localities and contribute temporarily to the number of infected in the new locality, but not to its demography. implications of this approach are discussed in grenfell et al (2001) [13] .
the number of individuals arriving in a city (n t ) is based on the annual total number of passengers arriving through all bus routes leading to that city, as provided by the antt (brazilian national agency for terrestrial transportation). the annual number of passengers is used to derive an average daily number of passengers simply by dividing it by 365. stochasticity is introduced in the model at two points: the number of new cases is drawn from a poisson distribution whose intensity is given by the transmission term of the seir model (eq. 1), and the number of infected individuals visiting city i is modelled as a binomial process with n trials and a success probability given by the prevalence of infection in the neighboring city (lagged by the delay δ defined below), summed over all k neighbors, where n is the total number of passengers arriving from a given neighboring city; i k, t and n k are the current number of infectious individuals and the total population size of city k, respectively. δ is the delay associated with the duration of each bus trip. the delay δ was calculated as the number of days (rounded down) that a bus, traveling at an average speed of 60 km/h, would take to complete a given trip. the lengths in kilometers of all bus routes were also obtained from the antt. vaccination campaigns in specific (or all) cities can be easily attained in epigrass, with individual coverages for each campaign in each city. we use this feature to explore vaccination scenarios in this model (figures 6 and 7). (figure 6: cost in vaccines applied vs. benefit in cases avoided, for a simulated epidemic starting at the highest degree city, são paulo. figure 7: cost in vaccines applied vs. benefit in cases avoided, for a simulated epidemic starting at a relatively low degree city, salvador.) the files with this model's definition (the sites, edges and ".epg" files) are available as part of the additional files 1, 2 and 3 for this article. to determine the importance of the point of entry in the outcome of the epidemic, the model was run 500 times, randomizing the point of entry of the virus. the seeding site was chosen with a probability proportional to the log 10 of its population size. these replicates were run using epigrass' built-in support for repeated runs with the option of randomizing the seeding site. for every simulation, statistics about each site, such as the time it got infected and the time series of incidence, were saved. the time required for the epidemic to infect 50% of the cities was chosen as a global index of network susceptibility to invasion. to compare the relative exposure of cities to disease invasion, we also calculated the inverse of the time elapsed from the beginning of the epidemic until the city registered its first indigenous case as a local measure of exposure. except for population size, all other epidemiological parameters were the same for all cities, that is, disease transmissibility and recovery rate. some positional features of each node were also derived: centrality, which is a measure derived from the average distance of a given site to every other site in the network; betweenness, which is the number of times a node figures in the shortest path between any other pair of nodes; and degree, which is the number of edges connected to a node. in order to analyze the path of the epidemic spread, we also recorded which cities provided the infectious cases which were responsible for the infection of each other city. if more than one source of infection exists, epigrass selects the city which contributed the largest number of infectious individuals at that time-step as the most likely infector.
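the stochastic coupling between cities described above can be sketched as follows. the binomial draw of infectious visitors summed over neighboring cities, with a delay δ, follows the definitions in the text; the poisson intensity written in the code (β s (i + θ)/(n + n_visitors)) is an assumed form of the force of infection, since eq. 1 is not reproduced here, and numpy's random generator stands in for epigrass's own machinery.

import numpy as np

rng = np.random.default_rng(42)

def infectious_visitors(neighbors, t, delta):
    """theta_t: binomial draw of infectious passengers summed over all
    k neighboring cities, using the state of city k delta days ago."""
    theta = 0
    for nb in neighbors:
        past_i = nb["I_hist"][max(t - delta, 0)]
        theta += rng.binomial(nb["passengers"], past_i / nb["N"])
    return theta

def new_cases(S, I, N, theta, n_visitors, beta):
    """Poisson draw of new local cases.  The intensity written here,
    beta * S * (I + theta) / (N + n_visitors), is an assumed form of the
    force of infection; the extracted text does not reproduce Eq. 1."""
    lam = beta * S * (I + theta) / (N + n_visitors)
    return min(S, rng.poisson(lam))

city = {"S": 99990, "I": 10, "N": 100000}
neighbors = [{"N": 50000, "passengers": 300, "I_hist": [0, 5, 20, 80]}]
theta = infectious_visitors(neighbors, t=3, delta=1)
print(theta, new_cases(city["S"], city["I"], city["N"], theta,
                       n_visitors=300, beta=0.4))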
at the end of the simulation epigrass generates a file with the dispersion tree in graphml format, which can be read by a variety of graph plotting programs to generate the graphic seen in figure 8. the computational cost of running a single time step in an epigrass model is mainly determined by the cost of calculating the epidemiological models on each site (node). therefore, the time required to run models based on larger networks should scale linearly with the size of the network (order of the graph), for simulations of the same duration. the model presented here took 2.6 seconds for a 100-day run, on a 2.1 ghz cpu. a somewhat larger model with 343 sites and 8735 edges took 28 seconds for a 100-day simulation. very large networks may be limited by the amount of ram available. the authors are working on adapting epigrass to distribute processing among multiple cpus (in smp systems), or multiple computers in a cluster system. the memory demands can also be addressed by keeping the simulation objects in an object-oriented database during the simulation. steps in this direction are also being taken by the development team. the model presented here served mainly the purpose of illustrating the capabilities of epigrass for simulating and analyzing reasonably complex epidemic scenarios. it should not be taken as a careful and complete analysis of a real epidemic. despite that, some features of the simulated epidemic are worth discussing. for example: the spread speed of the epidemic, measured as the time taken to infect 50% of the cities, was found to be influenced by the centrality and degree of the entry node (figures 9 and 10). the dispersion tree corresponding to the epidemic is greatly influenced by the degree of the point of entry of the disease in the network. (figure 8: spread of the epidemic starting at the city of salvador, a city with relatively small degree, that is, a small number of neighbors; the number next to the boxes indicates the day when each city developed its first indigenous case. figure 9: effect of degree (a) and betweenness (b) of the entry node on the speed of the epidemic. figure 10: effect of betweenness of the entry node on the speed of the epidemic.)
but epigrass allows this structure to be provided as a simple list of sites and edges on text files, which can easily be contructed by the user using a spreadsheet, with no need for special software tools. besides invasion, network epidemiological models can also be used to understand patterns of geographical spread of endemic diseases [14] [15] [16] [17] . many infectious diseases can only be maintained in a endemic state in cities with population size above a threshold, or under appropriate environmental conditions(climate, availability of a reservoir, vectors, etc). the variables and the magnitudes associated with endemicity threshold depends on the natural history of the disease [18] . theses magnitudes may vary from place to place as it depends on the contact structure of the individuals. predicting which cities are sources for the endemicity and understanding the path of recurrent traveling waves may help us to design optimal surveillance and control strategies. modelling vaccination strategies against foot-and-mouth disease optimal reactive vaccination strategies for a foot-and-mouth outbreak in the uk strategy for distribution of influenza vaccine to high-risk groups and children containing pandemic influenza with antiviral agents space and contact networks: capturing the locality of disease transmission interval estimates for epidemic thresholds in two-sex network models applying network theory to epidemics: control measures for mycoplasma pneumoniae outbreaks assessing the impact of airline travel on the geographic spread of pandemic influenza modeling control strategies of respiratory pathogens epigrass website containing pandemic influenza at the source transmission dynamics and control of severe acute respiratory syndrome travelling waves and spatial hierarchies in measles epidemics travelling waves in the occurrence of dengue haemorrhagic fever in thailand modelling disease outbreaks in realistic urban social networks on the dynamics of flying insects populations controlled by large scale information large-scale spatial-transmission models of infectious disease disease extinction and community size: modeling the persistence of measles the authors would like to thank the brazilian research council (cnpq) for financial support to the authors. fcc contributed with the software development, model definition and analysis as well as general manuscript conception and writing. ctc contributed with model definition and implementation, as well as with writing the manuscript. ogc, contributed with data analysis and writing the manuscript. all authors have read and approved the final version of the manuscript. key: cord-241057-cq20z1jt authors: han, jungmin; cresswell-clay, evan c; periwal, vipul title: statistical physics of epidemic on network predictions for sars-cov-2 parameters date: 2020-07-06 journal: nan doi: nan sha: doc_id: 241057 cord_uid: cq20z1jt the sars-cov-2 pandemic has necessitated mitigation efforts around the world. we use only reported deaths in the two weeks after the first death to determine infection parameters, in order to make predictions of hidden variables such as the time dependence of the number of infections. early deaths are sporadic and discrete so the use of network models of epidemic spread is imperative, with the network itself a crucial random variable. location-specific population age distributions and population densities must be taken into account when attempting to fit these events with parametrized models. 
these characteristics render naive bayesian model comparison impractical as the networks have to be large enough to avoid finite-size effects. we reformulated this problem as the statistical physics of independent location-specific `balls' attached to every model in a six-dimensional lattice of 56448 parametrized models by elastic springs, with model-specific `spring constants' determined by the stochasticity of network epidemic simulations for that model. the distribution of balls then determines all bayes posterior expectations. important characteristics of the contagion are determinable: the fraction of infected patients that die ($0.017 \pm 0.009$), the expected period an infected person is contagious ($22 \pm 6$ days) and the expected time between the first infection and the first death ($25 \pm 8$ days) in the us. the rate of exponential increase in the number of infected individuals is $0.18 \pm 0.03$ per day, corresponding to 65 million infected individuals in one hundred days from a single initial infection, which fell to 166000 with even imperfect social distancing effectuated two weeks after the first recorded death. the fraction of compliant socially-distancing individuals matters less than their fraction of social contact reduction for altering the cumulative number of infections. the pandemic caused by the sars-cov-2 virus has swept across the globe with remarkable rapidity. the parameters of the infection produced by the virus, such as the infection rate from person-to-person contact, the mortality rate upon infection and the duration of the infectivity period, are still controversial. parameters such as the duration of infectivity and predictions such as the number of undiagnosed infections could be useful for shaping public health responses, as the predictive aspects of model simulations are possible guides to pandemic mitigation [7, 10, 20] . in particular, the possible importance of superspreaders should be understood [24] [25] [26] [27] . [5] had the insight that the early deaths in this pandemic could be used to find some characteristics of the contagion that are not directly observable, such as the number of infected individuals. this number is, of course, crucial for public health measures. the problem is that standard epidemic models with differential equations are unable to determine such hidden variables, as explained clearly in [6] . the early deaths are sporadic and discrete events. these characteristics imply that simulating the epidemic must be done in the context of network models with discrete dynamics for infection spread and death. the first problem that one must contend with is that even rough estimates of the high infection transmission rate and a death rate with strong age dependence imply that one must use large networks for simulations, on the order of $10^5$ nodes, because one must avoid finite-size effects in order to accurately fit the early stochastic events. the second problem that arises is that the contact networks are obviously unknown, so one must treat the network itself as a stochastic random variable, multiplying the computational time by the number of distinct networks that must be simulated for every parameter combination considered. the third problem is that there are several characteristics of sars-cov-2 infections that must be incorporated in any credible analysis, and the credibility of the analysis requires an unbiased sample of parameter sets.
these characteristics are the strong age dependence of mortality of sars-cov-2 infections and a possible dependence on population density which should determine network connectivity in an unknown manner. thus the network nodes have to have location-specific population age distributions incorporated as node characteristics and the network connectivity itself must be a free parameter. 3 an important point in interpreting epidemics on networks is that the simplistic notion that there is a single rate at which an infection is propagated by contact is indefensible. in particular, for the sars-cov-2 virus, there are reports of infection propagation through a variety of mucosal interfaces, including the eyes. thus, while an infection rate must be included as a parameter in such simulations, there is a range of infection rates that we should consider. indeed, one cannot make sense of network connectivity without taking into account the modes of contact, for instance if an individual is infected during the course of travel on a public transit system or if an individual is infected while working in the emergency room of a hospital. one expects that network connectivity should be inversely correlated with infectivity in models that fit mortality data equally well but this needs to be demonstrated with data to be credible, not imposed by fiat. the effective network infectivity, which we define as the product of network connectivity and infection rate, is the parameter that needs to be reduced by either social distancing measures such as stay-at-home orders or by lowering the infection rate with mask wearing and hand washing. a standard bayesian analysis with these features is computationally intransigent. we therefore adopted a statistical physics approach to the bayesian analysis. we imagined a six-dimensional lattice of models with balls attached to each model with springs. each ball represents a location for which data is available and each parameter set determines a lattice point. the balls are, obviously, all independent but they face competing attractions to each lattice point. the spring constants for each model are determined by the variation we find in stochastic simulations of that specific model. one of the dimensions in the lattice of models corresponds to a median age parameter in the model. each location ball is attracted to the point in the median age parameter dimension that best matches that location's median age, and we only have to check that the posterior expectation of the median age parameter for that location's ball is close to the location's actual median age. thus we can decouple the models and the data simulations without having to simulate each model with the characteristics of each location, making the bayesian model comparison amenable to computation. finally, the distribution of location balls over the lattice determines the posterior expectation values of each parameter. we matched the outcomes of our simulations with data on the two week cumulative death counts after the first death using bayes' theorem to obtain parameter estimates for the infection dynamics. we used the bayesian model comparison to determine posterior expectation values for parameters for three distinct datasets. finally, we simulated the effects of various partially effective social-distancing measures on random networks and parameter sets given by the posterior expectation values of our bayes model comparison. 
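to make the `balls and springs' formulation above concrete, the following minimal sketch (not from the original paper) computes posterior expectations over a hypothetical lattice of models, with each model's simulation variance playing the role of the spring constant; all numbers are made up for illustration.

```python
# illustrative sketch of the bayes model comparison described above, with made-up
# numbers. each model m on the parameter lattice has a mean simulated two-week
# death count mu[m] and a simulation variance var[m] (the "spring constant");
# a location with observed death count d is attached to every model with a
# gaussian likelihood, and posterior expectations are likelihood-weighted averages.
import numpy as np

rng = np.random.default_rng(0)
n_models = 1000
params = rng.uniform(size=(n_models, 6))       # hypothetical 6-d parameter lattice
mu = rng.uniform(0, 50, size=n_models)         # simulated mean death counts per model
var = rng.uniform(1, 10, size=n_models)        # model-specific simulation variance

def posterior_expectation(d_observed):
    # uniform prior over models => posterior weight proportional to the likelihood
    log_like = -0.5 * (d_observed - mu) ** 2 / var - 0.5 * np.log(2 * np.pi * var)
    w = np.exp(log_like - log_like.max())
    w /= w.sum()
    return w @ params                          # posterior mean of each parameter

print(posterior_expectation(d_observed=12.0))
```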
we used data for the sars-cov-2 pandemic as compiled by [28] from the original data. we generated random g(n, p = 2l/(n − 1)) networks of n = 90000 or 100000 nodes with an average of l links per node using the python package networkx [36]. scalepopdens ≡ l is one of the parameters that we varied. we compared the posterior expectation for this parameter for a location with the actual population density in an attempt to predict the appropriate way to incorporate measurable population densities in epidemic on network models [37, 38]. we used the python epidemics on networks package [39, 40] to simulate networks with specific parameter sets. we defined nodes to have status susceptible, infected, recovered or dead. we started each simulation with exactly one infected node, chosen at random. the simulation has two sorts of events: 1. an infected node connected to a susceptible node can change the status of the susceptible node to infected with an infection rate, infrate. this event is network connectivity dependent; therefore we expect to see a negative or inverse correlation between infrate and scalepopdens. 2. an infected node can transition to recovered status with a recovery rate, recrate, or transition to a dead status with a death rate, deathrate. both these rates are entirely node-autonomous. the reciprocal of the recrate parameter (recdays in the following) is the number of days an individual is contagious. we assigned an age to each node according to a probability distribution parametrized by the median age of each data set (country or state). as is well known, there is a wide disparity in median ages in different countries. the probability distribution approximately models the triangular shape of the population pyramids that is observed in demographic studies. we parametrized it as a function of the age a in terms of medianage, the median age of a specific country, and maxage = 100 y, a global maximum age. it is computationally impossible to perform model simulations for the exact age distribution of each location. we circumvented this problem, as detailed in the next subsection (bayes setup), by incorporating a scalemedage parameter in the model, scaled so that scalemedage = 1.0 corresponds to a median age of 40 years. the node age is used to make the deathrate of any node age-specific through an age-dependent weight w(a[n]), where a[n] is the age of node n and ageincrease = 5.5 is an age-dependence exponent. w(a) is normalized so that $\sum_a w(a \mid ageincrease)\, p(a \mid medianage = 38.7\,\mathrm{y}) = 1$, using the median age of new york state's population. the value of ageincrease given above was approximately determined by fitting to the observed age-specific mortality statistics of new york state [35]. however, we included ageincrease as a model parameter since the strong age dependence of sars-cov-2 mortality is not well understood, with the normalization adjusted appropriately as a function of ageincrease. note that a decrease in the median age, with all rates and the age-dependence exponent held constant, will lead to a lower number of deaths. we use simulations to find the number of dead nodes as a function of time. the first time at which a death occurs following the initial infection in the network is labeled timefirstdeath. we implemented bayes' theorem as usual to obtain the probability of a model, m, given a set of observed deaths; varying the number of days of data used after the first death did not affect our results.
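the following sketch illustrates the node set-up described above under stated assumptions: because the exact parametrizations of the age density p(a | medianage) and the weight w(a) are not reproduced in the excerpt, simple stand-in forms (a triangular-shaped density and a power-law weight with exponent ageincrease = 5.5) are used purely for illustration.

```python
# sketch of the node set-up described above, with stand-in functional forms;
# the network is smaller than the paper's 90000-100000 nodes to keep it light.
import numpy as np
import networkx as nx

n, scalepopdens = 10_000, 4
g = nx.fast_gnp_random_graph(n, p=2 * scalepopdens / (n - 1), seed=1)

maxage, medianage, ageincrease = 100.0, 38.7, 5.5

# stand-in triangular-shaped age density on [0, maxage]; the paper's exact
# parametrization in terms of medianage is not reproduced here
ages_grid = np.arange(maxage + 1)
dens = np.clip(1.0 - ages_grid / (2.0 * medianage), 0.0, None)
dens /= dens.sum()

# stand-in age-dependent death weight, normalized so that sum_a w(a) p(a) = 1
w = (ages_grid / maxage) ** ageincrease
w /= (w * dens).sum()

ages = np.random.default_rng(2).choice(ages_grid, size=n, p=dens)
for node in g.nodes:
    g.nodes[node]["age"] = ages[node]
    g.nodes[node]["death_weight"] = w[int(ages[node])]
```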
as alluded to in the previous subsection, the posterior expectation of the scalemedage parameter (×40 y) for each location should turn out to be close to the actual median age for each location in our setup, and this was achieved (right column, figure 5 ). we simulated our grid of models on the nih biowulf cluster. our grid comprised of 56448 ×2 parametrized models simulated with 40 random networks each and parameters in all possible combinations from the following lists: parameters. in particular, note that the network infectivity (infcontacts) has a factor of two smaller uncertainty than either of its factors as these two parameters (infrate and scalepopdens) cooperate in the propagation of the contagion and therefore turn out to have a negative posterior weighted correlation coefficient (table i ). the concordance of posterior expectation values (table i) this goes along with the approximately 80 day period between the first infection and the first death for a few outlier trajectories. however, it is also clear from the histograms in figure 9 and the mean timefirstdeath given in table i that the likely value of this duration is considerably shorter. finally, we evaluated a possible correlation between the actual population density and the scalepopdens parameter governing network connectivity. we found a significant correlation when we added additional countries to the european union countries in this regression, we obtained (p < 0.0019, r = 0.33) scalepopdens(us&eu+) = 0.11 ln(population per km 2 ) + 2.9. while epidemiology is not the standard stomping ground of statistical physics, bayesian model comparison is naturally interpreted in a statistical physics context. we showed that taking this interpretation seriously leads to enormous reductions in computational effort. given the complexity of translating the observed manifestations of the pandemic into an understanding of the virus's spread and the course of the infection, we opted for a simple data-driven approach, taking into account population age distributions and the age dependence of the death rate. while the conceptual basis of our approach is simple, there were computational difficulties we had to overcome to make the implementation amenable to computability with finite computational resources. our results were checked to not depend on the size of the networks we simulated, on the number of stochastic runs we used for each model, nor on the number of days that we used for the linear regression. all the values we report in table i are well within most estimated ranges in the literature but with the benefit of uncertainty estimates performed with a uniform model prior. while each location ball may range over a broad distribution of models, the consensus posterior distribution (table i) shows remarkable concordance across datasets. we can predict the posterior distribution of time of initial infection, timefirstdeath, as shown in table i . the dynamic model can predict the number of people infected after 21 the first infection (right panel, figure 10 ) and relative to the time of first death (left panel, figure 10 ) because we made no use of infection or recovery statistics in our analysis [9] . note the enormous variation in the number of infections for the same parameter set, only partly due to stochasticity of the networks themselves, as can be seen by comparing the upper and lower rows of figure 4 . 
with parameters intrinsic to the infection held fixed, we can predict the effect of various degrees of social distancing by varying network connectivity. we assumed that a certain fraction of nodes in the network would comply with social distancing and only these compliant nodes would reduce their connections at random by a certain fraction. figure 12 shows the effects of four such combinations of compliant node fraction and fraction of con(table ii) with the posterior expectations of parameters (table i) shows that the bayes entropy of the model posterior distribution is an important factor to consider, validating our initial intuition that optimization of model parameters would be inappropriate in this analysis. the regression we found (eq.'s 6, 7, 8) with respect to population density must be considered in light of the fact that many outbreaks are occurring in urban areas so they are not necessarily reflective of the true population density dependence. furthermore, we did not find a significant regression for the countries of the european union by themselves, perhaps because they have a smaller range of population densities, though the addition of these countries into the us states data further reduced the regression p-value of the null hypothesis without materially altering regression parameters. detailed epidemiological data could be used to clarify its significance. [ [24] [25] [26] [27] have suggested the importance of super-spreader events but we did not encounter any difficulty in modeling the available data with garden variety g(n, p) networks. certainly if the network has clusters of older nodes, there will be abrupt jumps in the cumulative death count as the infection spreads through the network. furthermore, it would be interesting to consider how to make the basic model useful for more heterogenous datasets such as all countries of the world with vastly different reporting of death statistics. using the posterior distribution we derived as a starting point for more complicated models may be an approach worth investigating. infectious disease modeling is a deep field with many sophisticated approaches in use [39, [41] [42] [43] and, clearly, our analysis is only scratching the surface of the problem at hand. network structure, in particular, is a topic that has received much attention in social network research [37, 38, [44] [45] [46] . bayesian approaches have been used in epidemics on networks modeling [47] and have also been used in the present pandemic context in [2, 27, 48] . to our knowledge, there is no work in the published literature that has taken the approach adopted in this paper. there are many caveats to any modeling attempt with data this heterogenous and complex. first of all, any model is only as good as the data incorporated and unreported sars-cov-2 deaths would impact the validity of our results. secondly, if the initial deaths occur in specific locations such as old-age care centers, our modeling will over-estimate the death rate. a safeguard against this is that the diversity of locations we used may compensate to a limited extent. detailed analysis of network structure from contact tracing can be used to correct for this if such data is available, and our posterior model probabilities could guide such refinement. thirdly, while we ensured that our results did not depend on our model ranges as far as practicable, we cannot guarantee that a model with parameters outside our ranges could not be a more accurate model. 
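as a complement to the social-distancing scenarios discussed at the beginning of this section, the sketch below shows one simple way to encode them on a network: a fraction of compliant nodes drops a fraction of its links at random. the two fractions used here are illustrative values, not the paper's.

```python
# sketch of a social-distancing intervention: "compliant" nodes remove, at random,
# a fraction of their links. fractions below are illustrative only.
import random
import networkx as nx

def apply_social_distancing(g, compliant_fraction=0.5, contact_reduction=0.5, seed=0):
    rng = random.Random(seed)
    h = g.copy()
    compliant = rng.sample(list(h.nodes), int(compliant_fraction * h.number_of_nodes()))
    for node in compliant:
        neighbors = list(h.neighbors(node))
        k_drop = int(contact_reduction * len(neighbors))
        for nb in rng.sample(neighbors, k_drop):
            if h.has_edge(node, nb):
                h.remove_edge(node, nb)
    return h

g = nx.fast_gnp_random_graph(10_000, 8 / 9_999, seed=1)
g_distanced = apply_social_distancing(g, compliant_fraction=0.75, contact_reduction=0.5)
print(g.number_of_edges(), "->", g_distanced.number_of_edges())
```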
the transparency of our analysis and the simplicity of our assumptions may be helpful in this regard. all code is available 23 an seir infectious disease model with testing and conditional quarantine the lancet infectious diseases the lancet infectious diseases the lancet infectious diseases proceedings of the 7th python in science conference 2015 winter simulation conference (wsc) agent-based modeling and network dynamics infectious disease modeling charting the next pandemic: modeling infectious disease spreading in the data science age we are grateful to arthur sherman for helpful comments and questions and to carson chow for prepublication access to his group's work [6] . this work was supported by the key: cord-024346-shauvo3j authors: kruglov, vasiliy n. title: using open source libraries in the development of control systems based on machine vision date: 2020-05-05 journal: open source systems doi: 10.1007/978-3-030-47240-5_7 sha: doc_id: 24346 cord_uid: shauvo3j the possibility of the boundaries detection in the images of crushed ore particles using a convolutional neural network is analyzed. the structure of the neural network is given. the construction of training and test datasets of ore particle images is described. various modifications of the underlying neural network have been investigated. experimental results are presented. when processing crushed ore mass at ore mining and processing enterprises, one of the main indicators of the quality of work of both equipment and personnel is the assessment of the size of the crushed material at each stage of the technological process. this is due to the need to reduce material and energy costs for the production of a product unit manufactured by the plant: concentrate, sinter or pellets. the traditional approach to the problem of evaluating the size of crushed material is manual sampling with subsequent sieving with sieves of various sizes. the determination of the grain-size distribution of the crushed material in this way entails a number of negative factors: -the complexity of the measurement process; -the inability to conduct objective measurements with sufficient frequency; -the human error factor at the stages of both data collection and processing. these shortcomings do not allow you to quickly adjust the performance of crushing equipment. the need for obtaining data on the coarseness of crushed material in real time necessitated the creation of devices for in situ assessment of parameters such as the grain-size distribution of ore particles, weight-average ore particle and the percentage of the targeted class. the machine vision systems are able to provide such functionality. they have high reliability, performance and accuracy in determining the geometric dimensions of ore particles. at the moment, several vision systems have been developed and implemented for the operational control of the particle size distribution of crushed or granular material. in [9] , a brief description and comparative analysis of such systems as: split, wipfrag, fragscan, cias, ipacs, tucips is given. common to the algorithmic part of these systems is the stage of dividing the entire image of the crushed ore mass into fragments corresponding to individual particles with the subsequent determination of their geometric sizes. such a segmentation procedure can be solved by different approaches, one of which is to highlight the boundaries between fragments of images of ore particles. 
classical methods for highlighting borders are based on assessing changes in the brightness of neighboring pixels, which implies the use of mathematical algorithms based on differentiation [4, 8] . figure 1 shows typical images of crushed ore moving on a conveyor belt. algorithms from the opencv library, the sobel and canny filters in particular, used to detect borders on the presented images, identified many false boundaries and cannot be used in practice. this paper presents the results of recognizing the boundaries of images of stones based on a neural network. this approach has been less studied and described in the literature; however, it has recently acquired great significance owing to its versatility and continues to develop actively as hardware performance increases [3, 5] . to build a neural network and apply machine learning methods, a sample of grayscale images of crushed ore stones was formed. the recognition of the boundaries of the ore particles must be performed for stones of arbitrary size and configuration on a video frame with a resolution of 768 × 576 pixels. to solve this problem with the help of neural networks, it is necessary to determine what type of neural network to use, what the input information will be, and what result we want to obtain as the output of the neural network processing. an analysis of the literature showed that convolutional neural networks are the most promising for image processing [3, [5] [6] [7] . a convolutional neural network is a special artificial neural network architecture aimed at efficient pattern recognition. this architecture manages to recognize objects in images much more accurately since, unlike the multilayer perceptron, the two-dimensional image topology is taken into account. at the same time, convolutional networks are resistant to small displacements, zooming, and rotation of objects in the input images. it is this type of neural network that will be used in constructing a model for recognizing boundary points of fragments of stone images. algorithms for extracting the boundaries of regions use, as source data, image regions of size 3 × 3 or 5 × 5 pixels. if the algorithm provides for integration operations, then the window size increases. an analysis of the subject area for which this neural network is designed (a cascade of secondary and fine ore crushing) showed that, for images of 768 × 576 pixels and the visible sizes of ore pieces, it is preferable to analyze fragments of 50 × 50 pixels. thus, the input data for constructing the boundaries of stone areas will be an array of images consisting of (768 − 50)*(576 − 50) = 377668 halftone fragments measuring 50 × 50 pixels. in each of these fragments, the central point either belongs to the boundary of the regions or not. based on this criterion, all images can be divided into two classes. to mark the images into classes, the borders of the stones were drawn on the source images using a red line with a width of 5 pixels. this procedure was performed manually with the microsoft paint program. an example of the original and marked image is shown in fig. 2. a python script was then written that processed the original image to extract fragments of 50 × 50 pixels and, based on the markup image, sorted the fragments into classes, saving them in different directories. to write the scripts, we used the python 3 programming language and the jupyter notebook ide.
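a hedged sketch of the fragment-extraction script described above is given below; the file names and the "red enough" threshold are assumptions, and in practice one would likely store the patches as arrays rather than hundreds of thousands of image files.

```python
# sketch of the fragment-extraction step described above (file names hypothetical).
# a 50x50 patch is labelled "boundary" if the centre pixel of the corresponding
# markup image is red (the hand-drawn 5-pixel border line).
import os
import numpy as np
from PIL import Image

gray = np.array(Image.open("ore_gray.png").convert("L"))
markup = np.array(Image.open("ore_markup.png").convert("RGB"))

win, half = 50, 25
os.makedirs("dataset/boundary", exist_ok=True)
os.makedirs("dataset/non_boundary", exist_ok=True)

h, w = gray.shape
count = 0
for y in range(half, h - half):
    for x in range(half, w - half):
        r, g, b = markup[y, x]
        is_boundary = r > 200 and g < 80 and b < 80          # "red enough" threshold
        patch = gray[y - half:y + half, x - half:x + half]   # 50x50 fragment
        label = "boundary" if is_boundary else "non_boundary"
        Image.fromarray(patch).save(f"dataset/{label}/{y}_{x}.png")
        count += 1
print("saved", count, "fragments")
```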
thus, two data samples were obtained: a training dataset and a test dataset for assessing the accuracy of the network. as noted above, the architecture of the neural network was built on a convolutional principle. the structure of the basic network architecture is shown in fig. 3 [7] . the network includes an input layer in the format of a 50 × 50 × 1 tensor, followed by several convolutional and pooling layers. after that, the network unfolds into one fully connected layer, the outputs of which converge into one neuron, to which the sigmoid activation function is applied. at the output, we obtain the probability that the center point of the input fragment belongs to the "boundary point" class. the keras open source library was used to develop and train the convolutional neural network [1, 2, 6, 10] . the basic convolutional neural network was trained with the following parameters:
- 10 epochs;
- loss function: binary cross-entropy;
- quality metric: accuracy (percentage of correct answers);
- optimization algorithm: rmsprop.
the accuracy provided by the base model on the reference dataset is 90.8%. in order to improve the accuracy of the predictions, a script was written that trains models with several configurations and checks the quality of each model on the test dataset. to improve the accuracy of the predictions of the convolutional neural network, the following parameters were varied with respect to the base model:
- increasing the number of layers: +1 convolutional, +1 pooling;
- increasing the number of filters: +32 in each layer;
- increasing the filter size up to 5 × 5;
- increasing the number of epochs up to 30;
- decreasing the number of layers.
these modifications of the base convolutional neural network did not lead to an improvement in its performance: all modified models showed worse quality on the test sample (in the region of 88-90% accuracy). the convolutional neural network model that showed the best quality was the base model. its quality on the training sample is estimated at 90.8%, and on the test sample at 83%. none of the other models were able to surpass this figure. data on accuracy and error per epoch are shown in figs. 4 and 5. if training continues for more than 10 epochs, overfitting occurs: the error drops and the accuracy increases only on the training samples, but not on the test ones. figure 6 shows examples of images with boundaries detected by the neural network. as can be seen from the images, not all the borders are closed. the boundary discontinuities are too large to be closed using morphological operations on binary masks; however, the use of the "watershed" algorithm [8] will reduce the identification error of the boundary points. in this work, a convolutional neural network was developed and tested to recognize boundaries in images of crushed ore stones. for the task of constructing the convolutional neural network model, two data samples were generated: a training and a test dataset. when building the model, the basic version of the convolutional neural network structure was implemented. in order to improve the quality of recognition, configurations of various models with deviations from the basic architecture were devised, and an algorithm for training and searching for the best model by enumerating configurations was implemented. in the course of the research, it was found that the basic model has the best quality for recognizing boundary points. it achieves a prediction accuracy for the targeted class of 83%.
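the following is a hedged reconstruction of the base model and training set-up described above; only the overall structure (input 50 × 50 × 1, convolutional and pooling layers, a fully connected layer, a single sigmoid output) and the training parameters follow the text, while the number of filters and the dense-layer width are assumptions.

```python
# hedged reconstruction of the base convolutional model; filter counts and the
# dense-layer width are assumptions, the training set-up follows the text.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(50, 50, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability that the centre pixel is a boundary point
])

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# x_train: (n, 50, 50, 1) grayscale fragments scaled to [0, 1]; y_train: 0/1 labels
# model.fit(x_train, y_train, epochs=10, batch_size=128,
#           validation_data=(x_test, y_test))
```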
based on the drawn borders on the test images, it can be concluded that the convolutional neural network is able to correctly identify the boundary points with a high probability. it rarely makes mistakes for cases when there is no boundary (false positive), but often makes mistakes when recognizing real boundary points (false negative). the boundary breaks are too large to be closed using morphological operations on binary masks, however, the use of the "watershed" algorithm will reduce the identification error for boundary points. funding. the work was performed under state contract 3170γc1/48564, grant from the fasie. keras: the python deep learning library deep learning with python, 1st edn machine learning: the art and science of algorithms that make sense of data digital image processing hands-on machine learning with scikit-learn and tensorflow: concepts, tools, and techniques to build intelligent systems deep learning with keras: implement neural networks with keras on theano and tensorflow comprehensive guide to convolutional neural networks -the eli5 way image processing, analysis and machine vision identifying, visualizing, and comparing regions in irregularly spaced 3d surface data python data science handbook: essential tools for working with data, 1st edn key: cord-027463-uc0j3fyi authors: brandi, giuseppe; di matteo, tiziana title: a new multilayer network construction via tensor learning date: 2020-05-25 journal: computational science iccs 2020 doi: 10.1007/978-3-030-50433-5_12 sha: doc_id: 27463 cord_uid: uc0j3fyi multilayer networks proved to be suitable in extracting and providing dependency information of different complex systems. the construction of these networks is difficult and is mostly done with a static approach, neglecting time delayed interdependences. tensors are objects that naturally represent multilayer networks and in this paper, we propose a new methodology based on tucker tensor autoregression in order to build a multilayer network directly from data. this methodology captures within and between connections across layers and makes use of a filtering procedure to extract relevant information and improve visualization. we show the application of this methodology to different stationary fractionally differenced financial data. we argue that our result is useful to understand the dependencies across three different aspects of financial risk, namely market risk, liquidity risk, and volatility risk. indeed, we show how the resulting visualization is a useful tool for risk managers depicting dependency asymmetries between different risk factors and accounting for delayed cross dependencies. the constructed multilayer network shows a strong interconnection between the volumes and prices layers across all the stocks considered while a lower number of interconnections between the uncertainty measures is identified. network structures are present in different fields of research. multilayer networks represent a widely used tool for representing financial interconnections, both in industry and academia [1] and has been shown that the complex structure of the financial system plays a crucial role in the risk assessment [2, 3] . a complex network is a collection of connected objects. these objects, such as stocks, banks or institutions, are called nodes and the connections between the nodes are called edges, which represent their dependency structure. 
multilayer networks extend standard networks by assembling multiple network 'layers' that are connected to each other via interlayer edges [4] and can be naturally represented by tensors [5]. the interlayer edges form the dependency structure between different layers and, in the context of this paper, across different risk factors. however, two issues arise: (1) the construction of such networks is usually based on correlation matrices (or other symmetric dependence measures) calculated on financial asset returns; unfortunately, such matrices, being symmetric, hide possible asymmetries between stocks. (2) multilayer networks are usually constructed via contemporaneous interconnections, neglecting possible delayed cause-effect relationships between and within layers. in this paper, we propose a method that relies on tensor autoregression and avoids these two issues. in particular, we use the tensor learning approach established in [6] to estimate the tensor coefficients, which are the building blocks of the multilayer network of the intra and inter dependencies in the analyzed financial data. in particular, we tackle three different aspects of financial risk, i.e. market risk, liquidity risk, and future volatility risk. these three risk factors are represented by prices, volumes and two measures of expected future uncertainty, i.e. the implied volatility at 10 days (iv10) and the implied volatility at 30 days (iv30) of each stock. in order to have stationary data but retain the maximum amount of memory, we computed the fractional difference of each time series [7]. to improve visualization and to extract relevant information, the resulting multilayer network is then filtered independently in each dimension with the recently proposed polya filter [8]. the analysis shows a strong interconnection between the volumes and prices layers across all the stocks considered, while a lower number of interconnections between the volatilities at different maturities is identified. furthermore, a clear financial connection between risk factors can be recognized from the multilayer visualization, which can be a useful tool for risk assessment. the paper is structured as follows. section 2 is devoted to the tensor autoregression. section 3 shows the empirical application while sect. 4 concludes. tensor regression can be formulated in different ways: the tensor structure is only in the response or the regression variable, or it can be in both. the literature related to the first specification is ample [9, 10] whilst the fully tensor-variate regression received attention only recently from the statistics and machine learning communities, employing different approaches [6, 11]. the tensor regression we are going to use is the tucker tensor regression proposed in [6]. the model is formulated making use of the contracted product, the higher-order counterpart of the matrix product [6], and can be expressed as $\mathcal{Y} = \mathcal{A} + \langle \mathcal{X}, \mathcal{B} \rangle_{I_X, J_B} + \mathcal{E}$ (1), where $\mathcal{X} \in \mathbb{R}^{N \times I_1 \times \cdots \times I_N}$ is the regressor tensor, $\mathcal{Y} \in \mathbb{R}^{N \times J_1 \times \cdots \times J_M}$ is the response tensor, $\mathcal{E} \in \mathbb{R}^{N \times J_1 \times \cdots \times J_M}$ is the error tensor, $\mathcal{A} \in \mathbb{R}^{1 \times J_1 \times \cdots \times J_M}$ is the intercept tensor, while the slope coefficient tensor, which represents the multilayer network we are interested in learning, is $\mathcal{B} \in \mathbb{R}^{I_1 \times \cdots \times I_N \times J_1 \times \cdots \times J_M}$. the subscripts $I_X$ and $J_B$ are the modes over which the product is carried out. in the context of this paper, x is a lagged version of y, hence b represents the multilinear interactions that the variables in x generate in y.
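to make the contracted-product model concrete, the sketch below estimates an unstructured coefficient tensor b by unfolding the lagged and current data into matrices and solving a ridge (tikhonov) regression; it deliberately omits the tucker structure and the als estimation used in the paper, and the data are randomly generated placeholders.

```python
# simplified, unstructured stand-in for the tucker tensor autoregression: the
# response and the lagged regressor are unfolded along the time mode and b is
# estimated with a ridge penalty. data are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
T, stocks, layers = 500, 26, 4
Y = rng.standard_normal((T, stocks, layers))      # e.g. fractionally differenced data

X_mat = Y[:-1].reshape(T - 1, -1)                 # lagged regressor, unfolded
Y_mat = Y[1:].reshape(T - 1, -1)                  # response, unfolded
lam = 1.0                                         # shrinkage parameter lambda

# ridge solution of Y_mat = X_mat @ B_mat
B_mat = np.linalg.solve(X_mat.T @ X_mat + lam * np.eye(X_mat.shape[1]),
                        X_mat.T @ Y_mat)

# 4-way coefficient tensor: B[i, j, k, l] = effect of stock i in layer j at t-1
# on stock k in layer l at t
B = B_mat.reshape(stocks, layers, stocks, layers)
print(B.shape)
```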
these interactions are generally asymmetric and take into account lagged dependencies being b the mediator between two separate in time tensor datasets. therefore, b represents a perfect candidate to use for building a multilayer network. however, the b coefficient is high dimensional. in order to resolve the issue, a tucker structure is imposed on b such that it is possible to recover the original b with smaller objects. 1 one of the advantages of the tucker structure is, contrarily to other tensor decompositions such as the parafac, that it can handle dimension asymmetric tensors since each factor matrix does not need to have the same number of components. tensor regression is prone to over-fitting when intra-mode collinearity is present. in this case, a shrinkage estimator is necessary for a stable solution. in fact, the presence of collinearity between the variables of the dataset degrades the forecasting capabilities of the regression model. in this work, we use the tikhonov regularization [12] . known also as ridge regularization, it rewrites the standard least squares problem as where λ > 0 is the regularization parameter and 2 f is the squared frobenius norm. the greater the λ the stronger is the shrinkage effect on the parameters. however, high values of λ increase the bias of the tensor coefficient b. indeed, the shrinkage parameter is usually set via data driven procedures rather than input by the user. the tikhonov regularization can be computationally very expensive for big data problem. to solve this issue, [13] proposed a decomposition of the tikhonov regularization. the learning of the model parameters is a nonlinear optimization problem that can be solved by iterative algorithms such as the alternating least squares (als) introduced by [14] for the tucker decomposition. this methodology solves the optimization problem by dividing it into small least squares problems. recently, [6] developed an als algorithm for the estimation of the tensor regression parameters with tucker structure in both the penalized and unpenalized settings. for the technical derivation refer to [6] . in this section, we show the results of the construction of the multilayer network via the tensor regression proposed in eq. 1. the dataset used in this paper is composed of stocks listed in the dow jones (dj). these stocks time series are recorded on a daily basis from 01/03/1994 up to 20/11/2019, i.e. 6712 trading days. we use 26 over the 30 listed stocks as they are the ones for which the entire time series is available. for the purpose of our analysis, we use log-differenciated prices, volumes, implied volatility at 10 days (iv10) and implied volatility at 30 days (iv30). in particular, we use the fractional difference algorithm of [7] to balance stationarity and residual memory in the data. in fact, the original time series have the full amount of memory but they are non-stationary while integer log-differentiated data are stationary but have small residual memory due to the process of differentiation. in order to preserve the maximum amount of memory in the data, we use the fractional differentiation algorithm with different levels of fractional differentiation and then test for stationarity using the augmented dickey-fuller test [15] . we find that all the data are stationary when the order of differentiation is α = 0.2. this means that only a small amount of memory is lost in the process of differentiation. the tensor regression presented in eq. 1 has some parameters to be set, i.e. 
the tucker rank and the shrinkage parameter λ for the penalized estimation of eq. 2 as discussed in [6] . regarding the tucker rank, we used the full rank specification since we do not want to reduce the number of independent links. in fact, using a reduced rank would imply common factors to be mapped together, an undesirable feature for this application. regarding the shrinkage parameter λ, we selected the value as follows. first, we split the data in a training set composed of 90% of the sample and in a test set with the remaining 10%. we then estimated the regression coefficients for different values of λ on the training set and then we computed the predicted r 2 on the test set. we used a grid of λ = 0, 1, 5, 10, 20, 50. and the predicted r 2 is maximized at λ = 0 (no shrinkage). in this section, we show the results of the analysis carried out with the data presented in sect. 3.1. the multilayer network built via the estimated tensor autoregression coefficient b represents the interconnections between and within each layer. in particular b i,j,k,l is the connection between stock i in layer j and stock k in layer l. it is important to notice that the estimated dependencies are in general not symmetric, i.e. b i,j,k,l = b k,j,i,l . however, the multilayer network constructed using b is fully connected. for this reason, a method for filtering those networks is necessary. different methodologies are available for filtering information from complex networks [8, 16] . in this paper, we use the polya filter of [8] as it can handle directed weighted networks and it is both flexible and statistically driven. in fact, it employs a tuning parameter a that drives the strength of the filter and returns the p-values for the null hypotheses of random interactions. we filter every network independently (both intra and inter connections) using a parametrization such that 90% of the total links are removed. 2 in order to asses the dependency across the layers, we analyze two standard multilayer network measures, i.e. inter-layer assortativity and edge overlapping. a standard way to quantify inter-layer assortativity is to calculate pearson's correlation coefficient over degree sequences of two layers and it represents a measure of association between layers. high positive (negative) values of such measure mean that the two risk factors act in the same (opposite) direction. instead, overlapping edges are the links between pair of stocks present contemporaneously in two layers. high values of such measure mean that the stocks have common connections behaviour. as it can be possible to see from fig. 1 , prices and volatility have a huge portion of overlapping edges, still, these layers are disassortative as the correlation between the nodes sequence across the two layer is negative. this was an expected result since the negative relationship between prices and volatility is a stylized fact in finance. not surprisingly, the two measures of volatility are highly assortative and have a huge fraction of overlapping edges. finally, we show in fig. 2 the filtered multilayer network constructed via the tensor coefficient b estimated via the tensor autoregression of eq. 1. as it can be possible to notice, the volumes layer has more interlayer connections rather than intralayer connections. since each link represents the effect that one variable has on itself and other variables in the future, this means that stocks' liquidity risk mostly influences future prices and expected uncertainty. 
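the two layer-comparison measures discussed above can be sketched as follows; the polya filter is not reproduced here, so a hard threshold on |b| stands in for the filtering step, and the coefficient tensor is a random placeholder.

```python
# sketch of inter-layer assortativity and edge overlap computed from a 4-way
# coefficient tensor B (stocks x layers x stocks x layers); a threshold on |B|
# stands in for the polya filter used in the paper.
import numpy as np
from scipy.stats import pearsonr

def layer_adjacency(B, l, threshold):
    # intra-layer network of layer l: links from stock i to stock k within layer l
    return (np.abs(B[:, l, :, l]) > threshold).astype(int)

def interlayer_assortativity(B, l1, l2, threshold):
    d1 = layer_adjacency(B, l1, threshold).sum(axis=1)   # degree sequence, layer l1
    d2 = layer_adjacency(B, l2, threshold).sum(axis=1)   # degree sequence, layer l2
    return pearsonr(d1, d2)[0]

def edge_overlap(B, l1, l2, threshold):
    a1, a2 = layer_adjacency(B, l1, threshold), layer_adjacency(B, l2, threshold)
    both = np.logical_and(a1, a2).sum()
    either = np.logical_or(a1, a2).sum()
    return both / either if either else 0.0

B = np.random.default_rng(0).standard_normal((26, 4, 26, 4))   # placeholder tensor
print(interlayer_assortativity(B, 0, 3, threshold=1.5))
print(edge_overlap(B, 0, 3, threshold=1.5))
```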
the two volatility networks have a relatively small number of interlayer connections despite being assortative. this could be due to the fact that volatility risk tends to increase or decrease within a specific maturity rather than across maturities. it is also possible to notice that more central stocks, depicted as bigger nodes in fig. 2, have more connections, but this feature does not directly translate into a higher strength (depicted as a darker colour of the nodes). this is a feature already emphasized in [3] for financial networks. fig. 2. estimated multilayer network. node colour: log-log scale; a darker colour is associated with a higher strength of the node. node size: log-log scale; a larger size is associated with a higher k-coreness score. edge colour: uniform. from a financial point of view, such a graphical representation puts together three different aspects of financial risk: market risk, liquidity risk (in terms of volumes exchanged) and forward-looking uncertainty measures, which account for expected volatility risk. in fact, the stocks in the volumes layer are not strongly interconnected but produce a huge amount of risk propagation through prices and volatility. understanding the dynamics of such a multilayer network representation would be a useful tool for risk managers in order to understand risk balances and propose risk mitigation techniques. in this paper, we proposed a methodology to build a multilayer network via the estimated coefficient of the tucker tensor autoregression of [6]. this methodology, in combination with a filtering technique, has proven able to reproduce interconnections between different financial risk factors. these interconnections can be easily mapped to real financial mechanisms and can be a useful tool for monitoring risk, as the topology within and between layers can be strongly affected in distressed periods. in order to preserve the maximum amount of memory in the data while requiring stationarity, we made use of fractional differentiation and found that the variables analyzed are stationary with a differentiation order of α = 0.2. the model can be extended to a dynamic framework in order to analyze the dependency structures under different market conditions.
the multiplex dependency structure of financial markets risk diversification: a study of persistence with a filtered correlation-network approach systemic liquidity contagion in the european interbank market the structure and dynamics of multilayer networks unveil stock correlation via a new tensor-based decomposition method predicting multidimensional data via tensor learning a fast fractional difference algorithm a pólya urn approach to information filtering in complex networks tensor regression with applications in neuroimaging data analysis parsimonious tensor response regression tensor-on-tensor regression on the stability of inverse problems a decomposition of the tikhonov regularization functional oriented to exploit hybrid multilevel parallelism principal component analysis of three-mode data by means of alternating least squares algorithms introduction to statistical time series complex networks on hyperbolic surfaces key: cord-005090-l676wo9t authors: gao, chao; liu, jiming; zhong, ning title: network immunization and virus propagation in email networks: experimental evaluation and analysis date: 2010-07-14 journal: knowl inf syst doi: 10.1007/s10115-010-0321-0 sha: doc_id: 5090 cord_uid: l676wo9t network immunization strategies have emerged as possible solutions to the challenges of virus propagation. in this paper, an existing interactive model is introduced and then improved in order to better characterize the way a virus spreads in email networks with different topologies. the model is used to demonstrate the effects of a number of key factors, notably nodes’ degree and betweenness. experiments are then performed to examine how the structure of a network and human dynamics affects virus propagation. the experimental results have revealed that a virus spreads in two distinct phases and shown that the most efficient immunization strategy is the node-betweenness strategy. moreover, those results have also explained why old virus can survive in networks nowadays from the aspects of human dynamics. the internet, the scientific collaboration network and the social network [15, 32] . in these networks, nodes denote individuals (e.g. computers, web pages, email-boxes, people, or species) and edges represent the connections between individuals (e.g. network links, hyperlinks, relationships between two people or species) [26] . there are many research topics related to network-like environments [23, 34, 46] . one interesting and challenging subject is how to control virus propagation in physical networks (e.g. trojan viruses) and virtual networks (e.g. email worms) [26, 30, 37] . currently, one of the most popular methods is network immunization where some nodes in a network are immunized (protected) so that they can not be infected by a virus or a worm. after immunizing the same percentages of nodes in a network, the best strategy can minimize the final number of infected nodes. valid propagation models can be used in complex networks to predict potential weaknesses of a global network infrastructure against worm attacks [40] and help researchers understand the mechanisms of new virus attacks and/or new spreading. at the same time, reliable models provide test-beds for developing or evaluating new and/or improved security strategies for restraining virus propagation [48] . researchers can use reliable models to design effective immunization strategies which can prevent and control virus propagation not only in computer networks (e.g. worms) but also in social networks (e.g. 
sars, h1n1, and rumors). today, more and more researchers from statistical physics, mathematics, computer science, and epidemiology are studying virus propagation and immunization strategies. for example, computer scientists focus on algorithms and the computational complexities of strategies, i.e. how to quickly search a short path from one "seed" node to a targeted node just based on local information, and then effectively and efficiently restrain virus propagation [42] . epidemiologists focus on the combined effects of local clustering and global contacts on virus propagation [5] . generally speaking, there are two major issues concerning virus propagation: 1. how to efficiently restrain virus propagation? 2. how to accurately model the process of virus propagation in complex networks? in order to solve these problems, the main work in this paper is to (1) systematically compare and analyze representative network immunization strategies in an interactive email propagation model, (2) uncover what the dominant factors are in virus propagation and immunization strategies, and (3) improve the predictive accuracy of propagation models through using research from human dynamics. the remainder of this paper is organized as follows: sect. 2 surveys some well-known network immunization strategies and existing propagation models. section 3 presents the key research problems in this paper. section 4 describes the experiments which are performed to compare different immunization strategies with the measurements of the immunization efficiency, the cost and the robustness in both synthetic networks (including a synthetic community-based network) and two real email networks (the enron and a university email network), and analyze the effects of network structures and human dynamics on virus propagation. section 5 concludes the paper. in this section, several popular immunization strategies and typical propagation models are reviewed. an interactive email propagation model is then formulated in order to evaluate different immunization strategies and analyze the factors that influence virus propagation. network immunization is one of the well-known methods to effectively and efficiently restrain virus propagation. it cuts epidemic paths through immunizing (injecting vaccines or patching programs) a set of nodes from a network following some well-defined rules. the immunized nodes, in most published research, are all based on node degrees that reflect the importance of a node in a network, to a certain extent. in this paper, the influence of other properties of a node (i.e. betweenness) on immunization strategies will be observed. pastor-satorras and vespignani have studied the critical values in both random and targeted immunization [39] . the random immunization strategy treats all nodes equally. in a largescale-free network, the immunization critical value is g c → 1. simulation results show that 80% of nodes need to be immunized in order to recover the epidemic threshold. dezso and barabasi have proposed a new immunization strategy, named as the targeted immunization [9] , which takes the actual topology of a real-world network into consideration. the distributions of node degrees in scale-free networks are extremely heterogeneous. a few nodes have high degrees, while lots of nodes have low degrees. the targeted immunization strategy aims to immunize the most connected nodes in order to cut epidemic paths through which most susceptible nodes may be infected. 
for a ba network [2], the critical value of the targeted immunization strategy is $g_c \sim e^{-2/(m\lambda)}$. this formula shows that it is always possible to obtain a small critical value $g_c$ even if the spreading rate λ changes drastically. however, one of the limitations of the targeted immunization strategy is that it needs information about the global topology; in particular, the ranking of the nodes must be clearly defined. this is impractical and uneconomical for handling large-scale and dynamically evolving networks, such as p2p networks or email networks. in order to overcome this shortcoming, a local strategy, namely the acquaintance immunization [8, 16], has been developed. the motivation for the acquaintance immunization is to work without any global information. in this strategy, p% of nodes are first selected as "seeds" from a network, and then one or more of their direct acquaintances are immunized. because a node with a higher degree has more links in a scale-free network, it will be selected as a "seed" with a higher probability. thus, the acquaintance immunization strategy is more efficient than the random immunization strategy, but less efficient than the targeted immunization strategy. moreover, there is another issue which limits the effectiveness of the acquaintance immunization: it does not differentiate nodes, i.e. it randomly selects "seed" nodes and their direct neighbors [17]. another effective distributed strategy is the d-steps immunization [12, 17]. this strategy views decentralized immunization as a graph covering problem. that is, for a node v_i, it looks for a node to be immunized that has the maximal degree within d steps of v_i. this method only uses the local topological information within a certain range (e.g. the degree information of nodes within d steps). thus, the maximal acquaintance strategy can be seen as a 1-step immunization. however, it does not take into account domain-specific heuristic information, nor is it able to decide what the value of d should be in different networks. the immunization strategies described in the previous section are all based on node degrees. the way different immunized nodes are selected is illustrated in fig. 1: the targeted immunization will directly select v_5 as an immunized node based on the degrees of the nodes; supposing that v_7 is a "seed" node, v_6 will be immunized under the maximal acquaintance immunization strategy, and v_5 will be indirectly selected as an immunized node under the d-steps immunization strategy, where d = 2. fig. 2 gives an illustration of the betweenness-based strategies: if we select one immunized node, the targeted immunization strategy will directly select the highest-degree node, v_6; the node-betweenness strategy will select v_5, as it has the highest node betweenness; and the edge-betweenness strategy will select one of v_3, v_4 and v_5, because the edges l_1 and l_2 have the highest edge betweenness. besides removing the highest-degree nodes from a network, many approaches cut epidemic paths by means of increasing the average path length of a network, for example by partitioning large-scale networks based on betweenness [4, 36]. for a network, node (edge) betweenness refers to the number of shortest paths that pass through a node (edge). a higher value of betweenness means that the node (edge) links more adjacent communities and will be frequently used in network communications.
although [19] have analyzed the robustness of a network against degree-based and betweenness-based attacks, the spread of a virus in a propagation model is not considered, so the effects of different measurements on virus propagation is not clear. is it possible to restrain virus propagation, especially from one community to another, by immunizing nodes or edges which have higher betweenness. in this paper, two types of betweenness-based immunization strategies will be presented, i.e. the node-betweenness strategy and the edge-betweenness strategy. that is, the immunized nodes are selected in the descending order of node-and edge-betweenness, in an attempt to better understand the effects of the degree and betweenness centralities on virus propagation. figure 2 shows that if v 4 is immunized, the virus will not propagate from one part of the network to another. the node-betweenness strategy will select v 5 as an immunized node, which has the highest node betweenness, i.e. 41. the edge-betweenness strategy will select the terminal nodes of l 1 or l 2 (i.e. v 3 , v 4 or v 4 , v 5 ) as they have the highest edge betweenness. as in the targeted immunization, the betweenness-based strategies also require information about the global betweenness of a network. the experiments presented in this paper is to find a new measurement that can be used to design a highly efficient immunization strategy. the efficiency of these strategies is compared both in synthetic networks and in real-world networks, such as the enron email network described by [4] . in order to compare different immunization strategies, a propagation model is required to act as a test-bed in order to simulate virus propagation. currently, there are two typical models: (1) the epidemic model based on population simulation and (2) an interactive email model which utilizes individual-based simulation. lloyd and may have proposed an epidemic propagation model to characterize virus propagation, a typical mathematical model based on differential equations [26] . some specific epidemic models, such as si [37, 38] , sir [1, 30] , sis [14] , and seir [11, 28] , have been developed and applied in order to simulate virus propagation and study the dynamic characteristics of whole systems. however, these models are all based on the mean-filed theory, i.e. differential equations. this type of black-box modeling approach only provides a macroscopic understanding of virus propagation-they do not give much insight into microscopic interactive behavior. more importantly, some assumptions, such as a fully mixed (i.e. individuals that are connected with a susceptible individual will be randomly chosen from the whole population) [33] and equiprobable contacts (i.e. all nodes transmit the disease with the same probability and no account is taken of the different connections between individuals) may not be valid in the real world. for example, in email networks and instant message (im) networks, communication and/or the spread of information tend to be strongly clustered in groups or communities that have more closer relationships rather than being equiprobable across the whole network. these models may also overestimate the speed of propagation [49] . in order to overcome the above-mentioned shortcomings, [49] have built an interactive email model to study worm propagation, in which viruses are triggered by human behavior, not by contact probabilities. 
thus, virus propagation in the email network is mainly determined by two behavioral factors: email-checking time intervals (t i ) and email-clicking probabilities (p i ), where i ∈ [1, n ] and n is the total number of users in a network. t i is determined by a user's own habits; p i is determined both by user security awareness and by the efficiency of the firewall. however, the authors do not provide much information about how to restrain worm propagation. in this paper, an interactive email model is used as a test-bed to study the characteristics of virus propagation and the efficiency of different immunization strategies. this model makes it easy to observe the microscopic process of worm propagation and to uncover the effects of different factors (e.g. the power-law exponent, human dynamics and the average path length of the network) on virus propagation and immunization strategies. unlike other studies, this paper mainly focuses on comparing the performance of degree-based strategies and betweenness-based strategies, rather than on the critical epidemic value of a network. a detailed analysis of the propagation model is given in the following section. an email network can be viewed as a typical social network in which a connection between two nodes (individuals) indicates that they have communicated with each other before [35, 49] . generally speaking, a network can be denoted as e = (v, l), where v = {v 1 , v 2 , . . . , v n } is a set of nodes and l = { v i , v j | 1 ≤ i, j ≤ n} is a set of undirected links (if v i is in the hit-list of v j , there is a link between v i and v j ). a virus can propagate along links and infect more nodes in a network. in order to give a general definition, each node is represented as a tuple with the following elements. -id: the node identifier, v i .id = i. -state: the node state, i.e. healthy = 0 if the node has no virus, danger = 1 if the node has received a virus but is not infected, infected = 2 if the node has been infected, and immunized = 3 if the node has been immunized. -nodelink: the information about its hit-list or adjacent neighbors, i.e. v i .nodelink = { i, j | i, j ∈ l}. -p behavior : the probability that a node will perform a particular behavior. -b action : different behaviors. -virusnum: the total number of new unchecked viruses before the next operation. -newvirus: the number of new viruses a node receives from its neighbors at each step. in addition, two interactive behaviors are simulated according to [49] , i.e. the email-checking time intervals and the email-clicking probabilities, both of which follow gaussian distributions when the sample size is large. for the same user i, the email-checking interval t i (t) in [49] has been modeled as a poisson process, i.e. t i (t) ∼ λe −λt . thus, the formula for p behavior in the tuple can be written as p 1 behavior = clickprob and p 2 behavior = checktime. -clickprob is the probability of a user clicking a suspected email, -checkrate is the probability of a user checking an email, -checktime is the next time the email-box will be checked, v i .p 2 behavior = v i .checktime = expgenerator(v i .checkrate). b action can be specified as b 1 action = receive_email, b 2 action = send_email, and b 3 action = update_email. if a user receives a virus-infected email, the corresponding node will update its state, i.e. v i .state ← danger.
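the node tuple and the two behavioral parameters just defined could be represented as in the sketch below; this is a python rendering for illustration only, and the attribute names and state constants are assumptions rather than identifiers taken from the original implementation.

import random
from dataclasses import dataclass, field

HEALTHY, DANGER, INFECTED, IMMUNIZED = 0, 1, 2, 3

@dataclass
class EmailNode:
    node_id: int
    neighbors: list = field(default_factory=list)   # hit-list / adjacent neighbors
    state: int = HEALTHY
    click_prob: float = 0.0    # probability of clicking a suspected email
    check_rate: float = 0.0    # mean email-checking interval
    check_time: float = 0.0    # next time the email-box is checked
    virus_num: int = 0         # unchecked virus emails before the next check
    new_virus: int = 0         # virus emails received from neighbors at this step

def init_behavior(node, mu_p=0.5, sigma_p=0.3, mu_t=40.0, sigma_t=20.0, rng=random):
    # draw the click probability and checking rate from gaussians (the sect. 4.1 parameters),
    # then sample the next checking time from an exponential (poisson-process) interval
    node.click_prob = min(max(rng.gauss(mu_p, sigma_p), 0.0), 1.0)
    node.check_rate = max(rng.gauss(mu_t, sigma_t), 1.0)
    node.check_time = rng.expovariate(1.0 / node.check_rate)
    return node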
if a user opens an email that has a virus-infected attachment, the node will adjust its state, i.e. v i .state ← in f ected, and send this virus email to all its friends, according to its hit-list. if a user is immunized, the node will update its state to v i .state ← immuni zed. in order to better characterize virus propagation, some assumptions are made in the interactive email model: -if a user opens an infected email, the node is infected and will send viruses to all the friends on its hit-list; -when checking his/her mailbox, if a user does not click virus emails, it is assumed that the user deletes the suspected emails; -if nodes are immunized, they will never send virus emails even if a user clicks an attachment. the most important measurement of the effectiveness of an immunization strategy is the total number of infected nodes after virus propagation. the best strategy can effectively restrain virus propagation, i.e. the total number of infected nodes is kept to a minimum. in order to evaluate the efficiency of different immunization strategies and find the relationship between local behaviors and global dynamics, two statistics are of particular interest: 1. sid: the sum of the degrees of immunized nodes that reflects the importance of nodes in a network 2. apl: the average path length of a network. this is a measurement of the connectivity and transmission capacity of a network where d i j is the shortest path between i and j. if there is no path between i and j, d i j → ∞. in order to facilitate the computation, the reciprocal of d i j is used to reflect the connectivity of a network: if there is no path between i and j, d −1 i j = 0. based on these definitions, the interactive email model given in sect. 2.3 can be used as a test-bed to compare different immunization strategies and uncover the effects of different factors on virus propagation. the specific research questions addressed in this paper can be summarized as follows: 1. how to evaluate network immunization strategies? how to determine the performance of a particular strategy, i.e. in terms of its efficiency, cost and robustness? what is the best immunization strategy? what are the key factors that affect the efficiency of a strategy? 2. what is the process of virus propagation? what effect does the network structure have on virus propagation? 3. what effect do human dynamics have on virus propagation? the simulations in this paper have two phases. first, a existing email network is established in which each node has some of the interactive behaviors described in sect. 2.3. next, the virus propagation in the network is observed and the epidemic dynamics are studied when applying different immunization strategies. more details can be found in sect. 4. in this section, the simulation process and the structures of experimental networks are presented in sects. 4.1 and 4.2. section 4.3 uses a number of experiments to evaluate the performance (e.g. efficiency, cost and robustness) of different immunization strategies. specifically, the experiments seek to address whether or not betweenness-based immunization strategies can restrain worm propagation in email networks, and which measurements can reflect and/or characterize the efficiency of immunization strategies. finally, sects. 4.4 and 4.5 presents an in-depth analysis in order to determine the effect of network structures and human dynamics on virus propagation. the experimental process is illustrated in fig. 3 . 
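the two statistics defined above can be computed directly from the graph; the following sketch (assuming networkx) evaluates sid for a set of immunized nodes and the reciprocal-distance version of apl in which unreachable pairs contribute zero. the test graph and the 10-node protection set are illustrative assumptions.

import networkx as nx

def sid(g, immunized):
    # sum of the degrees of the immunized nodes
    return sum(g.degree(v) for v in immunized)

def apl_reciprocal(g):
    # average of d_ij^(-1) over ordered pairs; pairs with no path contribute 0
    n = g.number_of_nodes()
    total = 0.0
    for src, dists in nx.all_pairs_shortest_path_length(g):
        for dst, d in dists.items():
            if dst != src:
                total += 1.0 / d
    return total / (n * (n - 1))

g = nx.barabasi_albert_graph(200, 2, seed=0)
protected = [v for v, _ in sorted(g.degree, key=lambda kv: kv[1], reverse=True)[:10]]
h = g.copy()
h.remove_nodes_from(protected)     # immunized nodes no longer transmit
print("sid:", sid(g, protected), "apl:", round(apl_reciprocal(h), 4))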
some nodes are first immunized (protected) using different strategies. the viruses are then injected into the network in order to evaluate the efficiency of those strategies by comparing the total number of infected nodes. two methods are used to select the initially infected nodes: random infection and malicious infection, i.e. infecting the nodes with maximal degrees. the user behavior parameters are based on the definitions in sect. 2.3, where μ p = 0.5, σ p = 0.3, μ t = 40, and σ t = 20. since the process of email worm propagation is stochastic, all results are averaged over 100 runs. the virus propagation algorithm is specified in alg. 1. many common networks exhibit the scale-free phenomenon [2, 21] , in which nodes' degrees follow a power-law distribution [42] , i.e. the fraction of nodes having k edges, p(k), decays according to a power law p(k) ∼ k −α (where α is usually between 2 and 3) [29] . recent research has shown that email networks also follow power-law distributions with a long tail [35, 49] . therefore, in this paper, three synthetic power-law networks and a synthetic community-based network are generated using the glp algorithm [6] , in which the power-law exponent can be tuned. the three synthetic networks all have 1000 nodes, with α = 1.7, 2.7, and 3.7, respectively. the statistical characteristics and visualization of the synthetic community-based network are shown in table 1 and fig. 4c, f , respectively. in order to reflect the characteristics of a real-world network, the enron email network, built by andrew fiore and jeff heer, and the university email network, compiled by members of the university rovira i virgili (tarragona), will also be studied. the structure and degree distributions of these networks are shown in table 2 and fig. 4 . in particular, the cumulative distributions are estimated with maximum likelihood using the method provided by [7] . the degree statistics are shown in table 9 . in this section, a comparison is made of the effectiveness of different strategies in an interactive email model. experiments are then used to evaluate the cost and robustness of each strategy. algorithm 1 (virus propagation). input: nodedata[nodenum] stores the topology of an email network, timestep is the system clock, and v 0 is the set of initially infected nodes. output: simnum[timestep][k] stores the number of infected nodes in the network in the k-th simulation.
(1) for k = 1 to runtime // run 100 times to obtain an average value
(2) nodedata[nodenum] ← initialize an email network as well as users' checking times and clicking probabilities;
(3) nodedata[nodenum] ← choose immunized nodes based on the given immunization strategy and adjust their states;
(4) while timestep < endsimul // there are 600 steps in each run
(5) for i = 1 to nodenum
(6) if nodedata[i].checktime == 0
(7) prob ← compute the probability of opening a virus-infected email based on the user's clickprob and virusnum
(8)-(11) if a virus email is opened, send a virus to all friends according to the node's hit-list
(12) endif
(13) endif
(14) endfor
(15) for i = 1 to nodenum
(16) update the next checktime based on the user's checkrate
(17) nodedata ...
the immunization efficiency of the following immunization strategies is compared: the targeted and random strategies [39] , the acquaintance strategy (random and maximal neighbor) [8, 16] , the d-steps strategy (d = 2 and d = 3) [12, 17] , which is introduced in sect. 2.1, and the proposed betweenness-based strategies (node- and edge-betweenness). (table 1 also reports, for the synthetic community-based network, 100 bridges between different communities and, for the whole network, α = 1.77 and k = 8.34.)
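a simplified, runnable rendering of the propagation loop in algorithm 1 is sketched below; it reuses the EmailNode objects and state constants sketched earlier, and the way the opening probability is aggregated over pending virus emails is an assumption made for illustration, not the authors' exact rule.

import random

def run_once(nodes, immunized, initial_infected, end_time=600, rng=random):
    # nodes: dict mapping node id to EmailNode; immunized/initial_infected: iterables of ids
    for v in immunized:
        nodes[v].state = IMMUNIZED
    for v in initial_infected:
        if nodes[v].state != IMMUNIZED:
            nodes[v].state = INFECTED
            for u in nodes[v].neighbors:         # send the virus to the hit-list
                nodes[u].new_virus += 1
    infected_over_time = []
    for t in range(end_time):
        for node in nodes.values():
            node.virus_num += node.new_virus
            node.new_virus = 0
            if node.check_time <= t:
                if node.state not in (IMMUNIZED, INFECTED) and node.virus_num > 0:
                    # probability of opening at least one of the pending virus emails
                    p_open = 1.0 - (1.0 - node.click_prob) ** node.virus_num
                    if rng.random() < p_open:
                        node.state = INFECTED
                        for u in node.neighbors:
                            nodes[u].new_virus += 1
                node.virus_num = 0               # unclicked suspected emails are deleted
                node.check_time = t + rng.expovariate(1.0 / node.check_rate)
        infected_over_time.append(sum(n.state == INFECTED for n in nodes.values()))
    return infected_over_time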
in the initial set of experiments, the proportion of immunized nodes (5, 10, and 30%) is varied in the synthetic networks and the enron email network. table 3 shows the simulation results in the enron email network, which is initialized with two infected nodes. figure 5 shows the average numbers of infected nodes over time. tables 4, 5 , and 6 show the numerical results in the three synthetic networks, respectively. the simulation results show that the node-betweenness immunization strategy yields the best results (i.e. the minimum final number of infected nodes, f), except for the case where 5% of the nodes in the enron network are immunized under a malicious attack. the average degree of the enron network is k = 3.4. this means that only a few nodes have high degrees while the others have low degrees (see table 9 ). in such a network, if the nodes with maximal degrees are infected, viruses will rapidly spread in the network and the final number of infected nodes will be larger than in other cases. the targeted strategy therefore does not perform any better than the node-betweenness strategy. in fact, as the number of immunized nodes increases, the efficiency of the node-betweenness immunization increases proportionally more than that of the targeted strategy. (in these simulations there are two infected nodes under each attack mode; if there is no immunization, the final number of infected nodes is 937 with a random attack and 942 with a malicious attack, apl = 751.36 × 10 −4 , and the total simulation time is t = 600.) therefore, if global topological information is available, the node-betweenness immunization is the best strategy. the maximal sid is obtained using the targeted immunization. however, the final number of infected nodes (f) is consistent with the average path length (apl) but not with the sid. that is to say, controlling a virus epidemic does not depend on the degrees of the immunized nodes but on the path length of the whole network. this also explains why the efficiency of the node-betweenness immunization strategy is better than that of the targeted immunization strategy: the node-betweenness immunization selects nodes based on the average path length, while the targeted immunization strategy selects nodes based on the size of their degrees. a more in-depth analysis is undertaken by comparing the change of the apl with respect to the different strategies used in the synthetic networks. the results are shown in fig. 6 . figure 7a, b compare the change of the final number of infected nodes over time, corresponding to fig. 6c, d , respectively. these numerical results validate the previous assertion that the average path length can be used as a measurement to design an effective immunization strategy. the best strategy is to divide the whole network into different sub-networks and increase the average path length of the network, hence cutting the epidemic paths. in this paper, all comparative results are averaged over 100 runs using the same infection model (i.e. the virus propagation is compared for both random and malicious attacks) and the same user behavior model (i.e. all simulations use the same behavior parameters, as shown in sect. 4.1). thus, it is more reasonable and feasible to evaluate only how the propagation of a virus is affected by the immunization strategies, i.e. avoiding the effects caused by the stochastic process, the infection model and the user behavior.
it can be seen that the edge-betweenness strategy is able to find some nodes with high degree centrality and then integrally divide a network into a number of sub-networks (e.g. v 4 in fig. 2 ). however, compared with the nodes (e.g. v 5 in fig. 2 ) selected by the node-betweenness strategy, the nodes with higher edge betweenness cannot cut the epidemic paths, as they cannot effectively break the whole structure of a network. in addition to the illustration in fig. 2 , the synthetic community-based network and the university email network are used as examples to illustrate why the edge-betweenness strategy cannot obtain the same immunization efficiency as the node-betweenness strategy. to select two immunized nodes from fig. 2 , the node-betweenness immunization will select {v 5 , v 3 } by using the descending order of node betweenness. however, the edge-betweenness strategy can select {v 3 , v 4 } or {v 4 , v 5 }, because the edges l 1 and l 2 have the highest edge betweenness. this result shows that the node-betweenness strategy can not only effectively divide the whole network into two communities, but also break the interior structure of the communities. although the edge-betweenness strategy can integrally divide the whole network into two parts, viruses can still propagate within each community. many networks commonly contain the structure shown in fig. 2 , for example the enron email network and the university email network. table 7 and fig. 8 present the results for the synthetic community-based network. table 8 compares different strategies in the university email network, which also has some self-similar community structures [18] . these results further validate the analysis stated above. from the above experiments, the following conclusions can be made: 1. as shown in tables 4-8 , apl can be used as a measurement to evaluate the efficiency of an immunization strategy. thus, when designing a distributed immunization strategy, attention should be paid to those nodes that have the largest impact on the apl value. 2. if the final number of infected nodes is used as a measure of efficiency, then the node-betweenness immunization strategy is more efficient than the targeted immunization strategy. 3. the power-law exponent (α) affects the edge-betweenness immunization strategy, but has little impact on the other strategies. in the previous section, the efficiency of different immunization strategies was evaluated in terms of the final number of infected nodes when the propagation reaches an equilibrium state. from the experiments in the synthetic networks, the synthetic community-based network, the enron email network and the university email network, it is easy to see that the node-betweenness immunization strategy has the highest efficiency. in this section, the performance of the different strategies will be evaluated in terms of cost and robustness, as in [20] . it is well known that the structure of a social network or an email network constantly evolves. it is therefore interesting to evaluate how changes in structure affect the efficiency of an immunization strategy. -the cost can be defined as the number of nodes that need to be immunized in order to achieve a given level of epidemic prevalence ρ. generally, ρ → 0. there are some parameters of particular interest: f is the fraction of nodes that are immunized; f c is the critical value of the immunization when ρ → 0; ρ 0 is the infection density when no immunization strategy is implemented; ρ f is the infection density with a given immunization strategy.
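the critical fraction f c introduced above can be estimated numerically by sweeping the immunized fraction f and monitoring the reduced prevalence ρ f /ρ 0 ; in the sketch below, simulate_prevalence stands for an average over repeated runs of the email model and is an assumed helper, not something provided by the paper.

import numpy as np

def critical_fraction(g, strategy, simulate_prevalence, eps=1e-3, runs=100):
    # strategy(g, k) returns k nodes to immunize; simulate_prevalence returns the final
    # infection density averaged over `runs` simulations
    n = g.number_of_nodes()
    rho0 = simulate_prevalence(g, immunized=[], runs=runs)
    for f in np.arange(0.0, 0.55, 0.05):
        immunized = strategy(g, int(f * n))
        rho_f = simulate_prevalence(g, immunized=immunized, runs=runs)
        if rho0 > 0 and rho_f / rho0 < eps:
            return f, rho_f / rho0
    return None, None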
figure 9 shows the relationship between the reduced prevalence ρ f /ρ 0 and f. it can be seen that the node-betweenness immunization has the lowest prevalence for the smallest number of protected nodes. the immunization cost increases as the value of α increases, i.e. in order to achieve epidemic prevalence ρ → 0, the node-betweenness immunization strategy needs 20, 25, and 30% of nodes to be immunized, respectively, in the three synthetic networks. this is because the node-betweenness immunization strategy can effectively break the network structure and increase the path length of a network with the same number of immunized nodes. -the robustness shows a plot of tolerance against the dynamic evolution of a network, i.e. the change of power-law exponents (α). figure 10 shows the relationship between the immunized threshold f c and α. a low level of f c with a small variation indicates that the immunization strategy is robust. the robustness is important when an immunization strategy is deployed into a scalable and dynamic network (e.g. p2p and email networks). figure 10 also shows the robustness of the d-steps immunization strategy is close to that of the targeted immunization; the node-betweenness strategy is the most robust. [49] have compared virus propagation in synthetic networks with α = 1.7 and α = 1.1475, and pointed out that initial worm propagation has two phases. however, they do not give a detailed explanation of these results nor do they compare the effect of the power-law exponent on different immunization strategies during virus propagation. table 9 presents the detailed degree statistics for different networks, which can be used to examine the effect of the power-law exponent on virus propagation and immunization strategies. first, virus propagation in non-immunized networks is discussed. figure 11a shows the changes of the average number of infected nodes over time; fig. 11b gives the average degree of infected nodes at each time step. from the results, it can be seen that 1. the number of infected nodes in non-immunized networks is determined by attack modes but not the power-law exponent. in figs. 11a , b, three distribution curves (α = 1.7, 2.7, and 3.7) overlap with each other in both random and malicious attacks. the difference between them is that the final number of infected nodes with a malicious attack is larger than that with a random attack, as shown in fig. 11a , reflecting the fact that a malicious attack is more dangerous than a random attack. 2. a virus spreads more quickly in a network with a large power-law exponent than that with a small exponent. because a malicious attack initially infects highly connected nodes, the average degree of the infected nodes decreases in a shorter time comparing to a random attack (t 1 < t 2). moreover, the speed and range of the infection is amplified by those highly connected nodes. in phase i, viruses propagate very quickly and infect most nodes in a network. however, in phase ii, the number of total infected nodes grows slowly (fig. 11a) , because viruses aim to infect those nodes with low degrees (fig. 11b) , and a node with fewer links is more difficult to be infected. in order to observe the effect of different immunization strategies on the average degree of infected nodes in different networks, 5% of the nodes are initially protected against random and malicious attacks. figure 12 shows the simulation results. from this experiment, it can be concluded that 1. 
the random immunization has no effect on restraining virus propagation, because the curves of the average degree of the infected nodes basically coincide with the curves in the non-immunization case. 2. comparing fig. 12a, b, c and d, e, f , respectively, it can be seen that the peak value of the average degree is the largest in the network with α = 1.7 and the smallest in the network with α = 3.7. this is because the network with a lower exponent has more highly connected nodes (i.e. the range of degrees is between 50 and 80), which serve as amplifiers in the process of virus propagation. 3. as α increases, so do the number of infected nodes and the virus propagation duration (t 1 < t 2 < t 3). because a larger α implies a larger apl, the number of infected nodes will increase; if the network has a larger exponent, a virus needs more time to infect those nodes with medium or low degrees. first, consider the process of virus propagation in the case of a malicious attack where 30% of the nodes are immunized using the edge-betweenness immunization strategy. there are two intersections in fig. 13a . point a is the intersection of the two curves net1 and net3, and point b is the intersection of net2 and net1. under the same conditions, fig. 13a shows that the total number of infected nodes is the largest in net1 in phase i. correspondingly, fig. 13b shows that the average degree of infected nodes in net1 is the largest in phase i. as time goes on, the rate at which the average degree falls is the fastest in net1, as shown in fig. 13b . this is because there are more highly connected nodes in net1 than in the others (see table 9 ). after these highly connected nodes are infected, viruses attempt to infect the nodes with low degrees. therefore, the average degree in net3, which has the smallest power-law exponent, is larger than those of the other networks in phases ii and iii. the total number of infected nodes in net3 continuously increases, exceeding those in net1 and net2. the same phenomenon also appears in the targeted immunization strategy, as shown in fig. 14 (which plots the average number of infected nodes and the average degree of infected nodes over time when the virus spreads in different networks, with the targeted immunization applied first to protect 30% of the nodes). the email-checking intervals in the above interactive email model (see sect. 2.3) are modeled using a poisson process. the poisson distribution is widely used in many real-world models to statistically describe human activities, e.g. in terms of statistical regularities in the frequency of certain events within a period of time [25, 49] . however, statistics from user log files and databases that record information about human activities show that most observations of human behavior deviate from a poisson process. that is to say, when a person engages in certain activities, his/her waiting intervals follow a power-law distribution with a long tail [27, 43] . vazquez et al. [44] have tried to incorporate an email-sending interval distribution, characterized by a power law, into a virus propagation model. however, their model assumes that a user is instantly infected after he/she receives a virus email, and it ignores the impact of anti-virus software and the security awareness of users. therefore, there are some gaps between their model and the real world. in this section, the statistical properties associated with a single user sending emails are analyzed based on the enron dataset [41] .
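heavy-tailed waiting times of the kind discussed above can be generated by inverse-transform sampling, alongside the exponential (poisson-process) intervals used in the original model; in the sketch below, the exponent value 1.3 matches the fit discussed in the following paragraphs, while the cutoff t_min and the sample sizes are illustrative assumptions.

import numpy as np

def powerlaw_intervals(n, alpha=1.3, t_min=1.0, rng=None):
    # inverse-transform sampling of waiting times with p(t) ~ t^(-alpha) for t >= t_min
    rng = rng or np.random.default_rng(0)
    u = rng.random(n)
    return t_min * (1.0 - u) ** (-1.0 / (alpha - 1.0))

def exponential_intervals(n, mean=40.0, rng=None):
    # poisson-process (exponential) waiting times, as in the original model
    rng = rng or np.random.default_rng(0)
    return rng.exponential(mean, n)

pl = powerlaw_intervals(10000)
ex = exponential_intervals(10000)
print("power-law median:", round(float(np.median(pl)), 2),
      "exponential median:", round(float(np.median(ex)), 2))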
the virus spreading process is then simulated using an improved interactive email model in order to observe the effect of human behavior on virus propagation. research results from the study of statistical regularities or laws of human behavior based on empirical data can offer a valuable perspective to social scientists [45, 47] . previous studies have also used models to characterize the behavioral features of sending emails [3, 13, 22] , but their correctness needs to be further empirically verified, especially in view of the fact that there exist variations among different types of users. in this paper, the enron email dataset is used to identify the characteristics of human email-handling behavior. due to the limited space, table 10 presents only a small amount of the employee data contained in the database. as can be seen from the table, the interval distribution of email sent by the same user is respectively measured using different granularities: day, hour, and minute. figure 15 shows that the waiting intervals follow a heavy-tailed distribution. the power-law exponent as the day granularity is not accurate because there are only a few data points. if more data points are added, a power-law distribution with long tail will emerge. note that, there is a peak at t = 16 as measured at an hour granularity. eckmann et al. [13] have explained that the peak in a university dataset is the interval between the time people leave work and the time they return to their offices. after curve fitting, see fig. 15 , the waiting interval exponent is close to 1.3, i.e. α ≈ 1.3 ± 0.5. although it has been shown that an email-sending distribution follows a power-law by studying users in the enron dataset, it is still not possible to assert that all users' waiting intervals follow a power-law distribution. it can only be stated that the distribution of waiting intervals has a long-tail characteristic. it is also not possible to measure the intervals between email checking since there is no information about login time in the enron dataset. however, combing research results from human web browsing behavior [10] and the effect of non-poisson activities on propagation in the barabasi group [44] , it can be found that there are similarities between the distributions of email-checking intervals and email-sending intervals. the following section uses a power-law distribution to characterize the behavior associated with email-checking in order to observe the effect human behavior has on the propagation of an email virus. based on the above discussions, a power-law distribution is used to model the email-checking intervals of a user i, instead of the poisson distribution used in [49] , i.e. t i (τ ) ∼ τ −α . an analysis of the distribution of the power-law exponent (α) for different individuals in web browsing [10] and in the enron dataset shows that the power-law exponent is approximately 1.3. in order to observe and quantitatively analyze the effect that the email-checking interval has on virus propagation, the email-clicking probability distribution (p i ) in our model is consistent with the one used by [49] , i.e. the security awareness of different users in the network follows a normal distribution, p i ∼ n (0.5, 0.3 2 ). figure 16 shows that following a random attack viruses quickly propagate in the enron network if the email-checking intervals follow a power-law distribution. the results are consistent with the observed trends in real computer networks [31] , i.e. 
viruses initially spread explosively, then enter a long latency period before becoming active again following user activity. the explanation for this is that users frequently have a short period of focused activity followed by a long period of inactivity. thus, although old viruses may be killed by anti-virus software, they can still intermittently break out in a network. that is because some viruses are hidden by inactive users, and cannot be found by anti-virus software. when the inactive users become active, the virus will start to spread again. the effect of human dynamics on virus propagation in three synthetic networks is also analyzed by applying the targeted [9] , d-steps [17] and aoc-based strategy [24] . the numerical results are shown in table. 11 and fig. 17 . from the above experiments, the following conclusions can be made: 1. based on the enron email dataset and recent research on human dynamics, the emailchecking intervals in an interactive email model should be assigned based on a power-law distribution. 2. viruses can spread very quickly in a network if users' email-checking intervals follow a power-law distribution. in such a situation, viruses grow explosively at the initial stage and then grow slowly. the viruses remain in a latent state and await being activated by users. in this paper, a simulation model for studying the process of virus propagation has been described, and the efficiency of various existing immunization strategies has been compared. in particular, two new betweenness-based immunization strategies have been presented and validated in an interactive propagation model, which incorporates two human behaviors based on [49] in order to make the model more practical. this simulation-based work can be regarded as a contribution to the understanding of the inter-reactions between a network structure and local/global dynamics. the main results are concluded as follows: 1. some experiments are used to systematically compare different immunization strategies for restraining epidemic spreading, in synthetic scale-free networks including the community-based network and two real email networks. the simulation results have shown that the key factor that affects the efficiency of immunization strategies is apl, rather than the sum of the degrees of immunized nodes (sid). that is to say, immunization strategy should protect nodes with higher connectivity and transmission capability, rather than those with higher degrees. 2. some performance metrics are used to further evaluate the efficiency of different strategies, i.e. in terms of their cost and robustness. simulation results have shown that the d-steps immunization is a feasible strategy in the case of limited resources and the nodebetweenness immunization is the best if the global topological information is available. 3. the effects of power-law exponents and human dynamics on virus propagation are analyzed. more in-depth experiments have shown that viruses spread faster in a network with a large power-law exponent than that with a small one. especially, the results have explained why some old viruses can still propagate in networks up till now from the perspective of human dynamics. 
the mathematical theory of infectious diseases and its applications emergence of scaling in random networks the origin of bursts and heavy tails in human dynamics cluster ranking with an application to mining mailbox networks small worlds' and the evolution of virulence: infection occurs locally and at a distance on distinguishing between internet power law topology generators power-law distribution in empirical data efficient immunization strategies for computer networks and populations halting viruses in scale-free networks dynamics of information access on the web a simple model for complex dynamical transitions in epidemics distance-d covering problem in scalefree networks with degree correlation entropy of dialogues creates coherent structure in email traffic epidemic threshold in structured scale-free networks on power-law relationships of the internet topology improving immunization strategies immunization of real complex communication networks self-similar community structure in a network of human interactions attack vulnerability of complex networks targeted local immunization in scale-free peer-to-peer networks the large scale organization of metabolic networks probing human response times periodic subgraph mining in dynamic networks. knowledge and information systems autonomy-oriented search in dynamic community networks: a case study in decentralized network immunization characterizing web usage regularities with information foraging agents how viruses spread among computers and people on universality in human correspondence activity enhanced: simple rules with complex dynamics network motifs simple building blocks of complex networks epidemics and percolation in small-world network code-red: a case study on the spread and victims of an internet worm the structure of scientific collaboration networks the spread of epidemic disease on networks the structure and function of complex networks email networks and the spread of computer viruses partitioning large networks without breaking communities epidemic spreading in scale-free networks epidemic dynamics and endemic states in complex networks immunization of complex networks computer virus propagation models the enron email dataset database schema and brief statistical report exploring complex networks modeling bursts and heavy tails in human dynamics impact of non-poissonian activity patterns on spreading process predicting the behavior of techno-social systems a decentralized search engine for dynamic web communities a twenty-first century science an environment for controlled worm replication and analysis modeling and simulation study of the propagation and defense of internet e-mail worms chao gao is currently a phd student in the international wic institute, college of computer science and technology, beijing university of technology. he has been an exchange student in the department of computer science, hong kong baptist university. his main research interests include web intelligence (wi), autonomy-oriented computing (aoc), complex networks analysis, and network security. department at hong kong baptist university. he was a professor and the director of school of computer science at university of windsor, canada. 
his current research interests include: autonomy-oriented computing (aoc), web intelligence (wi), and self-organizing systems and complex networks, with applications to: (i) characterizing working mechanisms that lead to emergent behavior in natural and artificial complex systems (e.g., phenomena in web science, and the dynamics of social networks and neural systems), and (ii) developing solutions to large-scale, distributed computational problems (e.g., distributed scalable scientific or social computing, and collective intelligence). prof. liu has contributed to the scientific literature in those areas, including over 250 journal and conference papers, and 5 authored research monographs, e.g., autonomy-oriented computing: from problem solving to complex systems modeling (kluwer academic/springer) and spatial reasoning and planning: geometry, mechanism, and motion (springer). prof. liu has served as the editor-in-chief of web intelligence and agent systems, an associate editor of ieee transactions on knowledge and data engineering, ieee transactions on systems, man, and cybernetics-part b, and computational intelligence, and a member of the editorial board of several other international journals. laboratory and is a professor in the department of systems and information engineering at maebashi institute of technology, japan. he is also an adjunct professor in the international wic institute. he has conducted research in the areas of knowledge discovery and data mining, rough sets and granular-soft computing, web intelligence (wi), intelligent agents, brain informatics, and knowledge information systems, with more than 250 journal and conference publications and 10 books. he is the editor-in-chief of web intelligence and agent systems and annual review of intelligent informatics, an associate editor of ieee transactions on knowledge and data engineering, data engineering, and knowledge and information systems, a member of the editorial board of transactions on rough sets. key: cord-218639-ewkche9r authors: ghavasieh, arsham; bontorin, sebastiano; artime, oriol; domenico, manlio de title: multiscale statistical physics of the human-sars-cov-2 interactome date: 2020-08-21 journal: nan doi: nan sha: doc_id: 218639 cord_uid: ewkche9r protein-protein interaction (ppi) networks have been used to investigate the influence of sars-cov-2 viral proteins on the function of human cells, laying out a deeper understanding of covid--19 and providing ground for drug repurposing strategies. however, our knowledge of (dis)similarities between this one and other viral agents is still very limited. here we compare the novel coronavirus ppi network against 45 known viruses, from the perspective of statistical physics. our results show that classic analysis such as percolation is not sensitive to the distinguishing features of viruses, whereas the analysis of biochemical spreading patterns allows us to meaningfully categorize the viruses and quantitatively compare their impact on human proteins. remarkably, when gibbsian-like density matrices are used to represent each system's state, the corresponding macroscopic statistical properties measured by the spectral entropy reveals the existence of clusters of viruses at multiple scales. overall, our results indicate that sars-cov-2 exhibits similarities to viruses like sars-cov and influenza a at small scales, while at larger scales it exhibits more similarities to viruses such as hiv1 and htlv1. 
the covid-19 pandemic, with global impact on multiple crucial aspects of human life, is still a public health threat in most areas of the world. despite the ongoing investigations aiming to find a viable cure, our knowledge of the nature of disease is still limited, especially regarding the similarities and differences it has with other viral infections. on the one hand, sars-cov-2 shows high genetic similarity to sars-cov 1 with the rise of network medicine [6] [7] [8] [9] [10] [11] , methods developed for complex networks analysis have been widely adopted to efficiently investigate the interdependence among genes, proteins, biological processes, diseases and drugs 12 . similarly, they have been used for characterizing the interactions between viral and human proteins in case of sars-cov-2 [13] [14] [15] , providing insights into the structure and function of the virus 16 and identifying drug repurposing strategies 17, 18 . however, a comprehensive comparison of sars-cov-2 against other viruses, from the perspective of network science, is still missing. here, we use statistical physics to analyze 45 viruses, including sars-cov-2. we consider the virus-human protein-protein interactions (ppi) as an interdependent system with two parts, human ppi network targeted by viral proteins. in fact, due to the large size of human ppi network, its structural properties barely change after being merged with viral components. consequently, we show that percolation analysis of such interdependent systems provides no information about the distinguishing features of viruses. instead, we model the propagation of perturbations from viral nodes through the whole system, using bio-chemical and regulatory dynamics, to obtain the spreading patterns and compare the average impact of viruses on human proteins. finally, we exploit gibbsian-like density matrices, recently introduced to map network states, to quantify the impact of viruses on the macroscopic functions of human ppi network, such as von neumann entropy. the inverse temperature β is used as a resolution parameter to perform a multiscale analysis. we use the above information to cluster together viruses and our findings indicate that sars-cov-2 groups with a number of pathogens associated with respiratory infections, including sars-cov, influenza a and human adenovirus (hadv) at the smallest scales, more influenced by local topological features. interestingly, at larger scales, it exhibits more similarity with viruses from distant families such as hiv1 and human t-cell leukemia virus type 1 (htlv1). our results shed light on the unexplored aspects of sars-cov-2, from the perspective of statistical physics of complex networks, and the presented framework opens the doors for further theoretical developments aiming to characterize structure and dynamics of virus-host interactions, as well as grounds for further experimental investigation and potentially novel clinical treatments. here, we use data regarding the viral proteins and their interactions with human proteins for 45 viruses (see methods and fig. 1) . to obtain the virus-human interactomes, we link the data to the biostr human ppi network (19, percolation of the interactomes. arguably, the simplest conceptual framework to assess how and why a networked system loses its functionality is via the process of percolation 19 . 
here, the structure of interconnected systems is modeled by a network g with n nodes, which can be fully represented by an adjacency matrix a (a ij = 1 if nodes i and j are connected, and 0 otherwise) 20 . this point of view assumes that, as a first approximation, there is an intrinsic relation between connectivity and functionality: when node removal occurs, the more capable of remaining assembled a system is, the better it will perform its tasks. hence, we have a quantitative way to assess the robustness of the system. if one wants to single out the role played by a certain property of the system, instead of selecting the nodes randomly, they can be sequentially removed following that criterion. for instance, if we want to find out the relevance of the most connected elements for the functionality, we can remove a fraction of the nodes with the largest degree 21, 22 . technically, the criterion can be any metric that allows us to rank nodes, although in practical terms topologically oriented protocols, such as degree or betweenness, are the most frequently used due to their accessibility. therefore percolation is, in effect, a topological analysis, since its input and output are based on structural information. in the past, the use of percolation has proved useful for shedding light on several aspects of protein-related networks, such as the identification of functional clusters 23 and protein complexes 24 , the verification of the quality of functional annotations 25 , or the critical properties as a function of mutation and duplication rates 26 , to name but a few. following this research line, we perform the percolation analysis on all the ppi networks to understand whether this technique brings any information that allows us to differentiate among viruses. the considered protocols are the random selection of nodes, the targeting of nodes by degree (i.e., the number of connections they have) and their removal by betweenness centrality (i.e., a measure of the likelihood of a node to lie on the information flow exchanged through the system by means of shortest paths). we apply these attack strategies and compute the resulting (normalized) size of the largest connected component s in the network, which serves as a proxy for the remaining functional part, as commented above. this way, when s is close to unity the function of the network has been scarcely impacted by the intervention, while when s is close to 0 the network can no longer be operative. the results are shown in fig. 3 . surprisingly, for each attacking protocol, we observe that the curves of the size of the largest connected component neatly collapse onto a common curve. in other words, percolation analysis completely fails at finding virus-specific discriminators. viruses do respond differently depending on the ranking used, but this is somewhat expected due to the correlation between the metrics employed and the position of the nodes in the network. we can shed some light on the similar virus-wise response to percolation by looking at the topological structure of the interactomes. despite being viruses of diverse nature and causing such different symptomatology, their overall structures show a high level of similarity when it comes to the protein-protein interactions. indeed, for every pair of viruses we find the fraction of nodes f n and the fraction of links f l that simultaneously participate in both. averaging over all pairs, we obtain f n = 0.9996 ± 0.0002 and f l = 0.9998 ± 0.0007.
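a minimal sketch (assuming networkx) of the percolation protocol just described: nodes are removed in a prescribed order and the normalized size s of the largest connected component is tracked; the random test graph is only a stand-in for an interactome.

import random
import networkx as nx

def percolation_curve(g, ranking, steps=20):
    # ranking: nodes in removal order; returns (fraction removed, s) pairs
    h = g.copy()
    n = g.number_of_nodes()
    curve = [(0.0, 1.0)]
    chunk = max(1, len(ranking) // steps)
    for k in range(0, len(ranking), chunk):
        h.remove_nodes_from(ranking[k:k + chunk])
        removed = min(k + chunk, len(ranking))
        if h.number_of_nodes() == 0:
            curve.append((removed / n, 0.0))
            break
        giant = max(nx.connected_components(h), key=len)
        curve.append((removed / n, len(giant) / n))
    return curve

g = nx.erdos_renyi_graph(2000, 0.002, seed=0)
by_degree = [v for v, _ in sorted(g.degree, key=lambda kv: kv[1], reverse=True)]
by_random = random.Random(0).sample(list(g.nodes), g.number_of_nodes())
print(percolation_curve(g, by_degree[:1000])[-1], percolation_curve(g, by_random[:1000])[-1])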
that means that the interactomes are structurally very similar, and so are the dismantling rankings, which explains why the percolation curves collapse. if purely topological analysis is not able to differentiate between viruses, then we need more convoluted, non-standard techniques to tackle this problem. in the next sections we will employ these alternative approaches. analysis of perturbation propagation. ppi networks represent the large-scale set of interacting proteins. in the context of regulatory networks, edges encode dependencies for activation/inhibition with transcription factors. ppi edges can also represent the propensity for pairwise binding and the formation of complexes. the analytical treatment of these processes is described via bio-chemical dynamics 27, 28 and regulatory dynamics 29 . in bio-chemical (bio-chem) dynamics, these interactions are proportional to the product of the concentrations of the reactants, thus resulting in a second-order interaction forming dimers. the protein concentration x i (i = 1, 2, ..., n ) also depends on its degradation rate b i and on the amount of protein synthesized at a rate f i . the resulting law of mass action, dx i /dt = f i − b i x i − Σ j a ij x i x j , summarizes the formation of complexes and the degradation/synthesis processes that occur in a ppi. regulatory dynamics can instead be characterized by an interaction with neighbors described by a hill function that saturates at unity, dx i /dt = −b i x i + Σ j a ij x j h /(1 + x j h ) (michaelis-menten, m-m). in the context of the study of signal propagation, recent works have introduced the definition of the network global correlation function 30, 31 as g ij = (dx i /x i )/(dx j /x j ). ultimately, the idea is that a constant perturbation brings the system to a new steady state x i → x i + dx i , and dx i /x i quantifies the magnitude of the response of node i to the perturbation in j. this also allows the definition of measures such as the impact 31 of a node, i i = Σ j a ij g ji , describing the response of i's neighbors to its perturbation. interestingly, it was found that these measures can be described by power laws of the degrees (i i ≈ k i φ ), with universal exponents that depend on the dynamics of the underlying odes, allowing the interplay between topology and dynamics to be described effectively. in our case, φ = 0 for both processes, therefore the perturbation from i has the same impact on its neighbors, regardless of its degree. we exploit the definition of g ij to define the vector g v of perturbations of concentrations induced by the interaction with the virus v, where, following 31 , the k-th entry is given by g v (k) = Σ i∈v g ki . the steps we follow to assess the impact of the viral nodes on the human interactome via the microscopic dynamics are described next. we first obtain the equilibrium state of the human interactome by numerical integration of the equations above. then, for each virus, we compute the system response to perturbations starting from all nodes i ∈ v, which is eventually encoded in g v . finally, we repeat these steps for both the bio-chem and m-m models. the amount of correlation generated is a measure of the impact of the virus on the interactome equilibrium state. we estimate it as the 1-norm of the correlation vectors, ‖g v ‖ 1 = Σ i |g v (i)|, which we refer to as cumulative correlation. the results are presented in fig. 4 . by allowing for multiple sources of perturbation, the biggest responses in magnitude will come from the direct neighbors of these sources, making them the dominant contributors to ‖g v ‖ 1 .
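the perturbation procedure just outlined can be sketched as follows for the reconstructed bio-chem dynamics dx i /dt = f i − b i x i − Σ j a ij x i x j ; the choice of a small, sustained increase of the synthesis rates of the virus-targeted proteins is an illustrative assumption, and the random scale-free test graph only stands in for the human interactome.

import numpy as np
import networkx as nx
from scipy.integrate import solve_ivp

def biochem_rhs(t, x, a, f, b):
    # dx_i/dt = f_i - b_i x_i - sum_j a_ij x_i x_j
    return f - b * x - x * (a @ x)

def steady_state(a, f, b, x0=None, t_max=200.0):
    n = a.shape[0]
    x0 = np.full(n, 0.1) if x0 is None else x0
    sol = solve_ivp(biochem_rhs, (0, t_max), x0, args=(a, f, b), rtol=1e-8)
    return sol.y[:, -1]

def cumulative_correlation(a, f, b, viral_targets, df=0.05):
    # ||g_v||_1 for a sustained perturbation of the synthesis rates of the targeted proteins
    x_eq = steady_state(a, f, b)
    f_pert = f.copy()
    f_pert[viral_targets] *= (1.0 + df)
    x_new = steady_state(a, f_pert, b, x0=x_eq)
    g = np.abs((x_new - x_eq) / x_eq)       # relative responses dx_i / x_i
    return float(np.sum(g))

g = nx.barabasi_albert_graph(300, 3, seed=2)
a = nx.to_numpy_array(g)
n = a.shape[0]
f, b = np.ones(n), np.ones(n)
print(cumulative_correlation(a, f, b, viral_targets=[0, 1, 2]))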
with i i not being dependent on the source degree, these results support the idea that, with these specific forms of dynamical processes on top of the interactome, the overall impact of a perturbation generated by a virus is proportional to the number of human proteins it interacts with. the results shown in fig. 5 highlight that the propagation patterns strongly depend on the sources (i.e., the affected nodes v), and strong similarities will generally be found within the same family and for viruses that share common impacted proteins in the interactome. conversely, families and viruses with small (or null) overlap in the sources exhibit low similarity and are not sharply distinguishable. to cope with this, we adopt a rather macroscopic view of the interactomes in the next section. analysis of spectral information. we have shown that the structural properties of the human ppi network do not significantly change after being targeted by viruses. percolation analysis seems ineffective in distinguishing the specific characteristics of virus-host interactomes while, in contrast, the propagation of biochemical signals from viral components into the human ppi network has been shown to be successful in assessing the viruses in terms of their average impact on human proteins. remarkably, the propagation patterns can be used to hierarchically cluster the viruses, although some of the clusters are highly dependent on the choice of threshold (fig. 5 ). in this section, we describe each interactome by a gibbsian-like density matrix ρ(β, g) = e −βl /z(β, g), which is defined in terms of the propagator of a diffusion process on top of the network, normalized by the partition function z(β, g) = tr e −βl , and which has an elegant physical meaning in terms of dynamical trapping for diffusive flows 38 . consequently, the counterpart of the massieu function (also known as free entropy) in statistical physics can be defined for networks as φ(β, g) = log z(β, g). note that a low value of the massieu function indicates high information flow between the nodes. the von neumann entropy, s(β, g) = −tr[ρ(β, g) log ρ(β, g)], can be directly derived from the massieu function and encodes the information content of graph g. finally, the difference between the von neumann entropy and the massieu function follows s(β, g) − φ(β, g) = βu(β, g), where u(β, g) is the counterpart of the internal energy in statistical physics. in the following, we use the above quantities to compare the different virus-host interactomes. in fact, as the number of viral nodes is much smaller than the number of human proteins, we model each virus-human interdependent system as a perturbation of the large human ppi network g (see fig. 6 ). after considering the viral perturbations, the von neumann entropy, the massieu function and the energy of the human ppi network change slightly. the magnitude of such perturbations can be calculated as explained in fig. 6 for the von neumann entropy and the massieu function, while the perturbation in internal energy follows from their difference, βδu(β, g) = δs(β, g) − δφ(β, g), according to the relation above. the parameter β encodes the propagation time in the diffusion dynamics, or equivalently an inverse temperature from a thermodynamic perspective, and is used as a resolution parameter tuned to characterize macroscopic perturbations due to node-node interactions at different scales, from short to long range 40 . based on the perturbation values and using the k-means algorithm, a widely adopted clustering technique, we group the viruses together (see fig. 6 , tab. 1 and tab. 2).
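the spectral quantities used above can be computed from the laplacian spectrum, as in the sketch below; the way the viral perturbation is emulated here (attaching a few toy viral nodes to a random scale-free stand-in for the human ppi network) is an assumption for illustration, not the construction used in fig. 6.

import numpy as np
import networkx as nx
from scipy.linalg import eigh

def spectral_quantities(g, beta):
    # density matrix rho = exp(-beta L) / Z; its eigenvalues are exp(-beta lambda_i) / Z
    lap = nx.laplacian_matrix(g).toarray().astype(float)
    lam = eigh(lap, eigvals_only=True)
    w = np.exp(-beta * lam)
    z = w.sum()
    p = w / z
    s = -np.sum(p[p > 0] * np.log(p[p > 0]))   # von neumann entropy
    phi = np.log(z)                            # massieu function (free entropy)
    return s, phi

g = nx.barabasi_albert_graph(500, 3, seed=3)   # stand-in for the human ppi network
gv = g.copy()
for k, target in enumerate(range(5)):          # toy viral proteins attached to 5 targets
    gv.add_edge("viral_%d" % k, target)

for beta in (0.1, 1.0, 10.0):
    s0, phi0 = spectral_quantities(g, beta)
    s1, phi1 = spectral_quantities(gv, beta)
    du = (s1 - s0) - (phi1 - phi0)             # beta * delta U = delta S - delta Phi
    print("beta=%.1f  dS=%+.4f  dPhi=%+.4f  beta*dU=%+.4f" % (beta, s1 - s0, phi1 - phi0, du))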
at small scales, sars-cov-2 appears in a cluster with a number of other viruses causing respiratory illness, including sars-cov, influenza a and hadv. however, at larger scales, it exhibits more similarity with hiv1, htlv1 and hpv type 16. table 1 : the summary of the clustering results at small scales (β ≈ 1 in fig. 6 ) is presented. remarkably, at this scale, sars-cov-2 groups with a number of respiratory pathogens including sars-cov, influenza a and hadv. table 2 : the summary of the clustering results at larger scales (fig. 6 ) is presented. here, sars-cov-2 shows higher similarity to hiv1, htlv1 and hpv type 16. comparing covid-19 against other viral infections is still a challenge. in fact, various approaches can be adopted to characterize and categorize the complex nature of viruses and their impact on human cells. in this study, we used an approach based on statistical physics to analyze virus-human interactomes. overview of the data set. it is worth noting that, to build the covid-19 virus-host interactions, a different procedure had to be used. in fact, since sars-cov-2 is too novel, we could not find its ppi in the string repository and we have considered, instead, the targets experimentally observed by gordon et al. 13 , consisting of 332 human proteins. the remainder of the procedure used to build the virus-host ppi is the same as before. see fig. 1 for summary information about each virus. a key enzyme involved in the process of prostaglandin biosynthesis; ifih1 (interferon induced with helicase c domain 1, ncbi gene id: 64135), encoding mda5, an intracellular sensor of viral rna responsible for triggering the innate immune response: it is fundamental for activating the pro-inflammatory response that includes interferons, and for this reason it is targeted by several virus families, which are able to hinder the innate immune response by evading its specific interferon response. contributions. ag, oa and sb performed numerical experiments and data analysis. mdd conceived and designed the study. all authors wrote the manuscript.
the proximal origin of sars-cov-2 the genetic landscape of a cell epidemiologic features and clinical course of patients infected with sars-cov-2 in singapore a trial of lopinavir-ritonavir in adults hospitalized with severe covid-19 remdesivir, lopinavir, emetine, and homoharringtonine inhibit sars-cov-2 replication in vitro network medicine: a network-based approach to human disease focus on the emerging new fields of network physiology and network medicine human symptoms-disease network network medicine approaches to the genetics of complex diseases the human disease network the multiplex network of human diseases network medicine in the age of biomedical big data a sars-cov-2 protein interaction map reveals targets for drug repurposing structural genomics and interactomics of 2019 wuhan novel coronavirus, 2019-ncov, indicate evolutionary conserved functional regions of viral proteins structural analysis of sars-cov-2 and prediction of the human interactome fractional diffusion on the human proteome as an alternative to the multi-organ damage of sars-cov-2 network medicine framework for identifying drug repurposing opportunities for covid-19 predicting potential drug targets and repurposable drugs for covid-19 via a deep generative model for graphs network robustness and fragility: percolation on random graphs introduction to percolation theory error and attack tolerance of complex networks breakdown of the internet under intentional attack identification of functional modules in a ppi network by clique percolation clustering identifying protein complexes from interaction networks based on clique percolation and distance restriction percolation of annotation errors through hierarchically structured protein sequence databases infinite-order percolation and giant fluctuations in a protein interaction network computational analysis of biochemical systems a practical guide for biochemists and molecular biologists propagation of large concentration changes in reversible protein-binding networks an introduction to systems biology quantifying the connectivity of a network: the network correlation function method universality in network dynamics the statistical physics of real-world networks classical information theory of networks the von neumann entropy of networks structural reducibility of multilayer networks spectral entropies as information-theoretic tools for complex network comparison complex networks from classical to quantum enhancing transport properties in interconnected systems without altering their structure scale-resolved analysis of brain functional connectivity networks with spectral entropy unraveling the effects of multiscale network entanglement on disintegration of empirical systems under revision string v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets biogrid: a general repository for interaction datasets the biogrid interaction database: 2019 update gene help: integrated access to genes of genomes in the reference sequence collection competing financial interests. the authors declare no competing financial interests.acknowledgements. the authors thank vera pancaldi for useful discussions. 
key: cord-191876-03a757gf authors: weinert, andrew; underhill, ngaire; gill, bilal; wicks, ashley title: processing of crowdsourced observations of aircraft in a high performance computing environment date: 2020-08-03 journal: nan doi: nan sha: doc_id: 191876 cord_uid: 03a757gf as unmanned aircraft systems (uass) continue to integrate into the u.s. national airspace system (nas), there is a need to quantify the risk of airborne collisions between unmanned and manned aircraft to support regulation and standards development. both regulators and standards developing organizations have made extensive use of monte carlo collision risk analysis simulations using probabilistic models of aircraft flight. we've previously determined that the observations of manned aircraft by the opensky network, a community network of ground-based sensors, are appropriate to develop models of the low altitude environment. this works overviews the high performance computing workflow designed and deployed on the lincoln laboratory supercomputing center to process 3.9 billion observations of aircraft. we then trained the aircraft models using more than 250,000 flight hours at 5,000 feet above ground level or below. a key feature of the workflow is that all the aircraft observations and supporting datasets are available as open source technologies or been released to the public domain. the continuing integration of unmanned aircraft system (uas) operations into the national airspace system (nas) requires new or updated regulations, policies, and technologies to maintain safe and efficient use of the airspace. to help achieve this, regulatory organizations such as the federal aviation administration (faa) and the international civil aviation organization (icao) mandate the use of collision avoidance systems to minimize the risk of a midair collision (mac) between most manned aircraft (e.g. 14 cfr § 135.180). monte carlo safety simulations and statistical encounter models of aircraft behavior [1] have enabled the faa to develop, assess, and certify systems to mitigate the risk of airborne collisions. these simulations and models are based on observed aircraft behavior and have been used to design, evaluate, and validate collision avoidance systems deployed on manned aircraft worldwide [2] . for assessing the safety of uas operations, the monte carlo simulations need to determine if the uas would be a hazard to manned aircraft. therefore there is an inherent need for models that represent how manned aircraft behave. while various models have been developed for decades, many of these models were not designed to model manned aircraft behavior where uas are likely to operate [3] . in response, new models designed to characterize the low altitude environment are required. in response, we previously identified and determined that the opensky network [4] , a community network of ground-based sensors that observe aircraft equipped with automatic dependent surveillance-broadcast (ads-b) out, would provide sufficient and appropriate data to develop new models [5] . ads-b was initially developed and standardized to enable aircraft to leverage satellite signals for precise tracking and navigation. [6, 7] . however, the previous work did not train any models. this work considered only how aircraft, observed by the opensky network, within the united states and flying between 50 and 5,000 feet above ground level (agl) or less. thus this work does not consider all aircraft, as not all aircraft are equipped with ads-b. 
the scope of this work was informed by the needs of the faa uas integration office, along with the activities of the standards development organizations of astm f38, rtca sc-147, and rtca sc-228. initial scoping discussions were also informed by the uas excom science and research panel (sarp), an organization chartered under the excom senior steering group; however, the sarp did not provide a final review of the research. we focused on two objectives identified by the aviation community to support integration of uas into the nas: first, to train a generative statistical model of how manned aircraft behave at low altitudes; and second, to estimate the relative frequency with which a uas would encounter a specific type of aircraft. these contributions are intended to support current and expected uas safety system development and evaluation and facilitate stakeholder engagement to refine our contributions for policy-related activities. the primary contribution of this paper is the design and evaluation of the high performance computing (hpc) workflow to train models and complete analyses that support the community's objectives. refer to previous work [5, 8] for use of the results from this workflow. this paper focuses primarily on the use of the lincoln laboratory supercomputing center (llsc) [9] to process billions of aircraft observations in a scalable and efficient manner. we first briefly overview the storage and compute infrastructure of the llsc. the llsc and its predecessors have been widely used to process aircraft tracks and support aviation research for more than a decade. the llsc high-performance computing (hpc) systems have two forms of storage: distributed and central. distributed storage is composed of the local storage on each of the compute nodes, and this storage is typically used for running database applications. central storage is implemented using the open-source lustre parallel file system on a commercial storage array. lustre provides high performance data access to all the compute nodes, while maintaining the appearance of a single filesystem to the user. the lustre filesystem is used in most of the largest supercomputers in the world. specifically, the block size of lustre is 1 mb; thus any file created on the llsc will take at least 1 mb of space. the processing described in this paper was conducted on the llsc hpc system [9]. the system consists of a variety of hardware platforms, but we specifically developed, executed, and evaluated our software using compute nodes based on dual socket haswell (intel xeon e5-2683 v3 @ 2.0 ghz) processors. each haswell processor has 14 cores and can run two threads per core with the intel hyper-threading technology. each haswell node has 256 gb of memory. this section describes the high performance computing workflow and the results for each step. a shell script was used to download the raw data archives for a given monday from the opensky network. data was organized by day and hour. both the opensky network and our architecture create a dedicated directory for a given day, such as 2020-06-22. after extracting the raw data archives, up to 24 comma separated value (csv) files populate the directory; each hour in utc time corresponds to a specific file. however, there are a few cases where not every hour of the day was available. the files contain all the abstracted observations of all aircraft for that given hour. for a specific aircraft, observations are updated at least every ten seconds.
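as a rough illustration of this download-and-extract step, the following python sketch mirrors what such a shell script does. the archive url pattern, the tar packaging, and the file names are assumptions made for illustration and are not necessarily the exact opensky network layout.

```python
import tarfile
import urllib.request
from pathlib import Path

# hypothetical url pattern for the hourly opensky state-vector archives;
# the real layout may differ, so treat this as a placeholder.
URL = "https://opensky-network.org/datasets/states/{day}/{hour:02d}/states_{day}-{hour:02d}.csv.tar"

def download_day(day: str, out_root: Path = Path("raw")) -> None:
    """Download and extract up to 24 hourly csv archives for one day (utc)."""
    day_dir = out_root / day                     # e.g. raw/2020-06-22
    day_dir.mkdir(parents=True, exist_ok=True)
    for hour in range(24):
        archive = day_dir / f"{hour:02d}.csv.tar"
        try:
            urllib.request.urlretrieve(URL.format(day=day, hour=hour), archive)
        except OSError:
            # some hours are simply not available; skip them
            continue
        with tarfile.open(archive) as tar:
            tar.extractall(day_dir)              # one csv per hour after extraction

if __name__ == "__main__":
    download_day("2020-06-22")
```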
for this paper, we downloaded 85 mondays spanning february 2018 to june 2020, totaling 2002 hours. the size of each hourly file was dependent upon the number of active sensors that hour, the time of day, the quantity of aircraft operations, and the diversity of the operations. across a given day, the hourly files can range in size by hundreds of megabytes, with the maximum file size between 400 and 600 megabytes. together, all the hourly files for a given day currently require about 5-9 gigabytes of storage. we observed that on average the daily storage requirement for 2019 was greater than for 2018. parsing, organizing, and aggregating the raw data for a specific aircraft required high performance computing resources, especially when organizing the data at scale. many aviation use cases require organizing data and building a track corpus for each specific aircraft. yet it was unknown how many unique aircraft were observed in a given hour and whether a given hourly file had any observations for a specific aircraft. to efficiently organize the raw data, we needed to address these unknowns. we identified unique aircraft by parsing and aggregating the national aircraft registries of the united states, canada, the netherlands, and ireland. registries were processed for each individual year for 2018-2020. all registries specified the registered aircraft's type (e.g. rotorcraft, fixed wing single-engine, etc.), the registration expiration date, and a globally unique hex identifier of the transponder equipped on the aircraft. this identifier is known as the icao 24-bit address [10], with 2^24 - 2 unique addresses available worldwide. some of the registries also specified the maximum number of seats for each aircraft. using the registries, we created a four tier directory structure to organize the data. the highest level directory corresponds to the year, such as 2019. the next level was organized by twelve general aircraft types, such as fixed wing single-engine, glider, or rotorcraft. the third directory level was based on the number of seats, with each directory representing a range of seats. a dedicated directory was created for aircraft with an unknown number of seats. the lowest level directory was based on the sorted unique icao 24-bit addresses. for each seat-based directory, up to 1000 icao 24-bit address directories are created. additionally, to address the fact that the four aircraft registries do not contain all registered aircraft globally, a second level directory titled "unknown" was created and populated with directories corresponding to each hour of data. the top and bottom level directories remained the same as for the known aircraft types. the bottom directories for unknown aircraft are generated at runtime. this hierarchy ensures that there are no more than 1000 directories per level, as recommended by the llsc, while organizing the data to easily enable comparative analysis between years or different types of aircraft. the hierarchy was also sufficiently deep and wide to support efficient parallel i/o operations across the entire structure. for example, a full directory path for the first three tiers of the directory hierarchy could be "2020/rotorcraft/seats_001_010/". this directory would contain all the known unique icao 24-bit addresses for rotorcraft with 1-10 seats in 2020. within this directory would be up to 1000 directories, such as "a00c12_a00d20" or "a00d20_a00ecf". this lowest level directory would be used to store all the organized raw data for aircraft whose icao 24-bit address falls within that range. the first hex value was inclusive, but the second hex value was not.
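a minimal sketch of how such a four-tier path could be constructed for a registered aircraft is shown below. the seat bucketing, the way the icao 24-bit addresses are split into at most 1000 ranges, and the directory names are illustrative assumptions, not the exact implementation used on the llsc.

```python
from pathlib import Path

MAX_DIRS = 1000  # llsc guidance: keep at most 1000 directories per level

def seat_bucket(seats):
    """Map a seat count (or None) onto a seat-range directory name."""
    if seats is None:
        return "seats_unknown"
    lo = ((seats - 1) // 10) * 10 + 1            # e.g. 7 -> 1, 14 -> 11
    return f"seats_{lo:03d}_{lo + 9:03d}"

def address_bucket(icao24: str, all_addresses: list) -> str:
    """Bucket a hex icao 24-bit address into one of up to MAX_DIRS sorted ranges."""
    addrs = sorted(a.lower() for a in all_addresses)
    per_dir = max(1, -(-len(addrs) // MAX_DIRS))  # ceiling division
    idx = addrs.index(icao24.lower()) // per_dir
    lo = addrs[idx * per_dir]
    hi = addrs[min((idx + 1) * per_dir, len(addrs)) - 1]
    return f"{lo}_{hi}"

def aircraft_dir(year, ac_type, seats, icao24, all_addresses):
    """Build the four-tier path, e.g. 2020/rotorcraft/seats_001_010/a00c12_a00d20."""
    return Path(str(year)) / ac_type / seat_bucket(seats) / address_bucket(icao24, all_addresses)
```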
with a directory structure established, each hourly file was then loaded into memory, parsed, and lightly processed. observations with incomplete or missing position reports were removed, along with any observations outside a user-defined geographic polygon. the default polygon, illustrated by figure 1, was a convex hull with a buffer of 60 nautical miles around approximately north america, central america, the caribbean, and hawaii. units were also converted to u.s. aviation units. the country polygons were sourced from natural earth, a public domain map dataset [11]. specifically, for the 85 mondays across the three years, 2214 directories were generated across the first three tiers of the hierarchy and 802,159 directories were created in total across the entire hierarchy. of these, 770,661 directories were non-empty. the majority of the directories were created within the unknown aircraft type directories. as overviewed by tables 1 and 2, about 3.9 billion raw observations were organized, with about 1.4 billion observations available after filtering. there was a 15% increase in observations per hour from 2018 to 2019. however, a 50% decrease in the average number of observations per hour was observed when comparing 2020 to 2019; this could be attributed to the covid-19 pandemic. this worldwide incident sharply curtailed travel, especially travel between countries. this reduction in travel was reflected in the amount of data filtered using the geospatial polygon. in 2018 and 2019, about 41-44% of observations were filtered based on their location; however, only 27% of observations were filtered from march to june 2020. conversely, the amount of observations removed due to quality control did not vary significantly, as 26%, 20%, and 25% were removed for 2018, 2019, and 2020, respectively. these results were generated using 512 cpus across 2002 tasks, where each task corresponded to a specific hourly file. tasks were uniformly distributed across cpus; a dynamic self-scheduling parallelization approach was not implemented. each task required on average 626 seconds to execute, with a median time of 538 seconds. the maximum and minimum times to complete a task were 2153 and 23 seconds. across all tasks, about 348 hours of total compute time was required to parse and filter the 85 days of data. it is expected that if the geospatial filtering were relaxed and observations from europe were not removed, the compute time would increase due to increased demands on creating and writing hourly files for each aircraft. since files were created for every hour for each unique aircraft, tens of millions of small files less than 1 megabyte in size were created. this was problematic, as small files typically use a single object storage target, thus serializing access to the data. additionally, in a cluster environment, hundreds or thousands of concurrent, parallel processes accessing small files can lead to significantly large random i/o patterns for file access and generate massive amounts of network traffic. this results in increased latency for file access and higher network traffic, significantly slows down i/o, and consequently degrades overall application performance. while this approach to data organization may provide acceptable performance on a laptop or desktop computer, it was unsuitable for use in a shared, distributed hpc system.
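returning to the geospatial filter described earlier in this step, a simplified approximation of it can be sketched with geopandas and shapely as below. the natural earth file path, the "NAME" column, the country subset, and the lat/lon column names are assumptions, and the 60 nautical mile buffer is approximated in degrees (roughly 1 degree of latitude per 60 nm) rather than in a distance-preserving projection.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
from shapely.prepared import prep

# assumed local copy of the natural earth admin-0 countries shapefile
WORLD = gpd.read_file("ne_110m_admin_0_countries.shp")
REGION_NAMES = {"United States of America", "Canada", "Mexico", "Cuba", "Panama"}  # illustrative subset

def build_filter_polygon(buffer_nm: float = 60.0):
    """Convex hull of the selected countries, buffered by ~60 nm (approximated in degrees)."""
    region = WORLD[WORLD["NAME"].isin(REGION_NAMES)].unary_union
    return region.convex_hull.buffer(buffer_nm / 60.0)   # 1 deg latitude ~ 60 nm

def filter_observations(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing positions or positions outside the buffered hull."""
    poly = prep(build_filter_polygon())
    df = df.dropna(subset=["lat", "lon"])
    keep = [poly.contains(Point(lon, lat)) for lat, lon in zip(df["lat"], df["lon"])]
    return df[keep]
```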
in response, we created zip archives for each of the bottom directories. in a new parent directory, we replicated the first three tiers of the directory hierarchy from the previous step. then, instead of creating directories based on the icao 24-bit addresses, we archived each directory of hourly csv files from the previous organization step. we then removed the hourly csv files from storage. this was achieved using llmapreduce [12], with a task created for each of the 770,661 non-empty bottom level directories. similar to the previous organization step, all tasks were completed in a few hours, but with no optimization for load balancing. the performance of this step could be improved by distributing tasks based on the number of files in the directories or the estimated size of the output archive. a key advantage of archiving the organized data is that the archives can be updated with new data as it becomes available. if the geospatial filtering parameters and aircraft registry data do not change, only new opensky data needs to be organized. once organized into individual csv files, llmapreduce can be used again to update the existing archives. this substantially reduces the computational and storage requirements to process new data. the archived data can now be segmented, have outliers removed, and be interpolated. additionally, the above ground level altitude was calculated, the airspace class was identified, and dynamic rates (e.g. vertical rate) were calculated. we also split the raw data into track segments based on unique position updates and the time between updates. this ensures that each segment does not include significantly interpolated or extrapolated observations. track segments with fewer than ten points are removed. figure 2 illustrates the track segments for an faa-registered fixed wing multi-engine aircraft from march to june 2020. note that segment length can vary from tens to hundreds of nautical miles. track segment length was dependent upon the aircraft type, the availability of active opensky network sensors, and nearby terrain. however, the ability to generate track segments that span multiple states represents a substantial improvement over previous processing approaches for the development of aircraft behavior models. then, for each segment, we detect altitude outliers using a 1.5 scaled median absolute deviations approach and smooth the track using a gaussian-weighted moving average filter with a 30-second time window. dynamic rates, such as acceleration, are calculated using a numerical gradient. outliers are then detected and removed based on these rates. outlier thresholds were based on aircraft type; for example, speeds greater than 250 knots were considered outliers for rotorcraft, but fixed wing multi-engine aircraft had a threshold of 600 knots. the tracks were then interpolated to a regular one second interval. lastly, we estimated the above ground level altitude using digital elevation models. this altitude estimation was the most computationally intensive component of the entire workflow. it consists of loading into memory and interpolating srtm3 or noaa globe [13] digital elevation models (dems) to determine the elevation for each interpolated track segment position. to reduce the computational load prior to processing the terrain data, a c++ based polygon test was used to identify which track segment positions are over the ocean, as defined by natural earth data. points over the ocean are assumed to have an elevation of 0 feet mean sea level, and their elevation is not estimated using the dems.
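the outlier removal, smoothing, and interpolation described above can be sketched with numpy, pandas, and scipy as below. the 1.5 scaling of the median absolute deviation and the ten-point minimum come from the text, while the column names, the gaussian sigma within the 30-second window, and the simplified order of operations (outlier removal, resampling to one second, then smoothing) are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.ndimage import gaussian_filter1d

def mad_outliers(x: np.ndarray, scale: float = 1.5) -> np.ndarray:
    """Boolean mask of outliers based on a scaled median absolute deviation."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return np.abs(x - med) > scale * mad

def clean_segment(seg: pd.DataFrame) -> pd.DataFrame:
    """Remove altitude outliers, resample to 1 s, and apply a ~30 s gaussian smoother."""
    seg = seg.set_index(pd.to_timedelta(seg["time_s"], unit="s")).sort_index()
    seg = seg[~mad_outliers(seg["alt_ft"].to_numpy())]
    if len(seg) < 10:                       # segments without ten points are removed
        return seg.iloc[0:0]
    out = seg[["lat", "lon", "alt_ft", "speed_kt"]].resample("1s").mean().interpolate("linear")
    # sigma = 5 s is an assumption; the text only states a 30-second window
    out["alt_ft"] = gaussian_filter1d(out["alt_ft"].to_numpy(), sigma=5.0)
    out["vrate_fpm"] = np.gradient(out["alt_ft"].to_numpy()) * 60.0   # per-second gradient -> ft/min
    return out
```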
for the 85 days of organized data, approximately 900,000,000 interpolated track segment positions were generated. for each aircraft in a given year, a single csv was generated containing all the computed segments. in total, across the three years, 619,337 files were generated. as these files contained significantly more rows and columns than when organizing the raw data, the majority of these final files were greater than 1 mb in size. the output of this step did not face any significant storage block size challenges. similar to the previous step, tasks were created based on the bottom tier of the directory hierarchy. specifically, for processing, parallel tasks were created for each archive. during processing, archives were extracted to a temporary directory while the final output was stored in standard memory. given the processed data, this section overviews two applications of how to exploit and disseminate the data to inform and support the aviation safety community. as the aircraft type was identified when organizing the raw data, it was a straightforward task to estimate the observed distribution of aircraft types per hour. these distributions are not reflective of all aircraft operations in the united states, as not all aircraft are observed by the opensky network. the distributions were also calculated independently for each aircraft type, so the yearly (row) percentages may not sum to 100%. furthermore, the relatively low percentage of unknown aircraft was due to the geospatial filtering when organizing the raw data. if the same aircraft registries were used but the filtering was changed to only include tracks in europe, the percentage of unknown aircraft would likely rise significantly. this analysis can be extended by identifying specific aircraft manufacturers and models, such as the boeing 777. however, the manufacturer and model information are not consistent within an aircraft registry nor across different registries. for example, entries of "cessna 172," "textron cessna 172," and "textron c172" all refer to the same aircraft model. one possible explanation for the differences between entries is that cessna used to be an independent aircraft manufacturer and was eventually acquired by textron. depending on the year of registration, the name of the aircraft may differ, but the size and performance of the aircraft remain constant. since over 300,000 aircraft with unique icao 24-bit addresses were identified annually across the aircraft registries, parsing and organizing the aircraft models can be formulated as a traditional natural language processing problem. parsing the aircraft registries differs from the common problem of parsing aviation incident or safety reports [14, 15, 16], due to the reduced word count and the structured format of the registries. future work will focus on using fuzzy string matching to identify similar aircraft.
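as an illustration of the fuzzy matching mentioned for future work, the short sketch below groups registry entries that likely refer to the same airframe using only the python standard library. the similarity threshold, the alias stopword list, and the normalization rules are assumptions, not the approach actually adopted.

```python
from difflib import SequenceMatcher

STOPWORDS = {"textron", "inc", "company"}   # manufacturer aliases to ignore (illustrative)

def normalize(entry: str) -> str:
    """Lowercase, drop punctuation and known alias tokens, collapse whitespace."""
    tokens = [t.strip(".,-") for t in entry.lower().split()]
    return " ".join(t for t in tokens if t and t not in STOPWORDS)

def same_model(a: str, b: str, threshold: float = 0.85) -> bool:
    """Heuristic: two registry entries refer to the same model if similar enough."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

entries = ["Cessna 172", "Textron Cessna 172", "Textron C172", "Boeing 777-300ER"]
groups = []
for e in entries:
    for g in groups:
        if same_model(e, g[0]):
            g.append(e)
            break
    else:
        groups.append([e])

# "Cessna 172" and "Textron Cessna 172" cluster together; "Textron C172" would need
# extra normalization (e.g. expanding "c172") before it joins them, which illustrates
# why registry harmonization is non-trivial.
print(groups)
```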
for many aviation safety studies, manned aircraft behavior is represented using mit lincoln laboratory encounter models. each encounter model is a bayesian network, a generative statistical model that mathematically represents aircraft behavior during close or safety critical encounters, such as near midair collisions. the development of the modern models started in 2008 [1], with significant updates in 2013 [17] and 2018 [18]. all the models were trained using the llsc [9] or its predecessors. the most widely used of these models were trained using observations collected by ground-based secondary surveillance radars from the 84th radar evaluation squadron (rades) network. aircraft observations by the rades network are based on mode 3a/c, an identification friend or foe technology that provides less metadata than ads-b. notably, aircraft type or model cannot be explicitly correlated or identified with specific aircraft tracks. instead, we filtered the rades observations based on the flying rules reported by the aircraft. this type of filtering is not unique to the rades data; it is also supported by the opensky network data. additionally, due to the performance of the rades sensors, we filtered out any observations below 500 feet agl due to position uncertainties associated with radar time of arrival measurements. observations of ads-b equipped aircraft by the opensky network differ because ads-b enables aircraft to broadcast their own estimate of their location, which is often based on precise gnss measurements. the improved position reporting of ads-b enabled the new opensky network-based models to be trained with an altitude floor of 50 feet agl, instead of 500. specifically, three new statistical models of aircraft behavior were trained, one for each of the aircraft types of fixed wing multi-engine, fixed wing single-engine, and rotorcraft. a key advantage of these models is the data reduction and dimensionality reduction. a model was created for each of the three aircraft types and stored as a human readable text file. each file requires approximately just 0.5 megabytes. this is a significant reduction from the hundreds of gigabytes used to store the original 85 days of data. table iv reports the quantity of data used to train each model. for example, the rotorcraft model was trained from about 25,000 flight hours over 85 days. however, like the rades-based models, these models do not represent the geospatial or temporal distribution of the training data. for example, a limitation of these models is that they do not inform whether more aircraft were observed in new york city than in los angeles. figures comparing the new models with the rades-based model [17] illustrate how different aircraft behave, such as rotorcraft flying relatively lower and slower than fixed wing multi-engine aircraft. also note that the rades-based model has no altitude observations below 500 feet agl, whereas 18% of the approximately 25,000 rotorcraft flight hours were observed at 50-500 feet agl. it has not been assessed whether the opensky network-based models can be used as surrogates for other aircraft types or operations. additionally, the new models do not fully supersede the existing rades-based models, as each model represents different varieties of aircraft behavior. on github.com, please refer to the mit lincoln laboratory (@mit-ll) and airspace encounter models (@airspace-encounter-models) organizations.
references: [1] airspace encounter models for estimating collision risk. [2] safety analysis of upgrading to tcas version 7.1 using the 2008 u.s. correlated encounter model.
[3] well-clear recommendation for small unmanned aircraft systems based on unmitigated collision risk. [4] bringing up opensky: a large-scale ads-b sensor network for research. [5] developing a low altitude manned encounter model using ads-b observations. [6] vision on aviation surveillance systems. [7] ads-mode s: initial system description. [8] representative small uas trajectories for encounter modeling. [9] interactive supercomputing on 40,000 cores for machine learning and data analysis. [10] mode s: an introduction and overview (secondary surveillance radar). [11] introducing natural earth data - naturalearthdata.com. [12] llmapreduce: multi-level map-reduce for high performance data analysis. [13] the global land one-kilometer base elevation (globe) digital elevation model, version 1.0. [14] using structural topic modeling to identify latent topics and trends in aviation incident reports. [15] temporal topic modeling applied to aviation safety reports: a subject matter expert review. [16] ontologies for aviation data management, ieee/aiaa 35th digital avionics systems conference (dasc). [17] uncorrelated encounter model of the national airspace system, version 2.0. [18] correlated encounter model for cooperative aircraft in the national airspace system, version 2.0.
acknowledgements: we greatly appreciate the support and assistance provided by sabrina saunders-hodge, richard lin, and adam hendrickson from the federal aviation administration. we also would like to thank fellow colleagues dr. rodney cole, matt edwards, and wes olson.
key: cord-003297-fewy8y4a authors: wang, ming-yang; liang, jing-wei; mohamed olounfeh, kamara; sun, qi; zhao, nan; meng, fan-hao title: a comprehensive in silico method to study the qstr of the aconitine alkaloids for designing novel drugs date: 2018-09-18 journal: molecules doi: 10.3390/molecules23092385 sha: doc_id: 3297 cord_uid: fewy8y4a a combined in silico method was developed to predict potential protein targets that are involved in the cardiotoxicity induced by aconitine alkaloids and to study the quantitative structure–toxicity relationship (qstr) of these compounds. for the prediction research, a protein-protein interaction (ppi) network was built from the extraction of useful information about protein interactions connected with aconitine cardiotoxicity, based on nearly a decade of literature and the string database. the software cytoscape and the pharmmapper server were utilized to screen for essential proteins in the constructed network. the calcium-calmodulin-dependent protein kinase ii alpha (camk2a) and gamma (camk2g) were identified as potential targets. to obtain a deeper insight into the relationship between the toxicity and the structure of aconitine alkaloids, the present study utilized qsar models built in sybyl software that possess internal robustness and high external predictivity. the molecular dynamics simulation carried out here has demonstrated that aconitine alkaloids possess binding stability for the receptor camk2g. in conclusion, this comprehensive method will serve as a tool for guiding structural modification of the aconitine alkaloids and lead to a better insight into the cardiotoxicity induced by compounds that have structures similar to these derivatives. the rhizomes and roots of aconitum species, a genus of the family ranunculaceae, are commonly used in the treatment of various illnesses such as collapse, syncope, rheumatic fever, joint pain, gastroenteritis, diarrhea, edema, bronchial asthma, and tumors.
they are also involved in the management of endocrine disorders such as irregular menstruation [1, 2]. however, the usefulness of this component of aconitum species is intertwined with toxicity once it is administered to a patient. so far, only a few articles have recorded the misuse of aconitine medicinals in detail, and these have reported that misuse of this medicinal can result in severe cardio- and neurotoxicity [3-7]. in our past research, it was evidenced that the aconitine component is the main active ingredient in this species' root and rhizome, and is responsible for both therapeutic and toxic effects [8]. this medicinal has been tested for cancerological and dermatological activities; its application to disease conditions proved to slow cancer tumor growth and to cure serious cases of dermatosis, and it was also found to have an effect on postoperative analgesia [9-12]. however, a previous safety study revealed that aconitine toxicity is responsible for its restriction in clinical settings. further studies are needed to explain the cause of aconitine toxicity as well as to show whether the toxicity outweighs its usefulness. a combined network analysis and in silico study was previously performed to obtain insight into the relationship between aconitine alkaloid toxicity and the aconitine structure, and it was found that the cardiotoxicity of aconitine is the primary cause of patient death. aconitine poisoning involves some pivotal proteins such as the ryanodine receptors (ryr1 and ryr2), the gap junction α-1 protein (gja1), and the sodium-calcium exchanger (slc8a1) [9-12]. however, among all existing studies about the aconitine medicinal, none has reported details of its specific binding target protein linked to toxicity. protein-protein interactions (ppis) participate in many metabolic processes occurring in living organisms, such as cellular communication, immunological response, and gene expression control [13, 14]. a systematic description of these interactions aids in the elucidation of interrelationships among targets. the targeting of ppis with small-molecule compounds is becoming an essential step in a mechanism study [14]. the present study was designed and undertaken to identify the critical protein that can affect the cardiotoxicity of aconitine alkaloids. a ppi network built with the string database describes physiological contacts of high specificity established between protein molecules, derived from computational prediction, knowledge transfer between organisms, and interactions aggregated from other databases [15]. the analysis of a ppi network is based on nodes and edges and is typically performed via cluster analysis and centrality measurements [16, 17]. in cluster analysis, highly interconnected nodes and protein target nodes are grouped into sub-graphs, and the reliability of the ppi network is assessed by the content of each sub-graph [18]. the variability in centrality measurements reflects the quantitative relationship between the protein targets and their weight in the network [18]. hence, a ppi network of protein targets related to aconitine alkaloid cardiotoxicity should enable us to find the most relevant protein for aconitine toxicity and to understand the mechanism at the network level.
in our research, the evaluation and visualization analysis of essential proteins related to cardiotoxicity in the ppi network were performed with the clusterone and cytonca plugins in cytoscape 3.5, and potential protein targets were identified in combination with the conventional integrated pharmacophore matching technology built into the pharmmapper platform. structural modification of a familiar natural product, active compound, or clinical drug is an efficient method for designing a novel drug. the main purpose of the structural modification is to reduce the toxicity of the target compound while enhancing the utility of the drug [19]. the identification of the structure-function relationship is an essential step in drug discovery and design, and the determination of 3d protein structures is a key step in identifying the internal interactions in ligand-receptor complexes. x-ray crystallography and nmr were long the only accepted techniques for determining 3d protein structures. although the 3d structures obtained by these two powerful techniques are accurate and reliable, the techniques are time-consuming and costly [20-24]. with the rapid development of structural bioinformatics and computer-aided drug design (cadd) techniques in the last decade, computational structures are becoming increasingly reliable, and the application of structural bioinformatics and cadd techniques can improve the efficiency of this process [25-34]. the ligand-based quantitative structure-toxicity relationship (qstr) and receptor-based docking technology are regarded as effective and useful tools in the analysis of structure-function relationships [35-38]. the contour maps around aconitine alkaloids generated by comparative molecular field analysis (comfa) and comparative molecular similarity index analysis (comsia) were combined with the interactions between ligand substituents and amino acids obtained from docking results to gain insight into the relationship between the structure of aconitine alkaloids and their toxicity. scoring functions were used to evaluate the docking results; the value-of-fit score in moe software reflects the binding stability and affinity of the ligand-receptor complexes. when screening for the most likely target for cardiotoxicity, the experimental data were combined with the value-of-fit score using the ndcg (normalized discounted cumulative gain); the possibility of a protein being a target of cardiotoxicity corresponds to the consistency between the docking ranks and the experimental data. since the pioneering paper entitled "the biological functions of low-frequency phonons" [39] was published in 1977, many investigations of biomacromolecules from a dynamic point of view have occurred. these studies have suggested that low-frequency (or terahertz frequency) collective motions do exist in proteins and dna [40-44]. furthermore, many important biological functions in proteins and dna and their dynamic mechanisms, such as cooperative effects [45], the intercalation of drugs into dna [42], and the assembly of microtubules [46], have been revealed by studying the low-frequency internal motions, as summarized in a comprehensive review [40]. some scientists have even applied this kind of low-frequency internal motion to medical treatments [47, 48].
investigation of the internal motion in biomacromolecules and its biological functions is deemed a "genuinely new frontier in biological physics," as announced in the mission statements of some biotech companies (see, e.g., vermont photonics). in addition to the static structural information of the ligand-receptor complex, dynamical information should also be considered in the process of drug discovery [49, 50]. finally, molecular dynamics was carried out to verify the binding affinity and stability between the aconitine alkaloids and the most likely target. this present study may be instrumental in our future studies of the synergism and attenuation of aconitine alkaloids and in the exploitation of their clinical application potential. a flowchart of the procedures in our study is shown in figure 1. figure 1. the whole framework of the comprehensive in silico method for screening potential targets and studying the quantitative structure-toxicity relationship (qstr). the 33 compounds were aligned via superimposition of the common moiety onto template compound 6. the statistical parameters for the database alignment (q2, r2, f, and see) are summarized in table 1.
the comsia model with the optimal number of 4 components presented a q 2 of 0.719, an r 2 of 0.901, an f of 157.458, and an see of 0.116, and the contributions of steric, electrostatic, hydrophobic, hydrogen bond acceptor, and hydrogen bond donor fields were 0.120, 0.204, 0.327, 0.216, and 0.133, respectively. the statistical results proved that the aconitine alkaloids qstr model of comfa and comsia under the database alignment have adequate predictability. experimental and predicted pld 50 values of both the training set and test set are shown in figure 2 , and the comfa ( figure 2a ) and comsia ( figure 2b ) model gave the correlation coefficient (r 2 ) value of 0.9698 and 0.977, respectively, which demonstrated the internal robustness and external high prediction of the qstr models. experimental and predicted pld50 values of both the training set and test set are shown in figure 2 residuals vs. leverage williams plots of the aconitine qstr models are shown in figure 3a ,b. all values of standardized residuals fall between 3σ and −3σ, and the values of leverage are less than h*, so the two models demonstrate potent extensibility and predictability. residuals vs. leverage williams plots of the aconitine qstr models are shown in figure 3a ,b. all values of standardized residuals fall between 3σ and −3σ, and the values of leverage are less than h*, so the two models demonstrate potent extensibility and predictability. under mesh (medical subject headings), a total of 491 articles (261 articles were received from web of science, and others were received from pubmed) were retrieved. after selecting cardiotoxicity-related and excluding repetitive articles, 274 articles were used to extract the correlative proteins and pathways for building a ppi network in the string server. the correlative proteins or pathways are shown in table 2 . all proteins were taken as input protein in the string database to find its direct and functional partners [51] , and proteins and its partners were then imported into the cytoscape 3.5 to generate the ppi network with 148 nodes and 872 edges ( figure 4 ). potassium voltage-gated channel h2 7 scn3a sodium voltage-gated channel type 3, 3 scn2a sodium voltage-gated channel type 2 3 scn8a sodium voltage-gated channel type 8 2 scn1a sodium voltage-gated channel type 1 2 scn4a sodium voltage-gated channel type 4 1 kcnj3 potassium inwardly-rectifying channel j3 1 during the case of screening of the essential proteins in ppi network, three centrality measurements (subgraph centrality, betweenness centrality, and closeness centrality) in cytonca were utilized to evaluate the weight of nodes. after removing the central node "ac," the centrality measurements of 147 nodes were calculated by cytonca and documented in table s1 . the top 10% of three centrality measurement values of all node are painted with a different color in figure 4a . to screen the node with the high values of each three centrality measures, nodes with three colors were overlapped and merged into sub-networks in figure 4b . under mesh (medical subject headings), a total of 491 articles (261 articles were received from web of science, and others were received from pubmed) were retrieved. after selecting cardiotoxicity-related and excluding repetitive articles, 274 articles were used to extract the correlative proteins and pathways for building a ppi network in the string server. the correlative proteins or pathways are shown in table 2 . 
under mesh (medical subject headings), a total of 491 articles were retrieved (261 articles from web of science and the rest from pubmed). after selecting cardiotoxicity-related articles and excluding repetitive ones, 274 articles were used to extract the correlative proteins and pathways for building a ppi network in the string server. the correlative proteins or pathways are shown in table 2. all proteins were taken as input proteins in the string database to find their direct and functional partners [51], and the proteins and their partners were then imported into cytoscape 3.5 to generate the ppi network with 148 nodes and 872 edges (figure 4).
table 2. proteins related to aconitine alkaloid-induced cardiotoxicity extracted from 274 articles (gene | classification | frequency):
ryr2 | ryanodine receptor 2 | 19
ryr1 | ryanodine receptor 1 | 15
gja1 | gap junction α-1 protein (connexin43) | 13
slc8a1 | sodium/calcium exchanger 1 | 11
atp2a1 | calcium transporting atpase fast twitch 1 | 9
kcnh2 | potassium voltage-gated channel h2 | 7
scn3a | sodium voltage-gated channel type 3 | 3
scn2a | sodium voltage-gated channel type 2 | 3
scn8a | sodium voltage-gated channel type 8 | 2
scn1a | sodium voltage-gated channel type 1 | 2
scn4a | sodium voltage-gated channel type 4 | 1
kcnj3 | potassium inwardly-rectifying channel j3 | 1
during the screening of essential proteins in the ppi network, three centrality measurements (subgraph centrality, betweenness centrality, and closeness centrality) in cytonca were utilized to evaluate the weight of the nodes. after removing the central node "ac," the centrality measurements of the 147 remaining nodes were calculated by cytonca and documented in table s1. the top 10% of nodes for each of the three centrality measurements are painted with a different color in figure 4a. to screen the nodes with high values of all three centrality measures, the nodes with three colors were overlapped and merged into the sub-networks in figure 4b. in the sub-networks, the voltage-gated calcium and sodium channels accounted for a large proportion, which is consistent with our clustering of the network (clusters 1, 2, and 9). all proteins in the sub-networks were combined with the prediction results of the pharmmapper server to obtain the potential targets of cardiotoxicity induced by aconitine alkaloids (figure 5a,b). in the meantime, 2v7o (camk2g) and 2vz6 (camk2a) were identified as the potential targets with the higher fit scores.
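the centrality-based screening described here can be mimicked with networkx, as in the sketch below. the edge list file name, the node label "AC", the use of unweighted centralities, and the 10% cutoff mirror the description above but are otherwise illustrative assumptions (cytonca, for example, also supports weighted variants).

```python
import networkx as nx

def top_fraction(scores: dict, fraction: float = 0.10) -> set:
    """Return the node names in the top `fraction` of a centrality score dict."""
    k = max(1, int(len(scores) * fraction))
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

# edge list exported from string/cytoscape, one "protein_a protein_b" pair per line (assumed format)
G = nx.read_edgelist("ppi_edges.txt")
G.remove_node("AC")  # drop the central query node before scoring, as in the text

subgraph = top_fraction(nx.subgraph_centrality(G))
between = top_fraction(nx.betweenness_centrality(G))
close = top_fraction(nx.closeness_centrality(G))

# essential candidates: nodes ranked in the top 10% of all three measures
essential = subgraph & between & close
print(sorted(essential))
```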
all compounds were docked into the three potential targets, and the values of ndcg are shown in table 3. the docking studies gave ndcg values of 0.9122 and 0.8503 for 2v7o and 2vz6, respectively (the detailed docking results are shown in table s2), which shows that the docking ranking for 2v7o is the most consistent with the experimental pld50; therefore, the protein 2v7o was utilized for the ligand interaction analysis.
table 3. ranking results by experimental pld50 and by fit score (compound | experimental pld50 rank | fit score rank (2v7o) | fit score rank (2vz6)):
6 | 1 | 3 | 3
20 | 2 | 1 | 12
12 | 3 | 4 | 9
1 | 4 | 2 | 4
11 | 5 | 7 | 2
14 | 6 | 8 | 13
16 | 7 | 5 | 6
7 | 8 | 17 | 15
8 | 9 | 10 | 11
27 | 10 | 23 | 17
13 | 11 | 12 | 19
15 | 12 | 11 | 5
32 | 13 | 18 | 18
5 | 14 | 22 | 8
33 | 15 | 13 | 29
21 | 16 | 15 | 1
25 | 17 | 9 | 20
22 | 18 | 25 | 25
17 | 19 | 20 | 16
28 | 20 | 24 | 30
9 | 21 | 16 | 32
29 | 22 | 32 | 14
2 | 23 | 30 | 24
30 | 24 | 31 | 26
18 | 25 | 21 | 27
10 | 26 | 26 | 21
23 | 27 | 29 | 31
31 | 28 | 33 | 7
26 | 29 | 14 | 23
4 | 30 | 28 | 33
3 | 31 | 6 | 10
19 | 32 | 27 | 28
24 | 33 | 19 | 22
ndcg | 1 | 0.9122 | 0.8503
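as a side note on how an ndcg of this kind can be computed, the sketch below scores a predicted ranking against the experimental pld50 ranking. the choice of relevance values (reversed experimental ranks) and the log2 discount are common conventions and are assumptions about, not a reproduction of, the exact calculation used here.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with the usual log2 position discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(experimental_rank, predicted_rank):
    """Compare a predicted compound ordering against the experimental ordering.

    Both arguments map compound id -> rank (1 = most toxic / best fit score).
    Relevance of a compound is taken as (n - experimental rank + 1).
    """
    n = len(experimental_rank)
    rel = {c: n - r + 1 for c, r in experimental_rank.items()}
    predicted_order = sorted(predicted_rank, key=predicted_rank.get)
    ideal_order = sorted(experimental_rank, key=experimental_rank.get)
    return dcg([rel[c] for c in predicted_order]) / dcg([rel[c] for c in ideal_order])

# tiny example with 4 hypothetical compounds
exp = {"c6": 1, "c20": 2, "c12": 3, "c1": 4}
fit = {"c6": 2, "c20": 1, "c12": 3, "c1": 4}   # docking rank
print(round(ndcg(exp, fit), 4))
```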
the 3d-qstr contour maps were utilized to visualize the information on the comfa and comsia model properties in three-dimensional space. these maps highlight characteristics of compounds that are crucial for activity and display the regions around the molecules where variance in activity is expected based on changes in physicochemical properties [52]. the analysis of the favorable and unfavorable regions of the steric, electrostatic, hydrophobic, hbd, and hba fields contributes to understanding the relationship between the aconitine alkaloids' toxic activity and their structure. steric and electrostatic contour maps of the comfa qstr model are shown in figure 6a,b, respectively; hydrophobic, hbd, and hba contour maps of the comsia qstr model are shown in figure 6c-e. compound 6 has the most toxic activity, so it was chosen as the reference structure for the generation of the comfa and comsia contour maps. in the case of the comfa study, the steric contour map around compound 6 is shown in figure 6a. the yellow regions near the r2, r7, and r6 substituents of the molecule indicate that these positions are not ideal for sterically bulky functional groups. therefore, compounds 19, 24, and 26 (with pld50 values of 1.17, 0.84, and 1.82, respectively), which carry sterically bulky esterified moieties at positions r2 and r7, were less toxic than compounds 6 and 20 (with pld50 values of 5.00 and 4.95), which are substituted by a small hydroxyl group, and compound 3 (with a pld50 value of 1.44) has less toxic activity due to the esterified moiety at r6. the green regions, which mark positions where sterically bulky groups favor toxicity, appeared mainly around the r9 substituent. the comfa electrostatic contour map is shown in figure 6b. the blue regions near the r2 and r7 substituents revealed that electropositive groups at these positions favor toxicity; this is supported by the fact that the compounds with hydroxy at these two positions had higher pld50 values than the compounds with acetoxy or no substituents. the red region surrounding the molecular scaffold was not distinct, which revealed that there was no clear connection between electronegative groups and the toxicity. the comsia hydrophobic contour map is shown in figure 6c. the white regions around r2, r6, and r7 indicated that hydrophobic groups are unfavorable for the toxicity, so esterification of the hydrophilic hydroxyl or dehydroxylation decreased the toxicity, which is consistent with the steric and electrostatic contour maps. the yellow contour near r12 indicated that a hydrophilic hydroxy at this position is unfavorable to the toxicity, which can be validated by the fact that aconitine alkaloids with a hydroxy substituent at r12 (e.g., compound 10) show relatively low pld50 values. the comsia contour map of hbd is shown in figure 6d. the cyan regions at r2, r6, and r7 represent favorable conditions for hbd atoms, which clearly validates the fact that the compounds with hydroxy in these regions show potent toxicity. a purple region was found near r12, which proved that an hbd atom (hydroxyl) in this region has an adverse effect on toxicity. the hba contour map is shown in figure 6e. the magenta region around the r1 substituent proved that this position is favorable for hba atoms, so compounds 13, 15, 32, and 33, with an hba atom at the r1 substituent, exhibit more potent toxicity (with pld50 values of 3.52, 3.30, 3.16, and 2.84) than compounds with methoxymethyl substituents (compounds 19, 24, and 26, with pld50 values of 1.17, 0.84, and 1.82). the red contours, where hba atoms are unfavorable for the toxicity, were positioned around r2 and r6; these contours are well validated by the lower pld50 values of compounds with carbonyl groups at these positions. the ppi network of aconitine alkaloid cardiotoxicity was divided into nine clusters using clusterone; statistical parameters are shown in figure 5. six clusters, namely clusters 1, 3, 4, 5, 7, and 9, which possess quality scores higher than 0.5, a density higher than 0.45, and a p-value less than 0.05, were selected for further analysis (figure 7). clusters 1, 4, and 7 consisted of proteins mainly involved in the effects of various calcium, potassium, and sodium channels. cluster 1 mainly consisted of three channel types related to the cardiotoxicity of aconitine alkaloids, cluster 4 contained calcium and sodium channels and some channel exchangers (such as ryr1 and ryr2), and cluster 7 mainly consisted of various potassium channels. all of these findings are consistent with previous research on the arrhythmogenic properties of aconitine alkaloid toxicity: aconitine binds to ion channels and affects their open state, and thus the corresponding ion influx into the cytosol [53-55]. the channel exchangers play a crucial role in maintaining ion transport and homeostasis inside and outside of the cell. cluster 9 contained some regulatory proteins that can activate or repress the ion channels at the protein expression level. atp2a1, ryr2, ryr1, cacna1c, cacna1d, and cacna1s mediate the release of calcium, thereby playing a key role in triggering cardiac muscle contraction and maintaining calcium homeostasis [56, 57]. aconitine may cause aberrant channel activation and lead to cardiac arrhythmia. clusters 3 and 5 consisted of camp-dependent protein kinase (capk), cgmp-dependent protein kinase (cgpk), and guanine nucleotide binding protein (g protein).
whether the cardiotoxicity induced by aconitine alkaloids is linked to capk, cgpk, and the g proteins has not been fully studied; however, some studies have shown that the cardiotoxicity-related protein kcnj3 (potassium inwardly-rectifying channel) is controlled by g proteins, and the cardiac sodium/calcium exchanger is said to be regulated by capk and cgpk [58, 59]. the results of clusterone indicated that the constructed network is consistent with existing studies and that the network can be used to screen essential proteins with the cytonca plugin. the protein 2v7o belongs to the camkii (calcium/calmodulin (ca2+/cam)-dependent serine/threonine kinase ii) isozyme protein family, which plays a central role in cellular signaling by transmitting ca2+ signals. the camkii enzymes transmit calcium ion (ca2+) signals released inside the cell by regulating signal transduction pathways through phosphorylation. ca2+ first binds to the small regulatory protein cam, and this ca2+/cam complex then binds to and activates the kinase, which then phosphorylates other proteins such as the ryanodine receptor and the sodium/calcium exchanger. thus, these proteins are related to the cardiotoxicity induced by aconitine alkaloids [60-62]. excessive activity of camkii has been observed in some structural heart diseases and arrhythmias [63], and past findings demonstrate neuroprotection in neuronal cultures treated with inhibitors of camkii immediately prior to excitotoxic activation of camkii [64]. the acute cardiotoxicity of the aconitine alkaloids is possibly related to this target. based on the analysis of the ppi network above, camkii was selected as the potential target for further molecular docking and dynamics simulation. the docking result for 2v7o is shown in figure 8a. compound 20 has the highest fit score, so it was selected as the template for conformational analysis. the mechanisms of camkii activation and inactivation are shown in figure 8b. compound 20 affects the normal energy metabolism of the myocardial cell via binding in the atp-competitive site (figure 8c). the inactive state of camkii is regulated by cask-mediated t306/t307 phosphorylation, and this state can be inhibited by the binding of compound 20 in the atp-competitive site. such binding moves camkii toward a ca2+/cam-dependent active state, which arises through structural rearrangement of the inhibitory helix caused by ca2+/cam binding and the subsequent autophosphorylation of t287 [65]; this will induce excessive activity of camkii and a dynamic imbalance of calcium ions in the myocardial cell, eventually leading to heart disease and arrhythmias.
ca 2+ first binds to the small regulatory protein cam, and this ca 2+ /cam complex then binds to and activates the kinase, which then phosphorylates other proteins such as ryanodine receptor and sodium/calcium the information of a binding pocket of a receptor for its ligand is very important for drug design, particularly for conducting mutagenesis studies [28] . as has been reported in the past [66] , the binding pocket of a protein receptor to a ligand is usually defined by those residues that have at least one heavy atom within a distance of 5 å from a heavy atom of the ligand. such a criterion was originally used to define the binding pocket of atp in the cdk5-nck5a complex [20] , which was later proved to be very useful in identifying functional domains and stimulating the relevant truncation experiments. a similar approach has also been used to define the binding pockets of many other receptor-ligand interactions important for drug design [30, 31, 33, [67] [68] [69] [70] . the information of a binding pocket of camkii for the aconitine alkaloids will serve as a guideline for designing drugs with similar scaffolds, particularly for conducting mutagenesis studies. in figure 8a , four top fit scores-compounds 1, 6, 12, and 20-generated similar significant interactions with amino acid residues around the atp-competitive binding pocket. four compounds formed with many van der waals interactions within the noncompetitive inhibitor pocket through amino acid residues such as asp157, lys43, glu140, lys22, and leu143. the ligand-receptor interaction showed that the hydroxy in r2 formed a side chain donor interaction with asp157. in addition, the hydroxy in r6 and r7 also formed a side chain acceptor interaction with glu140 and ser26, respectively (the docking result of compounds 6 and 12 in figure 8a ). these results correspond to the comfa and comsia contour maps. however, the small electropositive and hydrophilic group in r2, r6, and r7 possess a certain enhancement function to toxicity. there were aromatic interactions between the phenyl group in r9 and amino acid residues. the phenyl group in r9 formed aromatic interactions with leu20, leu142, and phe90, while the small group hydroxyl did not form any interaction with asp91, which demonstrate that bulky phenyl group is crucial to this binding pattern and toxicity. this was mainly equal to the comfa steric contour map, where r9 was ideal for sterically favorable groups. the methoxymethyl r1 generated backbone acceptor with lys43, which correspond to the comsia hba contour map, where r1 was favorable for the hba atom. compound 20 docked into 2v7o, and the atp-competitive pocket was painted green; the t287, t307, and t308 phosphorylation sites were painted green, orange, and yellow, respectively; the inhibitory helix was painted red. the result of md simulation is shown in figure 9 . the red plot represented the rmsd values of the docked protein. the values of rmsd reached 2.41 å in 1.4 ns and then remained between 2 and 2.5 å throughout the simulation for up to 5 ns. the averaged value of the rmsd was 2.06 å. the md simulation demonstrated that the ligand was stabilized in the active site. finally, we combined the ligand-based 3d-qstr analysis with the structure-based molecular docking study to identify the necessary moiety related to the cardiotoxicity mechanism of the aconitine alkaloids (in figure 10 ). 
finally, we combined the ligand-based 3d-qstr analysis with the structure-based molecular docking study to identify the necessary moieties related to the cardiotoxicity mechanism of the aconitine alkaloids (figure 10). figure 10. the crucial requirements of the cardiotoxicity mechanism were obtained from the ligand-based 3d-qstr and structure-based molecular docking studies. to build the ppi network of aconitine alkaloids, literature from 1 january 2007 to february 2017 was retrieved from pubmed (http://pubmed.cn/) and web of science (http://www.isiknowledge.com/) with the mesh words "aconitine" and "toxicity" and without language restriction. all documents about cardiotoxicity caused by aconitine alkaloids were collected. the proteins related to aconitine alkaloid cardiotoxicity over this decade were gathered and taken as the input proteins in the string (https://string-db.org/) database [51, 71], which was used to search for related proteins or pathways that had been reported. finally, all the proteins and their partners were recorded in excel in order to import the information and build a ppi network in the cytoscape software. cytoscape is a free, open-source java application for visualizing molecular networks and integrating them with gene expression profiles [71, 72]. plugins are available for network and molecular profiling analyses, new layouts, additional file format support, making connections with databases, and searching within large networks [71]. clusterone (clustering with overlapping neighborhood expansion) in cytoscape was utilized to cluster the ppi network into overlapping sub-graphs of highly interconnected nodes. clusterone is a plugin for detecting and clustering potentially overlapping protein complexes from ppi data. the quality of a group was assessed by the number of sub-graph nodes, the p-value, and the density. a cluster was discarded when the number of sub-graph nodes was smaller than 3, the density was less than 0.45, the quality was less than 0.5, or the p-value was greater than 0.05 [73]. the clustering results of clusterone are instrumental to understanding how the reliability of the ppi network relates to aconitine alkaloid cardiotoxicity. cytonca is a plugin in cytoscape integrating calculation, evaluation, and visualization analysis for multiple centrality measures. there are eight centrality measurements provided by cytonca: betweenness, closeness, degree, eigenvector, local average connectivity-based, network, subgraph, and information centrality [74].
the primary purpose of the centrality analysis was to confirm the essential proteins in the pre-built ppi network. three centrality measures in the cytonca plugin (subgraph centrality, betweenness centrality, and closeness centrality) were used for evaluating and screening the essential proteins in the merged target network.

the subgraph centrality characterizes the participation of each node in all subgraphs of a network. smaller subgraphs are given more weight than larger ones, which makes this measure an appropriate one for characterizing network properties. the subgraph centrality of a node u can be calculated by [75]

sc(u) = ∑_{l=0}^{∞} µ_l(u)/l! = ∑_{v=1}^{n} (v_v^u)^2 e^{λ_v},

where µ_l(u) is the uth diagonal entry of the lth power of the weighted adjacency matrix a of the network; v_1, v_2, ..., v_n is an orthonormal basis of r^n composed of eigenvectors of a associated with the eigenvalues λ_1, λ_2, ..., λ_n; and v_v^u is the uth component of v_v [75].

the betweenness centrality finds a wide range of applications in network theory. it represents the degree to which nodes stand between each other, and it was devised as a general measure of centrality applicable to a wide range of problems in network theory, including problems related to social networks, biology, transport, and scientific cooperation. the betweenness centrality of a node u can be calculated by [76]

bc(u) = ∑_{s≠u≠t} ρ(s, u, t)/ρ(s, t),

where ρ(s, t) is the total number of shortest paths from node s to node t, and ρ(s, u, t) is the number of those paths that pass through u.

the closeness centrality of a node is a measure of centrality in a network, calculated from the lengths of the shortest paths between the node and all other nodes in the graph; thus, the more central a node is, the closer it is to all other nodes. the closeness centrality of a node u can be calculated by [77]

cc(u) = |n_u| / ∑_{v∈n_u} dist(u, v),

where |n_u| is the number of node u's neighbors, and dist(u, v) is the distance of the shortest path from node u to node v.

pharmmapper serves as a valuable tool for identifying potential targets for a novel synthetic compound, a newly isolated natural product, a compound with known biological activity, or an existing drug [78]. of all the aconitine alkaloids in this research, compounds 6, 12, and 20 exhibited the most toxic activity and were used for the potential target prediction. the mol2 format of the three compounds was submitted to the pharmmapper server. the 'generate conformers' and 'maximum generated conformations' parameters were set to on and 300, respectively; other parameters used default values. finally, the results of clusterone and pharmmapper were combined to select the potential targets for the following docking study [78].

comparative molecular field analysis (comfa) and comparative molecular similarity index analysis (comsia) are efficient tools in ligand-based drug design and are used for contour map generation and identification of favorable and unfavorable regions in a moiety [52,79]. comfa consists of steric and electrostatic contour maps of molecules that are correlated with toxic activity, while comsia consists of hydrophobic, hydrogen bond donor (hbd)/hydrogen bond acceptor (hba) [80], and steric/electrostatic fields that are correlated with toxic activity. comfa and comsia were utilized to generate the 3d-qstr model [81]. all molecular modeling and the generation of the 3d-qstr model were performed with sybyl x2.0. the ld 50 values in mice of the aconitine alkaloids listed in table 4 were extracted from recent literature [70].
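the three centrality measures defined above can also be computed outside cytonca; a minimal sketch with networkx is shown below for a toy graph. the edge list and node names are invented for illustration only, and this is not the authors' pipeline.

```python
import networkx as nx

# toy protein-protein interaction graph; node names are illustrative only
edges = [("camk2g", "ryr2"), ("camk2g", "pln"), ("camk2g", "atp2a1"),
         ("ryr2", "pln"), ("atp2a1", "pln"), ("camk2g", "scn5a")]
g = nx.Graph(edges)

subgraph = nx.subgraph_centrality(g)       # Estrada subgraph centrality
betweenness = nx.betweenness_centrality(g)
# note: networkx normalizes closeness as (n-1)/sum(dist), which differs slightly
# from the |N_u|/sum(dist) form quoted in the text
closeness = nx.closeness_centrality(g)

for node in g.nodes:
    print(f"{node:8s} sc={subgraph[node]:7.3f} bc={betweenness[node]:5.3f} cc={closeness[node]:5.3f}")
```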
the ld 50 values of all aconitine alkaloids were converted into pld 50 values, which were used as the dependent variable, while the comfa and comsia descriptors were used as independent variables. the sketch function of sybyl x2.0 was utilized to draw the structures, and the charges were calculated by the gasteiger-huckel method. additionally, the tripos force field was utilized for energy minimization of these aconitine alkaloid molecules [81]. the 31 molecules were divided into training and test sets at a ratio of 3:1, in such a way that both sets are balanced and consist of both active and less active molecules [81]. the reliability of the 3d-qstr model depends on the molecular alignment of the database. the most toxic aconitine alkaloid (compound 6) was selected as the template molecule, and the tetradecahydro-2h-3,6,12-(epiethane[1,1,2]triyl)-7,9-methanonaphtho[2,3-b]azocine was selected as the common moiety.

pls (partial least squares) techniques associate the field descriptors with the activity values and yield statistics such as [80] the leave-one-out (loo) values, the optimal number of components, the standard error of estimation (see), the cross-validated coefficient (q 2 ), and the conventional coefficient (r 2 ). these statistical data are pivotal in the evaluation of the 3d-qstr model and can be worked out with the pls method [81]. the model is said to be good when the q 2 value is more than 0.5 and the r 2 value is more than 0.6; the q 2 and r 2 values reflect a model's soundness. the best model has the highest q 2 and r 2 values, the lowest see, and an optimal number of components [80,82,83]. in the comfa and comsia analyses, the values of the optimal number of components, see, and q 2 can be worked out by loo validation, with 'use sampls' turned on and the number of components set to 5, while in the process of calculating r 2 , 'use sampls' was turned off and the column filtering was set to 2.0 kcal mol −1 in order to speed up the calculation without sacrificing information content [81-84]. accordingly, the components were set to 6 and 4, respectively, which were the optimal numbers of components calculated by performing a sampls run. see and r 2 were utilized to assess the non-cross-validated model.

the applicability domain (ad) of the topomer comfa and comsia models was confirmed by the williams plot of residuals vs. leverage. the leverage of a query chemical is proportional to its mahalanobis distance from the centroid of the training set [85,86]. the leverages are calculated for a given dataset x by obtaining the leverage matrix (h) with the equation below:

h = x (x t x) −1 x t ,

where x is the model matrix and x t is its transpose. the plot of standardized residuals vs. leverage values was drawn, and compounds with standardized residuals greater than three standard deviation units (±3σ) were considered as outliers [85]. the critical leverage value is considered to be 3p/n, where p is the number of model variables plus one, and n is the number of objects used to calculate the model; when h > 3p/n, the predicted response is not acceptable [85-87].

moe (molecular operating environment) is a computer-aided drug design (cadd) software program that incorporates the functions of qsar, molecular docking, molecular dynamics, adme (absorption, distribution, metabolism, and excretion), and homology modeling. all of these functions are regarded as conducive instruments in the field of drug discovery and biochemistry.
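the leverage and applicability-domain criteria above translate directly into a few lines of linear algebra. below is a minimal sketch, assuming the descriptor matrix of the training set and the model residuals are already available as numpy arrays; the arrays shown are randomly generated placeholders, not the authors' data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_vars = 31, 5                        # 31 alkaloids, 5 descriptors (placeholder)
x = rng.normal(size=(n, p_vars))         # model matrix (placeholder values)
residuals = rng.normal(scale=0.3, size=n)  # placeholder prediction residuals

# leverage matrix H = X (X^T X)^-1 X^T ; leverages are its diagonal entries
h_matrix = x @ np.linalg.inv(x.T @ x) @ x.T
leverage = np.diag(h_matrix)

p = p_vars + 1                           # number of model variables plus one
h_star = 3.0 * p / n                     # critical leverage 3p/n
std_resid = residuals / residuals.std(ddof=1)

flagged = np.where((np.abs(std_resid) > 3) | (leverage > h_star))[0]
print("critical leverage h* = %.3f" % h_star)
print("compounds flagged on the Williams plot:", flagged.tolist())
```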
the molecular docking and molecular dynamics calculations were performed in moe2016 software to examine the stability and affinity between the ligands and the predicted targets [88,89]. the docking process involves the prediction of the ligand conformation and orientation within a targeted binding site, and docking analysis is an important step in this process; it has been widely used to study reasonable binding modes and to obtain information on the interactions between the amino acids in active protein sites and the ligands. the molecular docking analysis was carried out to determine the toxicity-related moiety of the aconitine alkaloids through the ligand-amino-acid interaction function in moe2015. the pdb format of 2v7o and 2vz6 was downloaded from the pdb (protein data bank) database (https://www.rcsb.org/), and the mol2 format of the compounds was taken from the sybyl software of the qstr research. the structure preparation function in moe was used to minimize the energy and optimize the structure of the protein skeleton. based on the london dg score and induced fit refinement, all compounds were docked into the active site of every potential target, taking the score values as the scoring function [90].

the dcg (discounted cumulative gain) algorithm was utilized to examine the consistency between the ranking by pld 50 and the ranking obtained in our research (the fit scores of the docking study). the dcg of a ranked list can be written as dcg = ∑_i rel_i / log_2(i + 1), where the relevance rel_i here refers to the pld 50 of the compound ranked in position i; the idcg (ideal dcg) is the dcg of the list ordered by pld 50 . the closer the normalized discounted cumulative gain (ndcg = dcg/idcg) value is to 1, the better the consistency [91].

preliminary md simulations for the model protein were performed using the program namd (nanoscale molecular dynamics program, v 2.9), and all input files were generated using visual molecular dynamics (vmd). namd is a freely available software package designed for high-performance simulation of large biomolecular systems [92]. during the md simulation, minimization and equilibration of the original and docked proteins were carried out in a 15 å water box. a charmm 22 force field file was applied for energy minimization and equilibration with gasteiger-huckel charges, using boltzmann initial velocities [93,94]. the integrator parameters included a 2 fs time step with rigid bonds; nonbonded interactions were evaluated every step and full electrostatic evaluations every 2 steps, with 10 steps in each cycle [93]. the particle mesh ewald method was used for the electrostatic interactions of the simulation system under periodic boundary conditions, with a grid spacing of 1.0 å [94]. the pressure was maintained at 101.325 kpa using the langevin piston, and the temperature was controlled at 310 k using langevin dynamics. covalent interactions between hydrogen and heavy atoms were constrained using the shake/rattle algorithm. finally, 5 ns md simulations of the original and docked proteins were carried out to compare and verify the binding affinity and stability of the ligand-receptor complex.

a strategy combining network analysis and in silico methods was carried out to illustrate the qstr and the toxicity mechanism of the aconitine alkaloids. the 3d-qstr models were built in sybyl with internal robustness and high external predictivity, enabling identification of the pivotal molecular moieties related to toxicity in aconitine alkaloids. the comfa model had q 2 , r 2 , optimum component, and correlation coefficient (r 2 ) values of 0.624, 0.966, 6, and 0.9698, respectively, and the comsia model had q 2 , r 2 , optimum component, and correlation coefficient (r 2 ) values of 0.719, 0.901, 4, and 0.9770. the network was built with cytoscape software and the string database, which demonstrated the reliability of the cluster analysis. the 2v7o and 2vz6 proteins were identified as potential targets with the cytonca plugin and the pharmmapper server, and the interactions between the aconitine alkaloids and key amino acids were examined in the docking study. the result of the docking study is consistent with the experimental pld 50 . the md simulation indicated that the aconitine alkaloids exhibit potent binding affinity and stability toward the receptor camk2g. finally, we incorporated the pivotal molecular moieties and the ligand-receptor interactions to rationalize the qstr of the aconitine alkaloids. this research serves as a guideline for studies of toxicity, including neuro-, reproductive, and embryo-toxicity. with a deep understanding of the relationship between the toxicity and the structure of aconitine alkaloids, subsequent structural modification of aconitine alkaloids can be carried out to enhance their efficacy and to reduce their toxic side effects. based on such research, aconitine alkaloids can be brought closer to medical and clinical applications. in addition, as pointed out in past research [95], user-friendly and publicly accessible web servers represent the future direction of reporting various important computational analyses and findings [96-109]; they have significantly enhanced the impact of computational biology on medical science [110,111]. the research in this paper will serve as a foundation for constructing web servers for qstr studies and target identification of compounds. immunomodulating agents of plant origin.
i: preliminary screening chinese drugs plant origin aconitine poisoning: a global perspective ventricular tachycardia after ingestion of ayurveda herbal antidiarrheal medication containing aconitum fatal accidental aconitine poisoning following ingestion of chinese herbal medicine: a report of two cases five cases of aconite poisoning: toxicokinetics of aconitines a case of fatal aconitine poisoning by monkshood ingestion determination of aconitine and hypaconitine in gucixiaotong ye by capillary electrophoresis with field-amplified sample injection a clinical study in epidural injection with lappaconitine compound for post-operative analgesia therapeutic effects of il-12 combined with benzoylmesaconine, a non-toxic aconitine-hydrolysate, against herpes simplex virus type 1 infection in mice following thermal injury aconitine: a potential novel treatment for systemic lupus erythematosus aconitine-containing agent enhances antitumor activity of dichloroacetate against ehrlich carcinoma complex discovery from weighted ppi networks prediction and analysis of the protein interactome in pseudomonas aeruginosa to enable network-based drug target selection the string database in 2017: quality-controlled protein-protein association networks, made broadly accessible identification of functional modules in a ppi network by clique percolation clustering united complex centrality for identification of essential proteins from ppi networks the ppi network and cluster one analysis to explain the mechanism of bladder cancer the progress of novel drug delivery systems mitochondrial uncoupling protein 2 structure determined by nmr molecular fragment searching structural basis for membrane anchoring of hiv-1 envelope spike unusual architecture of the p7 channel from hepatitis c virus architecture of the mitochondrial calcium uniporter structure and mechanism of the m2 proton channel of influenza a virus computer-aided drug design using sesquiterpene lactones as sources of new structures with potential activity against infectious neglected diseases successful in silico discovery of novel nonsteroidal ligands for human sex hormone binding globulin in silico discovery of novel ligands for antimicrobial lipopeptides for computer-aided drug design structural bioinformatics and its impact to biomedical science coupling interaction between thromboxane a2 receptor and alpha-13 subunit of guanine nucleotide-binding protein prediction of the tertiary structure and substrate binding site of caspase-8 study of drug resistance of chicken influenza a virus (h5n1) from homology-modeled 3d structures of neuraminidases insights from investigating the interaction of oseltamivir (tamiflu)with neuraminidase of the 2009 h1 n1 swine flu virus prediction of the tertiary structure of a caspase-9/inhibitor complex design novel dual agonists for treating type-2 diabetes by targeting peroxisome proliferator-activated receptors with core hopping approach heuristic molecular lipophilicity potential (hmlp): a 2d-qsar study to ladh of molecular family pyrazole and derivatives fragment-based quantitative structure & ndash; activity relationship (fb-qsar) for fragment-based drug design investigation into adamantane-based m2 inhibitors with fb-qsar hp-lattice qsar for dynein proteins: experimental proteomics (2d-electrophoresis, mass spectrometry) and theoretic study of a leishmania infantum sequence the biological functions of low-frequency phonons: 2. 
cooperative effects low-frequency collective motion in biomacromolecules and its biological functions quasi-continuum models of twist-like and accordion-like low-frequency motions in dna collective motion in dna and its role in drug intercalation biophysical aspects of neutron scattering from vibrational modes of proteins biological functions of soliton and extra electron motion in dna structure low-frequency resonance and cooperativity of hemoglobin solitary wave dynamics as a mechanism for explaining the internal motion during microtubule growth designed electromagnetic pulsed therapy: clinical applications steps to the clinic with elf emf molecular dynamics study of the connection between flap closing and binding of fullerene-based inhibitors of the hiv-1 protease molecular dynamics studies on the interactions of ptp1b with inhibitors: from the first phosphate-binding site to the second one the cambridge structural database: a quarter of a million crystal structures and rising molecular similarity indices in a comparative analysis (comsia) of drug molecules to correlate and predict their biological activity single channel analysis of aconitine blockade of calcium channels in rat myocardiocytes conversion of the sodium channel activator aconitine into a potent alpha 7-selective nicotinic ligand aconitine blocks herg and kv1.5 potassium channels inactivation of ca 2+ release channels (ryanodine receptors ryr1 and ryr2) with rapid steps in [ca 2+ ] and voltage targeted disruption of the atp2a1 gene encoding the sarco(endo)plasmic reticulum ca 2+ atpase isoform 1 (serca1) impairs diaphragm function and is lethal in neonatal mice cyclic gmp-dependent protein kinase activity in rat pulmonary microvascular endothelial cells different g proteins mediate somatostatin-induced inward rectifier k + currents in murine brain and endocrine cells cardiac myocyte calcium transport in phospholamban knockout mouse: relaxation and endogenous camkii effects inhibition of camkii phosphorylation of ryr2 prevents induction of atrial fibrillation in fkbp12.6 knock-out mice regulation of ca 2+ and electrical alternans in cardiac myocytes: role of camkii and repolarizing currents the role of calmodulin kinase ii in myocardial physiology and disease excitotoxic neuroprotection and vulnerability with camkii inhibition structure of the camkiiδ/calmodulin complex reveals the molecular mechanism of camkii kinase activation a model of the complex between cyclin-dependent kinase 5 and the activation domain of neuronal cdk5 activator binding mechanism of coronavirus main proteinase with ligands and its implication to drug design against sars an in-depth analysis of the biological functional studies based on the nmr m2 channel structure of influenza a virus molecular therapeutic target for type-2 diabetes novel inhibitor design for hemagglutinin against h1n1 influenza virus by core hopping method the string database in 2011: functional interaction networks of proteins, globally integrated and scored cytoscape: a software environment for integrated models of biomolecular interaction networks detecting overlapping protein complexes in protein-protein interaction networks cytonca: a cytoscape plugin for centrality analysis and evaluation of protein interaction networks subgraph centrality and clustering in complex hyper-networks ranking closeness centrality for large-scale social networks enhancing the enrichment of pharmacophore-based target prediction for the polypharmacological profiles of drugs comparative molecular 
field analysis (comfa). 1. effect of shape on binding of steroids to carrier proteins sample-distance partial least squares: pls optimized for many variables, with application to comfa a qsar analysis of toxicity of aconitum alkaloids recent advances in qsar and their applications in predicting the activities of chemical molecules, peptides and proteins for drug design unified qsar approach to antimicrobials. 4. multi-target qsar modeling and comparative multi-distance study of the giant components of antiviral drug-drug complex networks comfa qsar models of camptothecin analogues based on the distinctive sar features of combined abc, cd and e ring substitutions applicability domain for qsar models: where theory meets reality comparison of different approaches to define the applicability domain of qsar models molecular docking and qsar analysis of naphthyridone derivatives as atad2 bromodomain inhibitors: application of comfa, ls-svm, and rbf neural network concise applications of molecular modeling software-moe medicinal chemistry and the molecular operating environment (moe): application of qsar and molecular docking to drug discovery qsar models of cytochrome p450 enzyme 1a2 inhibitors using comfa, comsia and hqsar estimating a ranked list of human hereditary diseases for clinical phenotypes by using weighted bipartite network biomolecular simulation on thousands processors molecular dynamics and docking investigations of several zoanthamine-type marine alkaloids as matrix metaloproteinase-1 inhibitors salts influence cathechins and flavonoids encapsulation in liposomes: a molecular dynamics investigation review: recent advances in developing web-servers for predicting protein attributes irna-ai: identifying the adenosine to inosine editing sites in rna sequences iss-psednc: identifying splicing sites using pseudo dinucleotide composition irna-pseu: identifying rna pseudouridine sites ploc-mplant: predict subcellular localization of multi-location plant proteins by incorporating the optimal go information into general pseaac ploc-mhum: predict subcellular localization of multi-location human proteins via general pseaac to winnow out the crucial go information iatc-misf: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals psuc-lys: predict lysine succinylation sites in proteins with pseaac and ensemble random forest approach irnam5c-psednc: identifying rna 5-methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition ikcr-pseens: identify lysine crotonylation sites in histone proteins with pseudo components and ensemble classifier iacp: a sequence-based tool for identifying anticancer peptides ploc-meuk: predict subcellular localization of multi-label eukaryotic proteins by extracting the key go information into general pseaac iatc-mhyb: a hybrid multi-label classifier for predicting the classification of anatomical therapeutic chemicals ihsp-pseraaac: identifying the heat shock protein families using pseudo reduced amino acid alphabet composition irna-psecoll: identifying the occurrence sites of different rna modifications by incorporating collective effects of nucleotides into pseknc impacts of bioinformatics to medicinal chemistry an unprecedented revolution in medicinal chemistry driven by the progress of biological science this article is an open access article distributed under the terms and conditions of the creative commons attribution (cc by) license key: cord-034833-ynti5g8j authors: nosonovsky, 
michael; roy, prosun title: scaling in colloidal and biological networks date: 2020-06-04 journal: entropy (basel) doi: 10.3390/e22060622 sha: doc_id: 34833 cord_uid: ynti5g8j scaling and dimensional analysis are applied to networks that describe various physical systems. some of these networks possess fractal, scale-free, and small-world properties. the amount of information contained in a network is found by calculating its shannon entropy. first, we consider networks arising from granular and colloidal systems (small colloidal and droplet clusters) due to pairwise interaction between the particles. many networks found in colloidal science possess self-organizing properties due to the effect of percolation and/or self-organized criticality. then, we discuss the allometric laws in branching vascular networks, artificial neural networks, cortical neural networks, as well as immune networks, which serve as a source of inspiration for both surface engineering and information technology. scaling relationships in complex networks of neurons, which are organized in the neocortex in a hierarchical manner, suggest that the characteristic time constant is independent of brain size when an interspecies comparison is conducted. the information content, scaling, dimensional, and topological properties of these networks are discussed.

scaling methods and dimensional analysis are widely used in various areas of physics. the concepts of fractals (scale-free objects), power exponents, and near-critical behavior are central to the study of scaling. one particularly interesting field of application is biophysical problems, where both experimental observations and theoretical explanations have been suggested of how various quantitative characteristics of a living organism (for example, the rate of metabolism) depend on its mass and linear size. the area is often referred to as allometry [1-6]. one of the most widely known examples of an allometric scaling relationship is the empirical kleiber law, which is based on a comparison of the metabolism rates, b, of different species. kleiber [1] established a power law dependency of b upon the mass of an animal with the exponent 3/4, b ∝ m^(3/4). the explanation of the value of the exponent 3/4 was suggested in an influential paper by west et al. [2], who used a fractal model of branching of blood vessels serving a certain volume, with the conservation of the cross-sectional area of the vessels and of the volume covered at every stage of branching. despite its shortcomings [3-6], the fractal theory by west et al. [2] remains the main explanation of the allometric scaling exponents, and it can likely be expanded to such topics as the analysis of the ergodicity in the vascular network [7]. besides the biophysical applications, scaling problems are also prominent in those areas of physics where the so-called intermediate asymptotic behavior is found [8]. a typical example of an intermediate asymptotic process would be a transition from a single object to a thermodynamic description of a continuum medium. the intermediate ensemble of a few natural or artificial objects, which bridges between a single particle and continuum matter, is often called a cluster, and it possesses many properties absent from both the single-particle and the continuum medium description [9].
clusters are defined as collections of somewhat similar, but not necessarily identical, objects. in physics and chemistry, clusters are often built of liquid droplets or small solid particles, such as nanoparticles and colloidal particles [9]. from the physical point of view, they occupy an intermediate position between individual objects and bulk material consisting of a large number of objects, to which the methods of statistical physics and thermodynamics are often applied. since clusters bridge between individual particles and bulk matter, they may have certain unique collective properties absent in both individual objects and bulk materials. given that there are interactions between the particles in a cluster, the latter can be modeled by a network. a network is a graph, which consists of nodes (vertices) connected by edges. the clustering coefficient is defined as the ratio of the number of closed triplets to the number of all triplets. this coefficient is a measure of the degree to which nodes in a graph tend to cluster together; it is used to quantify aggregation in granular media, indicating the degree of clustering of particles in the cluster.

two physical concepts relevant to the self-organization of various networks are self-organized criticality (soc) and percolation [14-16,18,19]. there is a big class of dynamical systems that operate in such a manner that they always tune themselves to the critical state, where the stability of the system is lost. since the 1980s, it has been suggested that a very specific type of self-organization, called soc, plays a role in diverse "avalanche-like" processes. typically, energy is accumulated in these systems until the critical state is reached, and then the energy is suddenly released. examples are various avalanche systems, including those describing landslides, earthquakes, and frictional stick-slip. a random perturbation can trigger an avalanche (or a slip event) in such a system. the magnitude of the avalanche cannot be predicted in advance, because it is random. after the release, the system returns to the stable state for some time until the next event is triggered. the amplitudes of the events have the statistical characteristics of critical behavior, such as universality, critical exponents, and fractality. a famous example is the power law that relates the frequency and the magnitude of earthquakes, known as the gutenberg-richter law. similar behavior is observed in frictional stick-slip systems and in many other systems [15]. the best-studied example of soc is the "sandpile model", which represents a conical pile of sand with new grains of sand randomly placed onto the pile (figure 2). when the slope exceeds a threshold value (the critical slope angle is related to the coefficient of dry friction between the grains), a grain moves down the slope.
placing a random grain at a particular site may have no effect, or it may trigger an avalanche that will affect many sites of the lattice. thus, the response does not depend on the details of the perturbation. it is worth mentioning that the scale of the avalanche may be much greater than the scale of the initial perturbation.

figure 2. the sand-pile conceptual model of self-organized criticality (soc). the pile tends to have a slope angle defined by the friction between grains. adding one new grain to the pile may have no effect (the grain is at rest) or it may cause an avalanche. the magnitude and frequency of avalanches are inversely related (based on [15]).

the concept has been applied to such diverse fields as physics, cellular automata theory, biology, economics, sociology, linguistics, and others. there are typical external signs of an soc system, such as the power-law behavior (the magnitude distribution of the avalanches) and the 'one-over-frequency' noise distribution (the amplitude of random fluctuations is inversely proportional to the frequency). of course, not every system with a one-over-frequency spectrum is an soc system, and the one-over-frequency noise (referred to also as flicker or pink noise) may be present also in non-soc systems.
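the sandpile picture above can be made concrete with the classic bak-tang-wiesenfeld cellular automaton; a minimal sketch is given below. it is an illustration of soc in general, not a model taken from the cited works, and the lattice size and number of added grains are arbitrary.

```python
import numpy as np

def add_grain(grid: np.ndarray, rng: np.random.Generator) -> int:
    """Drop one grain at a random site and relax the pile; return the avalanche size."""
    L = grid.shape[0]
    i, j = rng.integers(L), rng.integers(L)
    grid[i, j] += 1
    topplings = 0
    unstable = [(i, j)] if grid[i, j] >= 4 else []
    while unstable:
        x, y = unstable.pop()
        if grid[x, y] < 4:
            continue
        grid[x, y] -= 4              # topple: send one grain to each neighbor
        topplings += 1
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < L and 0 <= ny < L:   # grains falling off the edge are lost
                grid[nx, ny] += 1
                if grid[nx, ny] >= 4:
                    unstable.append((nx, ny))
    return topplings

rng = np.random.default_rng(1)
grid = np.zeros((50, 50), dtype=int)
sizes = [s for s in (add_grain(grid, rng) for _ in range(50000)) if s > 0]
# the avalanche-size histogram should approach a power law, the hallmark of soc
hist, edges = np.histogram(sizes, bins=np.logspace(0, 4, 20))
print(hist)
```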
another important concept, which is related to soc, is percolation [15,18]. typically, during percolation, a certain controlled parameter is slowly changed; for example, nodes are removed from a network or conducting sites are added (figure 3a), or the shear force is increased in a system with friction. when a critical value of the controlled parameter is reached, an avalanche can be triggered, and a corresponding output parameter, such as the correlation length, may reach infinity (figure 3b). for example, the correlation length can characterize the average size of black or white islands on a field of the opposite color, when random pixels are added (figure 3c). at the critical point, the configuration is fractal (an infinite set of white islands on the black background forming larger black islands on the white background, and so on).

figure 3. (a) the first four iteration steps of a percolation lattice are shown; at the eighth step, all sites become active [18]. (b) a typical dependency of the correlation length on the shear load for an avalanche; at the critical value of the load, τ0, the correlation length approaches infinity. (c) with increasing normal load, the size of the slip zone spots (black) increases; a transition to global sliding is expected when the correlation length approaches infinity.
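as an illustration of the percolation transition described above, the following minimal sketch estimates the spanning probability of random site percolation on a square lattice as the occupation probability is varied. it uses scipy's connected-component labeling and is not taken from the cited references; the lattice size and trial counts are arbitrary.

```python
import numpy as np
from scipy import ndimage

def spans(p: float, L: int, rng: np.random.Generator) -> bool:
    """Return True if an occupied cluster connects the top and bottom rows."""
    occupied = rng.random((L, L)) < p
    labels, _ = ndimage.label(occupied)           # 4-connected clusters
    top = set(labels[0, :][labels[0, :] > 0])
    bottom = set(labels[-1, :][labels[-1, :] > 0])
    return len(top & bottom) > 0

rng = np.random.default_rng(2)
L, trials = 64, 200
for p in (0.50, 0.55, 0.59, 0.63, 0.70):          # site-percolation threshold ~0.593
    prob = sum(spans(p, L, rng) for _ in range(trials)) / trials
    print(f"p = {p:.2f}  spanning probability = {prob:.2f}")
```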
the topological concepts from the theory of networks turn out to be useful for the physical characterization of the packing of granular material, colloidal crystals, and clusters of droplets and colloidal particles. the so-called force network is important for the understanding of the packing of granular material. figure 4 shows a granular material (blue and black circles represent granules) flowing in a channel. some grains are tightly packed and apply a force upon their neighbors, while others are loose and do not transmit force. with increasing pressure in the channel, the number of force-transmitting grains (black) increases, and when a chain of black grains spans the entire width of the channel, the channel is jammed, which is called the jamming transition.

figure 4. a granular material flowing in a channel (based on [20,21]); the force is transmitted through chains of connecting grains shown in black.

the force network connects the centers of mass of each pair of grains that have a force-transmitting contact. such a network presentation provides key insights for understanding the mechanical response of a soil or sand heap. moreover, percolation, i.e., the formation of a force-transmitting chain in such a network, corresponds to the jamming transition in the granular material [20,21]. the percolation phenomenon in application to networks will be discussed in more detail in the subsequent section. near-percolation behavior is known to demonstrate scale-free features. in particular, the small-world effect and scale-free behavior were reported for packing problems related to the aggregation of granular media, which employ the so-called "apollonian packing" [22]. for the simple packing of identical (monodisperse) particles, the force networks do not possess self-similar, scale-free, or small-world properties. however, to achieve high packing densities, the apollonian construction can be employed (figure 5). such a construction involves a multiscale set of circles with smaller circles fitting the space between larger ones. this may be needed for high-performance concrete (hpc) and certain ultra-strong ceramics. andrade et al. [22] found that the force networks resulting from the apollonian construction, which they called apollonian networks (ans), have many special properties, including the scale-free and small-world effects.
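an apollonian network of the kind discussed here can be generated recursively: starting from a triangle, each triangular face receives a new node connected to its three corners. the sketch below builds a few generations with networkx and prints basic small-world indicators; it is an illustration under these assumptions, not the construction used in [22].

```python
import networkx as nx

def apollonian_network(generations: int) -> nx.Graph:
    """Recursively build an Apollonian network starting from a single triangle."""
    g = nx.Graph([(0, 1), (1, 2), (0, 2)])
    faces = [(0, 1, 2)]
    next_node = 3
    for _ in range(generations):
        new_faces = []
        for (a, b, c) in faces:
            d = next_node
            next_node += 1
            g.add_edges_from([(d, a), (d, b), (d, c)])   # new node inside the face
            new_faces += [(a, b, d), (a, c, d), (b, c, d)]
        faces = new_faces
    return g

g = apollonian_network(5)
print("nodes:", g.number_of_nodes(), "edges:", g.number_of_edges())
print("average clustering coefficient:", round(nx.average_clustering(g), 3))
print("average shortest path length:", round(nx.average_shortest_path_length(g), 3))
degrees = sorted((d for _, d in g.degree()), reverse=True)
print("largest degrees (hubs):", degrees[:5])   # heavy-tailed degree distribution
```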
even more interesting phenomena where the theory of networks can be applied are small droplet clusters [23-27] and small colloidal crystals. self-assembled clusters of condensed microdroplets (with a typical diameter of dozens of microns) are formed in an ascending flow of vapor and air above a locally heated thin (approximately 1 mm) layer of water. the droplets form a 2d monolayer levitating at a low height (comparable with their radii), where their weight is equilibrated by the drag force from the ascending vapor flow. due to the aerodynamic interaction (repulsion) between the droplets and their migration toward the center of the heated spot, they tend to form an ordered structure (figure 6).
for large clusters consisting of many dozens or hundreds of droplets, a hexagonally symmetric (honeycomb) structure is typically formed. however, for small clusters, more complex symmetric structures can form in comparison with those in large clusters. for example, the applicability of the mathematical ade-classification, or the so-called simply laced dynkin diagrams, has been suggested [27]. small clusters can be used for the in situ tracking of bioaerosols and biomolecules [25]. the method of voronoi entropy is applied to characterize quantitatively the degree of orderliness of the geometric arrangement of the droplet clusters. the voronoi entropy is calculated using the so-called voronoi tessellation, in which an image is divided into a set of polygons; each polygon consists of all points closer to the center of a particular droplet than to any other droplet. the voronoi entropy is then calculated as s_vor = −∑_n p_n ln p_n, where p_n is the fraction of polygons with n sides or edges [26]. in general, three scaling laws relevant to the voronoi entropy in such colloidal systems are lewis' law, the desch law, and the aboav law. lewis observed a linear relationship between the average area of a typical n-gon and n for various random 2d cellular mosaics. the desch law states a linear relationship between the perimeter of the polygons and the number of their edges, while the aboav law relates the average number of sides of the voronoi cells neighboring an n-gon to a + b/n, where a and b are constants [26].
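the voronoi entropy defined above can be computed from droplet center coordinates with standard tools. the following is a minimal sketch using scipy's voronoi tessellation; the random point set stands in for measured droplet positions, and the unbounded boundary cells are simply skipped.

```python
import numpy as np
from collections import Counter
from scipy.spatial import Voronoi

rng = np.random.default_rng(3)
points = rng.random((200, 2))             # stand-in for droplet center coordinates
vor = Voronoi(points)

# count the number of edges of each bounded Voronoi cell
edge_counts = []
for region_index in vor.point_region:
    region = vor.regions[region_index]
    if -1 in region or len(region) == 0:  # skip unbounded boundary cells
        continue
    edge_counts.append(len(region))

counts = Counter(edge_counts)
total = sum(counts.values())
s_vor = -sum((c / total) * np.log(c / total) for c in counts.values())
print("fractions of n-gons:", {n: round(c / total, 3) for n, c in sorted(counts.items())})
print("voronoi entropy S_vor = %.3f" % s_vor)
```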
a power law distribution may be expected for the probabilities of various excited states of the cluster forming a set of rearrangement configurations. it is instructive to investigate the probabilistic distributions and information content in graphs corresponding to small clusters. for the data by perry et al. [28] , the probability distribution of various configurations was plotted in figure 8 (the rank is a number of a given configuration). a calculation based on the empirical formula of the zipf law (equation 4) using the experimental data points has been done to fit those experimental points. then, a fitted curve has also been plotted in figure 8 with the data presented in figure 7 . schematic of colloidal particles forming small clusters (concept based on [28] ). for many distributions of small elements (for example, the words in a language or letters in a text), an empirical power law, such as the zipf law, is found. the zipf law is given by the formula where p(k) is the frequency of an element of rank k, a is a power exponent, and the denominator is needed for normalization. a power law distribution may be expected for the probabilities of various excited states of the cluster forming a set of rearrangement configurations. it is instructive to investigate the probabilistic distributions and information content in graphs corresponding to small clusters. for the data by perry et al. [28] , the probability distribution of various configurations was plotted in figure 8 (the rank is a number of a given configuration). a calculation based on the empirical formula of the zipf law (equation (4)) using the experimental data points has been done to fit those experimental points. then, a fitted curve has also been plotted in figure 8 with the data presented in table 1 . the value of the power exponent in the curve fitting equation is almost one, which indicates that the fitted curve is hyperbolic. see also a discussion of 2d colloidal clusters by janai et al. [30] . entropy 2020, 25, x for peer review 8 of 25 table 1 . the value of the power exponent in the curve fitting equation is almost one, which indicates that the fitted curve is hyperbolic. see also a discussion of 2d colloidal clusters by janai et al. [30] . structures. each point is representing each distinguished structure of a colloidal cluster (based on data from [28] ). the information content of a distribution is characterized by the shannon entropy, which is given by where pn is the statistical probability of the n-th state and n is the total number of states. the shannon entropy is used in materials science, for example, as a surface roughness parameter characterizing informational content in the surface given by its profile [15] . the shannon-entropy-based informational approach is also used for various other aspects of surface science, such as wetting transitions [31] and stick-slip transition [32] . using the data from figure 8 and tables 1 and 2 , the following values were obtained. for the each point is representing each distinguished structure of a colloidal cluster (based on data from [28] ). table 1 . the value of the power exponent in the curve fitting equation is almost one, which indicates that the fitted curve is hyperbolic. see also a discussion of 2d colloidal clusters by janai et al. [30] . structures. each point is representing each distinguished structure of a colloidal cluster (based on data from [28] ). 
the information content of a distribution is characterized by the shannon entropy, which is given by where pn is the statistical probability of the n-th state and n is the total number of states. the shannon entropy is used in materials science, for example, as a surface roughness parameter characterizing informational content in the surface given by its profile [15] . the shannon-entropy-based informational approach is also used for various other aspects of surface science, such as wetting transitions [31] and stick-slip transition [32] . using the data from figure 8 and tables 1 and 2, the following values were obtained. for the seven-bond configurations, the value of the shannon entropy s = 2.745 was obtained, while for the eight-bond configuration, the value of s = 3.400 was obtained. the shannon entropy provides an estimation of the information content in these configurations; in particular, one could expect that the seven-bond cluster is more random than the eight-bond cluster. to conclude this section, methods of network science can be used for the analysis of various systems studied by physical chemistry and materials science. these include granular materials, colloidal crystals, and clusters made of small particles or droplets. many such systems form sets of configurations somewhat similar to the set of symbols (e.g., letters) and characterized by power-law statistical distributions typical for the latter. the power law distribution is also characteristic for scalefree networks, which will be discussed more in detail in the following chapter. the information content of these structures can be estimated using the shannon entropy approach. table 1 . the value of the power exponent in the curve fitting equation is almost one, which indicates that the fitted curve is hyperbolic. see also a discussion of 2d colloidal clusters by janai et al. [30] . the information content of a distribution is characterized by the shannon entropy, which is given by where pn is the statistical probability of the n-th state and n is the total number of states. the shannon entropy is used in materials science, for example, as a surface roughness parameter characterizing informational content in the surface given by its profile [15] . the shannon-entropy-based informational approach is also used for various other aspects of surface science, such as wetting transitions [31] and stick-slip transition [32] . using the data from figure 8 and tables 1 and 2, the following values were obtained. for the seven-bond configurations, the value of the shannon entropy s = 2.745 was obtained, while for the eight-bond configuration, the value of s = 3.400 was obtained. the shannon entropy provides an estimation of the information content in these configurations; in particular, one could expect that the seven-bond cluster is more random than the eight-bond cluster. to conclude this section, methods of network science can be used for the analysis of various systems studied by physical chemistry and materials science. these include granular materials, colloidal crystals, and clusters made of small particles or droplets. many such systems form sets of configurations somewhat similar to the set of symbols (e.g., letters) and characterized by power-law statistical distributions typical for the latter. the power law distribution is also characteristic for scalefree networks, which will be discussed more in detail in the following chapter. 
the information content of these structures can be estimated using the shannon entropy approach. table 1 . the value of the power exponent in the curve fitting equation is almost one, which indicates that the fitted curve is hyperbolic. see also a discussion of 2d colloidal clusters by janai et al. [30] . the information content of a distribution is characterized by the shannon entropy, which is given by where pn is the statistical probability of the n-th state and n is the total number of states. the shannon entropy is used in materials science, for example, as a surface roughness parameter characterizing informational content in the surface given by its profile [15] . the shannon-entropy-based informational approach is also used for various other aspects of surface science, such as wetting transitions [31] and stick-slip transition [32] . using the data from figure 8 and tables 1 and 2, the following values were obtained. for the seven-bond configurations, the value of the shannon entropy s = 2.745 was obtained, while for the eight-bond configuration, the value of s = 3.400 was obtained. the shannon entropy provides an estimation of the information content in these configurations; in particular, one could expect that the seven-bond cluster is more random than the eight-bond cluster. to conclude this section, methods of network science can be used for the analysis of various systems studied by physical chemistry and materials science. these include granular materials, colloidal crystals, and clusters made of small particles or droplets. many such systems form sets of configurations somewhat similar to the set of symbols (e.g., letters) and characterized by power-law statistical distributions typical for the latter. the power law distribution is also characteristic for scalefree networks, which will be discussed more in detail in the following chapter. the information content of these structures can be estimated using the shannon entropy approach. table 1 . the value of the power exponent in the curve fitting equation is almost one, which indicates that the fitted curve is hyperbolic. see also a discussion of 2d colloidal clusters by janai et al. [30] . the information content of a distribution is characterized by the shannon entropy, which is given by where pn is the statistical probability of the n-th state and n is the total number of states. the shannon entropy is used in materials science, for example, as a surface roughness parameter characterizing informational content in the surface given by its profile [15] . the shannon-entropy-based informational approach is also used for various other aspects of surface science, such as wetting transitions [31] and stick-slip transition [32] . using the data from figure 8 and tables 1 and 2, the following values were obtained. for the seven-bond configurations, the value of the shannon entropy s = 2.745 was obtained, while for the eight-bond configuration, the value of s = 3.400 was obtained. the shannon entropy provides an estimation of the information content in these configurations; in particular, one could expect that the seven-bond cluster is more random than the eight-bond cluster. to conclude this section, methods of network science can be used for the analysis of various systems studied by physical chemistry and materials science. these include granular materials, colloidal crystals, and clusters made of small particles or droplets. 
To conclude this section, methods of network science can be used for the analysis of various systems studied by physical chemistry and materials science. These include granular materials, colloidal crystals, and clusters made of small particles or droplets. Many such systems form sets of configurations somewhat similar to sets of symbols (e.g., letters) and are characterized by the power-law statistical distributions typical of the latter. The power-law distribution is also characteristic of scale-free networks, which will be discussed in more detail in the following chapter. The information content of these structures can be estimated using the Shannon entropy approach.
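To make the power-law (Zipf-type) statistics concrete, the sketch below fits a rank-frequency relation p(r) ∝ r^(-α) by linear regression in log-log space; an exponent α close to one corresponds to the approximately hyperbolic (Zipf) behavior noted for the colloidal-cluster data. The frequencies below are illustrative placeholders, not the actual cluster statistics from the tables.

```python
import numpy as np

def power_law_exponent(frequencies):
    """Fit p(r) ~ r**(-alpha) to rank-ordered frequencies by least squares in log-log space."""
    freqs = np.sort(np.asarray(frequencies, dtype=float))[::-1]  # sort descending, rank 1 = most frequent
    ranks = np.arange(1, len(freqs) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope  # alpha close to 1 indicates a hyperbolic (Zipf-like) distribution

# Placeholder occurrence frequencies of cluster configurations.
counts = [120, 58, 41, 30, 25, 20, 17, 15, 13, 12]
print("fitted power exponent alpha ≈", round(power_law_exponent(counts), 2))
```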
We conclude that using the network representation for colloidal systems, such as granular materials, colloidal crystals made of small rigid particles, and levitating droplet clusters, results in scale-free behavior and in a number of important scaling relationships, such as the Zipf, Lewis, Desch, and Aboav scaling laws.

Another area closely related to colloidal science is surface science. Surface scientists and engineers often deal with parameters of surfaces such as surface roughness, surface free energy, and the water contact angle. Usually, these parameters are determined experimentally and cannot be predicted from first physical principles. A more applied branch of surface science, which deals with friction, roughness, lubrication, and adhesion, is called tribology. Tribology deals with such characteristics of contacting surfaces as the coefficient of friction and the wear rate. However, one of the challenges in tribological studies is that, while there is a large amount of experimental data on the frictional, wear, and surface properties of various materials, systems, and engineering components, this interdisciplinary area is highly empirical. Tribology remains a data-driven, inductive science. It has recently been suggested that machine learning techniques be applied to predict surface wetting properties.
Kordijazi et al. [33] applied a multilayer perceptron neural network model to study the wetting properties of a ductile iron-graphite composite, including the complex dependencies between the contact angle, composition, surface roughness, and the time of exposure to the liquid. Understanding these correlations allows the water-repellent properties of metallic composite materials to be predicted for the optimized design of novel hydrophobic and superhydrophobic materials. Artificial neural networks (ANNs) are computer models somewhat resembling the neural networks in the human brain. ANNs incorporate a series of functions that process the input data and convert them, over several stages, into the desired output. Since ANN models learn by example and training, they are suited for storing and retrieving acquired knowledge. A typical ANN model has interconnected nodes that model the neurons and synapses of the human brain. The knowledge acquired during the training process is stored in the synaptic weights of the inter-nodal connections, giving the network the ability to represent complex input-output relationships. ANNs learn by examining individual records, generating a prediction for each record, comparing the prediction with the known outcome, and making adjustments. Training makes the network more accurate in predicting new outcomes. Typically, a neural network has three parts, or layers: the input layer, one or more intermediate (hidden) layers, and the output layer (Figure 9). The input layer contains units representing the input data, such as the material composition and surface roughness, or the conditions of the experiment, for example, the size of the droplets used in the wetting experiments and the time of exposure. One or more hidden layers connect these units with varying connection weights until the results are finally delivered to the units in the output layer. The units in the hidden layer compute their activations from the input-layer data and a non-linear transfer function and transmit them to the output layer [33].
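A minimal sketch of the kind of feed-forward network described above is given below, using scikit-learn's MLPRegressor. The feature names (graphite fraction, surface roughness, droplet size, exposure time) and the synthetic data are assumptions made purely for illustration; this is not the architecture or dataset used by Kordijazi et al. [33].

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical inputs: [graphite fraction, surface roughness (um),
# droplet size (uL), exposure time (min)]; target: contact angle (deg).
rng = np.random.default_rng(0)
X = rng.uniform([0.0, 0.1, 2.0, 0.0], [0.3, 10.0, 10.0, 60.0], size=(200, 4))
# Synthetic target with an arbitrary non-linear dependence, for illustration only.
y = 90 + 80 * X[:, 0] + 2 * np.sqrt(X[:, 1]) - 0.3 * X[:, 3] + rng.normal(0, 2, 200)

# Input layer (4 units) -> two hidden layers -> single output unit,
# mirroring the input/hidden/output structure described in the text.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16, 8), activation="tanh",
                 max_iter=5000, random_state=0),
)
model.fit(X, y)
print("Predicted contact angle:", model.predict([[0.15, 3.0, 5.0, 30.0]]))
```

The pipeline standardizes the inputs before training, which is the usual practice for multilayer perceptrons whose hidden units use saturating transfer functions such as tanh.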
tribology deals with such characteristics of contacting surfaces as the coefficient of friction and the wear rate. however, one of the challenges in the tribological studies is that while there is a big amount of experimental data about the frictional, wear, and surface properties of various materials, systems, and engineering components, this interdisciplinary area is highly empirical. tribology remains a data-driven inductive science. it has been recently suggested to apply machine learning techniques in order to predict surface wetting properties. kordijazi et. al. [33] applied a multilayer perception neural network model to study the wetting properties of a ductile iron-graphite composite including complex dependencies between the contact angle, composition, surface roughness, and the time of exposure to liquid. understanding these correlations allows predicting water-repellent properties of metallic composite materials for the optimized design of novel hydrophobic and superhydrophobic materials. artificial neural networks (anns) are computer models somewhat resembling neural networks in the human brain. anns incorporate a series of functions to process the input data and convert them over several stages into the desired output. since ann models learn by examples and training, they are suited for storing and retrieving acquired knowledge. a typical ann model has interconnected nodes that model complex neurons and synapses in the human brain. the knowledge acquired during the training process is stored in the synaptic weights of the inter-nodal connections leading to the ability to represent complex input-output relationships. anns learn by examining individual records, generating the prediction for each record, comparing the result with the prediction, and making adjustments. training makes the network more accurate in predicting new outcomes. typically, a neural network has three parts or layers: the input, intermediate (hidden) layers, and output layer (figure 9 ). the input layer with units representing the input data, such as the data about the material composition and surface roughness or conditions of the experiment, for example, the size of the droplets used in wetting experiments, and the time of exposure. one or more hidden layers connect the units with varying connection weights until the results are finally delivered to the units in the output layer. the units in the hidden layer compute their activations based on the data from the input layer and a non-linear transfer function and transmit it to the output layer [33] . ann models and other machine learning techniques will likely become a popular tool for the analysis of properties of surface and colloidal systems. note that anns themselves do not possess any scaling properties relevant to the study of physical systems. however, anns represent a socalled biomimetic approach, because they attempt to mimic learning algorithms in human brains. therefore, it is of interest to investigate natural neural networks or cortical networks, and these will be discussed in the following section. let us start the discussion of scaling in biological objects from the scaling rules in branching networks, for which the typical example is the vascular system of mammals. according to the empirical allometric klieber law [1] formulated in 1932, metabolism rates, b, in various species are well approximated by a power-law scaling dependency on the mass of an animal, ∝ . . 
based on the metabolism rates, one can estimate other parameters in species including the lifespan. the value of the klieber law exponent, a = 0.75, remained mysterious until the seminal paper by west et al. [2] explained it using the fractal model of blood vessel branching. the blood vessels serve a certain volume with the simultaneous conservation of the cross-sectional area of the vessels and of the volume covered at every stage of branching. despite its shortcomings [3] [4] [5] [6] , this theory remains the main explanation of the allometric scaling exponents. according to west et al. [2] , when a tube with the length lk and radius rk branches into n tubes with the lengths lk+1= γlk and radii rk+1= βrk (where γ and β are constants), the volume served by the next-generation tubes and their cross-section area should conserve, which leads to two separate scaling relationships for the constants ∝ / and ∝ / . these relationships are satisfied simultaneously [7] . the area is preserved due to the constant rate of the fluid flow at different ann models and other machine learning techniques will likely become a popular tool for the analysis of properties of surface and colloidal systems. note that anns themselves do not possess any scaling properties relevant to the study of physical systems. however, anns represent a so-called biomimetic approach, because they attempt to mimic learning algorithms in human brains. therefore, it is of interest to investigate natural neural networks or cortical networks, and these will be discussed in the following section. let us start the discussion of scaling in biological objects from the scaling rules in branching networks, for which the typical example is the vascular system of mammals. according to the empirical allometric klieber law [1] formulated in 1932, metabolism rates, b, in various species are well approximated by a power-law scaling dependency on the mass of an animal, b ∝ m 0.75 . based on the metabolism rates, one can estimate other parameters in species including the lifespan. the value of the klieber law exponent, a = 0.75, remained mysterious until the seminal paper by west et al. [2] explained it using the fractal model of blood vessel branching. the blood vessels serve a certain volume with the simultaneous conservation of the cross-sectional area of the vessels and of the volume covered at every stage of branching. despite its shortcomings [3] [4] [5] [6] , this theory remains the main explanation of the allometric scaling exponents. according to west et al. [2] , when a tube with the length l k and radius r k branches into n tubes with the lengths l k+1 = γl k and radii r k+1 = βr k (where γ and β are constants), the volume served by the next-generation tubes and their cross-section area should conserve, which leads to two separate scaling relationships for the constants γ ∝ n −1/3 and β ∝ n −1/2 . these relationships are satisfied simultaneously [7] . the area is preserved due to the constant rate of the fluid flow at different hierarchical levels. the volume is preserved, assuming that the same volume in the organism is served by blood vessels of different hierarchical levels ( figure 10 ). on the metabolism rates, one can estimate other parameters in species including the lifespan. the value of the klieber law exponent, a = 0.75, remained mysterious until the seminal paper by west et al. [2] explained it using the fractal model of blood vessel branching. 
the blood vessels serve a certain volume with the simultaneous conservation of the cross-sectional area of the vessels and of the volume covered at every stage of branching. despite its shortcomings [3] [4] [5] [6] , this theory remains the main explanation of the allometric scaling exponents. according to west et al. [2] , when a tube with the length lk and radius rk branches into n tubes with the lengths lk+1= γlk and radii rk+1= βrk (where γ and β are constants), the volume served by the next-generation tubes and their cross-section area should conserve, which leads to two separate scaling relationships for the constants ∝ / and ∝ / . these relationships are satisfied simultaneously [7] . the area is preserved due to the constant rate of the fluid flow at different hierarchical levels. the volume is preserved, assuming that the same volume in the organism is served by blood vessels of different hierarchical levels ( figure 10 ). the total volume of fluid in the vascular system can be calculated as a sum at different levels of the hierarchy using the summation of the geometric series where v 0 is a certain elementary volume (e.g., volume served by a capillary), and n is the total number of branch generations. therefore, the volume scales as v ∝ γβ 2 −n . from this, the scaling dependency of the total number of thinnest capillaries as a function of volume is n n ∝ v a ∝ γβ 2 −na ∝ n −4/3 −na ∝ n 4na/3 yielding a = 3/4, the well-established empirical results known as the kleiber law, which is based on the assumption of a constant flow rate. the model by west et al. [2] , while simplified, is important at the conceptual level. it was suggested that the fractal scaling of the vascular system may explain the non-ergodicity of the blood flow [7] . branching in the vascular network provides a classical mechanism for estimating scaling, which can be applied to more complex neuron networks. in this section, we will review certain aspects of the current knowledge about the cortical networks in human and animal brains related to their scaling and self-organizing properties. this area of neuroscience is rapidly developing and intersects with the network science in many instances. since this area is less known to biophysicists, colloidal scientists, and engineers, we will introduce some concepts on the structure of cortical networks prior to discussing their scaling and self-organizing properties. let us start from the discussion of the architecture of the cortical network. neuron connections in the human brain constitute a very complex network of about 10 10 neurons with more than 10 14 synapses connecting between them. while it is extremely difficult to study such a complex network, a number of important insights have been achieved, particularly, since the early 2000s. this knowledge was obtained due to novel methods of in vivo observation of neural activity, including the electroencephalography (eeg), functional magnetic resonance imaging (fmri), diffusion tensor imaging (dti), two-photon excitation microscopy (tpef or 2pef), and positron emission tomography (pet). many insights were also achieved using the comparison or analogy of the human brain with the neural system of much simpler organisms, such as the nematode caenorhabditis elegans (a tiny worm with the size of less than 1 mm), which has only 302 neurons and about 6398 synapses. since a complete connectome (a map of neuron connections) and genome have been obtained and published for c. 
elegans, in 1986 [34] , this worm serves as a model organism for genetic and neurological research including, for example, 3d simulations of its motion and behavior [35] . as far as the human brain, the higher-order brain functions, such as cognition, language, sensory perception, and the generation of motor commands are associated with the neocortex. the neocortex is an external part of the brain, which is 3-4 mm thick and with the surface area of 0.26 m 2 . the neocortex is made of six distinct layers of neurons, and it consists of 10 8 cortical mini-columns with the diameter of about 50-60 µm spanning through all six layers, with about 100 neurons in each mini-column ( figure 11 ). although the functionality of the microcolumns (and their very existence) is being debated by some researchers, the columnar structure of the neocortex is widely accepted by most neuroscientists. the microcolumns are combined into the large hyper-columns or macro-columns, which are 300-500 µm in diameter. the hyper-columns have a roughly hexagonal shape, and each column is surrounded by six other columns. each hyper-column, by some estimates, may include 60-80 microcolumns [36] [37] [38] . while little is known about the processes inside the microcolumns, it is widely believed that a cortical column can process a number of inputs, and it converts them to a number of outputs using overlapping internal processing chains. each minicolumn is a complex local network that contains elements for redundancy and plasticity. the minicolumn unites vertical and horizontal cortical components, and its design has evolved specifically in the neocortex. although minicolumns are often considered highly repetitive clone-like units, they display considerable heterogeneity between cortex areas and sometimes even within a given macrocolumn. a comprehensive map of neural connections in the brain is called the connectome [39] . a connectome of the c. elegans worm has been obtained in 1986, while obtaining a human brain connectome remains a much more challenging task of the scientific discipline referred to as connectomics (compare with genome and genomics or proteome and proteomics). a much more complex connectome of the drosophila melanogaster fruit fly, a model insect used for various genetic research, whose brain contains about 135,000 neurons and 10 7 synapses, has been presumably obtained by 2020 [40] . it is believed that the human connectome can be studied at three distinct levels of structural organization: the microscale (connection of single neurons), mesoscale (cortical columns), and the macroscale (anatomical regions of interest in the brain). quantitative estimates of the brain network characteristics are remarkable. a measure of the diameter (largest geodesic distance) of the scale-free network of n nodes was suggested by bollobás and riordan [41] as d=log(n)/log(logn). according to freeman and breakspear [42] , the neocortical diameter of each hemisphere is close to d = 12, for neurons 0.5 × 10 10 neurons and 10 4 synapses per neuron yielding n = 5 × 10 13 . these numbers are consistent with the idea that at least a three-level while little is known about the processes inside the microcolumns, it is widely believed that a cortical column can process a number of inputs, and it converts them to a number of outputs using overlapping internal processing chains. each minicolumn is a complex local network that contains elements for redundancy and plasticity. 
the minicolumn unites vertical and horizontal cortical components, and its design has evolved specifically in the neocortex. although minicolumns are often considered highly repetitive clone-like units, they display considerable heterogeneity between cortex areas and sometimes even within a given macrocolumn. a comprehensive map of neural connections in the brain is called the connectome [39] . a connectome of the c. elegans worm has been obtained in 1986, while obtaining a human brain connectome remains a much more challenging task of the scientific discipline referred to as connectomics (compare with genome and genomics or proteome and proteomics). a much more complex connectome of the drosophila melanogaster fruit fly, a model insect used for various genetic research, whose brain contains about 135,000 neurons and 10 7 synapses, has been presumably obtained by 2020 [40] . it is believed that the human connectome can be studied at three distinct levels of structural organization: the microscale (connection of single neurons), mesoscale (cortical columns), and the macroscale (anatomical regions of interest in the brain). quantitative estimates of the brain network characteristics are remarkable. a measure of the diameter (largest geodesic distance) of the scale-free network of n nodes was suggested by bollobás and riordan [41] as d = log(n)/log(logn). according to freeman and breakspear [42] , the neocortical diameter of each hemisphere is close to d = 12, for neurons 0.5 × 10 10 neurons and 10 4 synapses per neuron yielding n = 5 × 10 13 . these numbers are consistent with the idea that at least a three-level hierarchy exists formed by nodes as neurons, hypercolumns, and modules. the reduction of 10 10 neurons and 10 14 synapses to a depth of three levels and a diameter of d = 12 is viewed as a simplification [42] . klimm et al. [43] estimated the quantitative characteristics of the human brain network including the hierarchy of the network and its fractal topological dimension. the hierarchy of a network, β, is defined quantitatively by the presumed power law relationship between the node degree and the local clustering coefficient (the ratio of the triangle subgraphs to the number of node triples), after discussing the architectural structure of the cortical networks, let us briefly review what is known about the formation of such a complex network, from both an ontogenetic and phylogenetic point of view. the question of to what extent the brain wiring is coded in the dna remains controversial. the human brain cortex contains at least 10 10 neurons linked by more than 10 14 synaptic connections, while the number of base pairs in a human genome is only 0.3 × 10 10 . therefore, it is impossible that the information about all synaptic connections is contained in the dna. currently, two concepts, namely, the protomap hypothesis and the radial unit hypothesis, which complement each other, are employed to explain the formation of the neo-cortex. both hypotheses were suggested by pasko rakic [44] . the protomap is a term for the original molecular "map" of the mammalian cerebral cortex with its functional areas during early embryonic development when neural stem cells are still the dominant cell type. the protomap is patterned by a system of signaling centers in the embryo, which provide information to the cells about their position and development. this process is referred to as the "cortical patterning". 
mature functional areas of the cortex, such as the visual, somatosensory, and motor areas are developed through this process. during the mammalian evolution, the area of the cortical surface has increased by more than 10 3 times, while its thickness did not change significantly. this is explained by the radial unit hypothesis of cerebral cortex development, which was first described by pasko rakic [44] [45] [46] . according to this hypothesis, the cortical expansion is the result of the increasing number of radial columnar units. the increase occurs without a significant change in the number of neurons within each column. the cortex develops as an array of interacting cortical columns or the radial units during embryogenesis. each unit originates from a transient stem cell layer. the regulatory genes control the timing and ratio of cell divisions. as a result, an expanded cortical plate is created with the enhanced capacity for establishing new patterns of connectivity that are validated through natural selection [36] . an interesting observation about the human connectome was made by kerepesi et al. [47] . by analyzing the computer data of the "budapest reference connectome", which contains macroscale connectome data for 418 individuals, they identified common parts of the connectome graphs between different individuals. it was observed that by decreasing the number of individuals possessing the common feature from 418 down to 1, more graph edges appeared. however, these new appearing edges were not random, but rather similar to a growing "shrub". the authors called the effect the consensus connectome dynamics and hypothesized that this graph growth may copy the temporal development of the connections in the human brain, so that the older connections are present in a greater number of subjects [48] . an important model was suggested recently by barabási and barabási [49] , who attempted to explain the neuronal connectivity diagram of the c. elegans nematode worm by considering neuron pairs that are known to be connected by chemical or electrical synapses. since synaptic wiring in the c. elegans is mostly invariant between individual organisms, it is believed that this wiring is genetically encoded. however, identifying the genes that determine the synaptic connections is a major challenge. barabási and barabási [49] identified a small set of transcription factors responsible for the synapses formation of specific types of neurons by studying bicliques in c. elegans' connectome. according to their model, a set of log 2 (n) transcription factors is sufficient to encode the connection, if transcription factors are combined with what they called the biological operators. it was proposed that soc plays a role in the formation of the brain neural network [18, 50, 51] . the neural connectivity is sparse at the embryonic stage. after the birth, the connectivity increases and ultimately reaches a certain critical level at which the neural activity becomes self-sustaining. the brain tissue as a collective system is at the edge of criticality. through the combination of structural properties and dynamical factors, such as noise level and input gain, the system may transit between subcritical, critical, and supercritical regimes. this mechanism is illustrated in figure 12a [51] . the network evolves toward regions of criticality or edge-of-criticality. once critical regions are established, the connectivity structure remains essentially unchanged. 
however, by adjusting the noise and/or gain levels, the system can be steered toward or away from critical regions. according to freeman and breakspear [42] , the power-law distribution of axonal length (figure 12b ) is the evidence of scale-free activity. to freeman and breakspear [42] , the power-law distribution of axonal length (figure 12b ) is the evidence of scale-free activity. self-organizing critical behavior was reported by liu et al. [52] for the organization of brain gabaa receptors (these are receptors of γ-aminobutyric acid or gaba, the major neurotransmitter). the mean size of receptor networks in a synapse followed a power-law distribution as a function of receptor concentration with the exponent 1.87 representing the fractal dimension of receptor networks. the results suggested that receptor networks tend to attract more receptors to grow into larger networks in a manner typical for soc systems that self-organize near critical states. an amazing feature of brain operations is that they are distributed, rather than localized at a particular neuron or a group of neurons. the distributed operations are performed by a collection of processing units that are spatially separate and communicate by exchanging messages. mountcastle [36] formulated the following properties of such distributed systems: • signals from one location to another may follow any of a number of pathways in the system. this provides the redundancy and resilience. • actions may be initiated at various nodal loci within a distributed system rather than at one particular spot. local lesions within a distributed system usually may degrade a function, but not eliminate it completely. the nodes are open to both externally induced and internally generated signals. various aspects of scaling behavior have been studied for networks associated with the brain, including both special and temporal structures. several neuroscientists suggested in the 2000s that self-organizing critical behavior was reported by liu et al. [52] for the organization of brain gaba a receptors (these are receptors of γ-aminobutyric acid or gaba, the major neurotransmitter). the mean size of receptor networks in a synapse followed a power-law distribution as a function of receptor concentration with the exponent 1.87 representing the fractal dimension of receptor networks. the results suggested that receptor networks tend to attract more receptors to grow into larger networks in a manner typical for soc systems that self-organize near critical states. an amazing feature of brain operations is that they are distributed, rather than localized at a particular neuron or a group of neurons. the distributed operations are performed by a collection of processing units that are spatially separate and communicate by exchanging messages. mountcastle [36] formulated the following properties of such distributed systems: • signals from one location to another may follow any of a number of pathways in the system. this provides the redundancy and resilience. • actions may be initiated at various nodal loci within a distributed system rather than at one particular spot. local lesions within a distributed system usually may degrade a function, but not eliminate it completely. the nodes are open to both externally induced and internally generated signals. various aspects of scaling behavior have been studied for networks associated with the brain, including both special and temporal structures. 
several neuroscientists suggested in the 2000s that the human brain network is both scale-free and small-world, although the arguments and evidence for these hypotheses are indirect [42, 53] , including power-law distributions of anatomical connectivity as well as the statistical properties of state transitions in the brain [54] . freeman and breakspear [42] suggested that if neocortical connectivity and dynamics are scale-free, hubs should exist for most cognitive functions, where activity and connections are at a maximum. these hubs organize brain functions at the microscopic and mesoscopic level. they are detectable by macroscopic imaging techniques such as fmri. therefore, these are hubs rather than localized functions, which are revealed by these imaging techniques. when connection density increases above a certain threshold, a scale-free network undergoes an avalanche-like abrupt transition and resynchronization almost instantaneously independent of its diameter. scale-free dynamics can explain how mammalian brains operate on the same time scales despite differences in size ranging to 10 4 (mouse to whale), as it will be discussed in more detail in consequent sections. random removals of nodes from scale-free networks have negligible effects; however, lesions of hubs are catastrophic. examples in humans are coma and parkinson's disease from small brain stem lesions [42] . avalanches are a common characteristic of brain signals, along with the so-called bursting [55] [56] [57] . neural avalanches show such characteristics as power-law distributions, which is believed to be an indication of near critical behavior [57] . thus figure 13a , redrawn from ref. [57] , suggests that the actual size distribution of neuronal avalanches is a power law and it is different from the poisson model distribution of uncorrelated activities. it is further believed that whether the avalanche occurs depends on the branching regime during the neuron connection (figure 13b ) [57] [58] [59] [60] [61] . increases above a certain threshold, a scale-free network undergoes an avalanche-like abrupt transition and resynchronization almost instantaneously independent of its diameter. scale-free dynamics can explain how mammalian brains operate on the same time scales despite differences in size ranging to 10 4 (mouse to whale), as it will be discussed in more detail in consequent sections. random removals of nodes from scale-free networks have negligible effects; however, lesions of hubs are catastrophic. examples in humans are coma and parkinson's disease from small brain stem lesions [42] . avalanches are a common characteristic of brain signals, along with the so-called bursting [55] [56] [57] . neural avalanches show such characteristics as power-law distributions, which is believed to be an indication of near critical behavior [57] . thus figure 13a , redrawn from ref. [57] , suggests that the actual size distribution of neuronal avalanches is a power law and it is different from the poisson model distribution of uncorrelated activities. it is further believed that whether the avalanche occurs depends on the branching regime during the neuron connection (figure 13b ) [57] [58] [59] [60] [61] . (a) (b) figure 13 . (a) avalanche size (log-log scale) distributions in brain shows a power-law dependency [57] (b) the activity may decrease, stay at the same level, or grow with time depending on the branching regime (based on [57] ). 
speaking of the human brain operation, it would be disappointing to avoid such an intriguing topic as what is currently known about the possible connection of brain activities to cognition. while identifying parts of the brain responsible for higher order brain activity, such as cognition or speech, figure 13 . (a) avalanche size (log-log scale) distributions in brain shows a power-law dependency [57] (b) the activity may decrease, stay at the same level, or grow with time depending on the branching regime (based on [57] ). speaking of the human brain operation, it would be disappointing to avoid such an intriguing topic as what is currently known about the possible connection of brain activities to cognition. while identifying parts of the brain responsible for higher order brain activity, such as cognition or speech, and understanding their mechanisms remains a remote (if at all solvable) task, a number of significant observations have been made in the past 20 years. some of these observations are related to the temporal scales involved. in the 1990s, b. biswal, a graduate student at the medical college of wisconsin (mcw), discovered that the human brain displays so-called "resting state connectivity", which is observable in the fmri scans [62] . the phenomenon was later called the default mode network (dmn), and it describes brain functions of a resting state. the dmn is active when a person is not focused on the outside world, and the brain is at wakeful rest, such as during daydreaming and mind-wandering. it is also active when individuals are thinking about others, thinking about themselves, remembering the past, and planning for the future. the dmn has been shown to be negatively correlated with other networks in the brain such as attention networks, although the former can be active in certain goal-oriented tasks such as social working memory or autobiographical tasks. several recent studies have concentrated upon dmn's relationship to the perception of a temporal sequence of events, as well as to its role in speech and language-related cognition. these features are of particular interest to the philosophy of mind because language, the ability to plan activities, introspection, and understanding the perspective of another person are often described as distinct characteristic features of humans, which separate them from other mammals. konishi et al. [63] investigated the hypothesis that the dmn allows cognition to be shaped by memory-stored information rather than by information in the immediate environment, or, in other words, by "past" rather than by "now". using the fmri technique, they investigated the role of the dmn when people made decisions about where a shape was, rather than where it is now. the study showed that dmn hubs are responsible for the cognition guided by information belonging to the past or to the future, instead of by immediate perceptual input. on the basis of these observations, konishi et al. [63] suggested that the dmn is employed for higher order mental activities such as imagining the past or future and considering the perspective of another person. these complex introspective activities depend on the capacity for cognition to be shaped by representations that are not present in the current external environment. in a different study, lerner et al. [64] investigated how human activities involving the integration of information on various time-scales is related to the dmn activation. 
during real-time lasting activities, such as watching a movie or engaging in conversation, the brain integrates information over multiple time scales. the temporal receptive window (the length of time before a response during which sensory information may affect that response) becomes larger when moving from low-level sensory to high-level perceptual and cognitive areas. lerner et al. [64] showed that the temporal receptive window has a hierarchical organization with levels of response to the momentary input, to the information at the sentence time scale, and to the intact paragraphs that were heard in a meaningful sequence. the researchers further hypothesized that the processing time scale is a functional property that may provide a general organizing principle for the human cerebral cortex. in a neurolinguistics study, honey et al. [65] performed fmri research to figure out whether different languages affect different patterns in neural response. they made bilingual listeners hear the same story in two different languages (english and russian). the story evoked similar brain responses, which were invariant to the structural changes across languages. this demonstrated that the human brain processes real-life information in a manner that is largely insensitive to the language in which that information is conveyed. simony et al. [66] further investigated how the dmn reconfigures to encode information about the changing environment by conducting fmri while making subjects listen to a real-life auditory narrative and to its temporally scrambled versions. the momentary configurations of dmn predicted the memory of narrative segments. these are interesting studies that may provide insights into questions such as how the natural language is related to the hypothetical language of thought, which has been postulated by some cognitive scientists and philosophers of language. the suggestion that the brain's connectome as a network possesses topological properties of fractal, scale-free, or small-world networks brings a number of interesting questions. the hierarchically organized network with the fractal dimension of d (some estimates state the value of the fractal dimension d = 3.7 ± 0.1 [43] ) is packed into the 3d cortex, forming a thin layer whose thickness is almost two orders of magnitude smaller than its two other dimensions. barabási and barabási [49] noted that in order to store the information about the exact structure of the connectome of n neurons, a neuron with k links would need k·log 2 (n) bits of information, with the total information in all neurons kn·log 2 (n). for large organisms, this would significantly exceed the information contained in dna (table 3) . thus, 3.1·10 21 bits of information would be required to characterize the human brain; for comparison, some estimates indicate that human brain memory capacity is 10 15 -10 16 bit, while the human genome has about 10 11 pairs of nucleotides. to overcome this difficulty, barabási and barabási [49] suggested a selective coding model. according to their model, the brain cannot encode each link at the genetic level. instead, selective operators are employed, which inspect the transcription factor signatures of two neurons (somewhat close in space) and facilitate or block the formation of a directed link between them. 
this can be achieved by external agents, such as glia cells, which select specific neurons and facilitate synapse formation between them or detect the combinatorial expression of surface proteins, whose protein-protein interactions catalyze synapse formation. the action of such selective operators is evidenced by the emergence of detectable network motifs in the connectome, namely, bicliques. the model by barabási and barabási [49] could provide an insight into the question of how much of the information in the structure of the brain is contained in the dna and how much is generated during the embryonal and post-embryonal development by self-organizing processes. the lower boundary of genetic information can be estimated by the number of transcription factors, t, which encode the identity of a neuron times the total number of neurons. in this section, we will review current experimental data about the scaling properties of cortical networks related to their spatial and temporal organization and their informational content from the entropic viewpoint. there are several approaches to what constitutes a "time constant" for the brain and how this time constant (i.e., the rate of neural processes) scales with the size of an animal. brain networks show a number of remarkable properties. one important experimental observation is that despite the difference in their size by 10 4 times from a mouse to a whale, mammalian brains tend to operate at almost the same time scales. this can be called the law of conservation of the characteristic time scale. there are two approaches to the characterization of the time scale of brain activity of different creatures: studying brain waves (rhythms of oscillation) and investigations of the critical flicker fusion (cff) thresholds. the cff is defined as the frequency of an intermittent light, at which the light appears steady to a human or animal observer (similar to frames in the cinema). it has been hypothesized that the ability of an animal to change their body position or orientation (manoeuvrability) is related to the ability to resolve temporal features of the environment and eventually to the cff [67] . manoeuvrability usually decreases with increasing body mass. buzsáki et al. [68] reviewed the temporal scaling properties of cortical networks by studying the hierarchical organization of brain rhythms in mammals. the brain is known to generate electromagnetic oscillations of different frequencies, which can be observed with the eeg. while the exact nature and function of these oscillations remains debatable, they are highly reproducible and classified by their frequencies: alpha waves (7-15 hz), beta waves (15) (16) (17) (18) (19) (20) (21) (22) (23) (24) (25) (26) (27) (28) (29) (30) , gamma waves (>30 hz), and others. the frequency of brain oscillations covers almost five orders of magnitude, from <0.01 hz to hundreds of hz. the power distribution of brain oscillations tends to show the 1/f n (where f is the frequency and n is the power exponent) noise spectrum when measured at long-scale ranges. as we have discussed in the preceding sections, such a distribution spectrum is a signature of soc. however, when specific brain activities are considered, such as concentrating on particular features, moving or orienting in space or various cognitive functions, particular oscillation frequencies become dominant, and the spectrum deviates from the 1/f or 1/f n statistics, showing peaks at some characteristic frequencies. 
these frequencies of various rhythm classes do not vary significantly with the size of the brain [68] . as far as modeling the origin of oscillations, several dynamic models have been employed to study brain rhythms at various scales and, in particular, at the mesoscale. thus, the so-called freeman's k-sets model follows the katzir-katchalsky suggested treatment of cell assemblies using network thermodynamics. hierarchical models of several layers of these sets, from k-i to k-v, provide a modeling tool to conduct analyses of the unifying actions of neocortex related to intentional and cognitive behaviors [42] . the correlation of spatial and temporal brain activity organization was studied by honey et al. [65] . they related the spontaneous cortical dynamics to the underlying anatomical connectivity at multiple temporal scales. structural (spatial) hubs corresponded to the hubs of the long-run (minutes) neural activity. for the activities with shorter characteristic time (seconds), significant deviations from special hubs were observed. at the shorter time scale (fraction of a second), individual episodes of interregional phase-locking were detected. the critical flicker fusion threshold is viewed by many researchers as the time scale (or frequency) at which the brain operates. the higher frequency of the cff threshold implies the brain's ability to discern signals and react faster. healy et al. [67] studied the effect of body mass and metabolic rates of various species on their cff threshold ( figure 14) . the comparative metabolic rates are determined separately by measuring oxygen consumption through ventilation [69] . it is expected that smaller animals have higher temporal resolution. larger animals respond to a stimulus slower than the smaller animals. therefore, high temporal resolution in larger animals is unnecessary. on the other hand, faster and more manoeuvrable fly species have higher temporal resolutions [70] , which makes it, for example, so difficult for a human to catch a fly. note that the mass-specific metabolic rates are almost constant (or, more accurate to say, lie within a certain relatively narrow range) for different life forms with 20 orders of magnitude difference in the body mass [68] . furthermore, the accuracy of the 3/4-power allometric scaling kleiber law for metabolic rate is not considered universal by some scholars [4] . range) for different life forms with 20 orders of magnitude difference in the body mass [68] . furthermore, the accuracy of the 3/4-power allometric scaling kleiber law for metabolic rate is not considered universal by some scholars [4] . we conclude that the characteristic frequencies of brain activity are either almost constant or slightly decrease with increasing body and brain size. (a) (b) figure 14 . the effect of (a) body mass (gram) and (b) temperature-corrected mass-specific resting metabolic rate (qwg) on the critical flicker fusion (cff) shows that the cff increases with the metabolic rate but decreases with body mass (based on [67] ). figure 14 . the effect of (a) body mass (gram) and (b) temperature-corrected mass-specific resting metabolic rate (qwg) on the critical flicker fusion (cff) shows that the cff increases with the metabolic rate but decreases with body mass (based on [67] ). we conclude that the characteristic frequencies of brain activity are either almost constant or slightly decrease with increasing body and brain size. 
the allometric approach can be applied to the inter-species analysis of brain activity frequencies and time scales in both humans and various species. the observation that typical frequencies of brain activities are often independent of brain size requires an explanation. when brains of various species are compared, the distance between homologous regions of the brain increases with the growing size of the brain. moreover, the number and length of axons increase even more rapidly than the number of neurons with the growing brain size. consequently, the average number of synaptic connections in the shortest path between two neurons (the "synaptic path length") also grows in a very fast manner. assuming that the cortical network is a scale-free and/or a small-world network would decrease the path length; however, it does not eliminate completely the scale dependency. the mechanisms that facilitate the increase of signal speed in larger species should be investigated. such an investigation was suggested by buzsáki et al. [68] . they used experimental data about the scaling of the brain, in particular, those presented by wang et al. [71] on the size of axons and the amount of white matter in the brain. these are parameters that affect the signal speed. the experimental observations indicate that the increase in axon caliber (size) and their insulation (myelination) compensates for the increased time of signal transfer. the volume of the myelin or white matter in relation to the gray matter (neurons) in the brain scales as power 4/3 with the size. for instance, while the white matter constitutes 6% of the neocortical volume in hedgehogs, in humans, it exceeds 40% of brain volume. this is because a larger brain requires faster signal propagation, and the degree of myelination in larger brains increases. the velocity of propagation of the neural signal is another significant parameter. buzsáki et al. [68] noted that for the phase synchronization of gamma waves in a mouse (brain size approximately 5-10 mm), the speed of conduction of 5 m/s is sufficient, while for humans (70-140 mm) , much larger conduction speeds are needed [60] . while an increase in conduction velocity can be achieved by both the increase of the volume fraction of white matter and of the axon size, there are several problems associated with this approach. an increase of axon diameter proportional to the size of the brain would enormously increase the brain volume. experimental observation suggests that, apparently, only a small fraction of all axons have large diameters. particularly, the largest axons scale linearly with the brain size (figure 15 ), and they result in an increase of the connection speed [68, 71] . at the same time, the required increase of the white matter volume in larger brains does not lead to an unreasonable growth of the volume, because only a small fraction of all axons is large. consequently, despite a 17,000-fold difference in mammalian brain volumes, the oscillation rhythms with their typical time scales independent of the brain size are supported by the same mechanisms, and they still have the same typical frequencies. = const (10) which can be presented as = = * ln( ) . equation 10 relates the ratio of white-to-gray matter to the characteristic size of the brain. the relationship is plotted in figure 16 for the value of d = 3.2, for several values of the exponent d. the experimental data, based on [71] , for several animals are also presented. 
it is seen that the best agreement with equation 11 is for 2 < d < 3, which is consistent with the concept that the growth of the white matter content compensates for the increasing linear size of the brain. (a) (b) (c) (d) figure 15 . interspecies scaling relations in the brain (based on [68] and [71] ). (a) cross-brain conduction times for myelinated axons; (b) the fraction of myelinated axons; (c) the fraction of volume filled by axons; (d) distribution of axon densities. figure 16 . scaling relationship between the brain diameter (cm) and the ratio of white and gray matter. an interesting spin-off of the network approach to the neural science has been developed in the field of immunology, where niels jerne [72] and geoffrey hoffmann [73] have suggested the socalled immune network theory (int). according to this theory, the immune system of a human or of an animal is a network of lymphocytes (immune cells) and molecules (antibodies, such as immunoglobulins), which interact with each other. an invasion of a foreign antigen a (which may be a virus, microbe, protein, or even an inorganic compound) activates immune cells and molecules the number of nodes, n, scales proportionally to the power d of the characteristic length size of the brain, l, as n ∼ l d where d is the fractal dimension of the network. the velocity of neural signal propagation is dependent on the ratio w = w/g of the volume fractions of the white matter, w, to the gray matter, g, as a power function, as we have discussed in the preceding sections, in small-world networks, the distance between two random nodes is proportional to the logarithm of the total number of nodes, l ∼ ln(n) ∼ d· ln(l). the rate of neural processes, i.e., the time scale of the brain activity is related to the size of an animal. however, given the experimental observation that the rate is almost a constant independent of the size of the brain, one can assume that the volume fraction of white-to-gray matter affects the velocity of the signal from equation (9), it follows immediately that d· ln(l) which can be presented as equation (10) relates the ratio of white-to-gray matter to the characteristic size of the brain. the relationship is plotted in figure 16 for the value of d = 3.2, for several values of the exponent d. the experimental data, based on [71] , for several animals are also presented. it is seen that the best agreement with equation (11) is for 2 < d < 3, which is consistent with the concept that the growth of the white matter content compensates for the increasing linear size of the brain. (a) (b) (c) (d) figure 15 . interspecies scaling relations in the brain (based on [68] and [71] ). (a) cross-brain conduction times for myelinated axons; (b) the fraction of myelinated axons; (c) the fraction of volume filled by axons; (d) distribution of axon densities. scaling relationship between the brain diameter (cm) and the ratio of white and gray matter. an interesting spin-off of the network approach to the neural science has been developed in the field of immunology, where niels jerne [72] and geoffrey hoffmann [73] have suggested the socalled immune network theory (int). according to this theory, the immune system of a human or of an animal is a network of lymphocytes (immune cells) and molecules (antibodies, such as immunoglobulins), which interact with each other. an invasion of a foreign antigen a (which may be a virus, microbe, protein, or even an inorganic compound) activates immune cells and molecules figure 16 . 
scaling relationship between the brain diameter (cm) and the ratio of white and gray matter. an interesting spin-off of the network approach to the neural science has been developed in the field of immunology, where niels jerne [72] and geoffrey hoffmann [73] have suggested the so-called immune network theory (int). according to this theory, the immune system of a human or of an animal is a network of lymphocytes (immune cells) and molecules (antibodies, such as immunoglobulins), which interact with each other. an invasion of a foreign antigen a (which may be a virus, microbe, protein, or even an inorganic compound) activates immune cells and molecules anti-a, which, in turn, activate anti-anti-a, and so on. the nodes of the network are immune cells, antibodies, and antigens, while the edges are interactions between them. therefore, the reaction of the immune system on the antigen is somewhat similar to the reaction of a network upon a stimulus applied to one of its nodes: it may be a local perturbation or an avalanche-type response affecting many nodes of the system. according to jerne, there are many similarities between the immune system and the central nervous system. the numbers of cells in both systems are of the same order, 10 10 -10 11 . the mass of immune cells in a human is on the order of 1 kg, which is somewhat comparable with the brain mass, at least, by the order of magnitude. both the immune and nervous systems respond to external stimuli, and they both possess memory [73] . the int theory was suggested in the 1970s. although jerne became a winner of the 1984 nobel prize for his work in immunology, the int remains a hypothesis that may require further experimental validation. jerne apparently sought an analogy of the int not only with the principals of cortical network organization, but also with the rules, which govern human language, such as the chomskian concept of the generative grammar. jerne's nobel lecture is entitled "the generative grammar of the immune system" [72] . since the turn of the 21st century, there have been attempts to investigate the scaling properties of the immune network, including their scale-free and fractal properties [74, 75] . this is often conducted in the context of a broader area of the protein networks, since the interactions between antigens and antibodies or immune cells are bio-specific (ligand-receptor) protein-protein interactions [76] . conceptually, these efforts are supposed to give insights, for example, on autoimmune diseases. one can hypothesize that, similarly to the soc sand-pile model, where the addition of a sand grain usually has a local effect, but sometimes it can cause avalanches, the reaction of the immune network to an external stimulus (a pathogenic antigen) can be an immune response or a catastrophic series of events leading to a disease. however, the int still largely misses a connection with experimental science. the term "immunome" (and therefore, the area of "immunomics") has already been suggested [77] as an analogy for the genome (and genomics), proteome (and proteomics), and connectome (and connectomics). as far as the protein networks, it has been shown that hydrophobic interactions during protein folding result in the soc mechanisms that can explain scaling properties and power-law exponents (e.g., size-volume dependencies) in proteomics [78] . 
the relation of hydrophobic interactions and soc has been established [14], and similar considerations have been used in materials science, in particular, for the design of anti-icing coatings [79]. networks of cortical neurons in the brain are a source of inspiration for the area of biomimetic ann. computational algorithms inspired by the int have been suggested as well [80]. according to the clonal selection theory of immunity, when a pathogen invades the organism, immune cells (lymphocytes) that recognize these pathogens proliferate, yielding effector cells, which secrete antibodies (immunoglobulin proteins), and memory cells. the multitude of available immune cells is explained as a result of somatic mutations with high rates, while pathogens drive their selection force. a rough model of this process is used as a basis for genetic computational algorithms called artificial immune systems (ais) algorithms [81]. methods of network science and information theory can be used for the analysis of diverse types of physicochemical and biological systems. the systems reviewed in this article include granular materials, droplet clusters, colloidal crystals, artificial neural networks, biological vascular networks, cortical networks of neurons connected by synapses, and immune networks. scaling, topology, and dimensional analysis can be applied to networks that describe these different physical and biophysical systems. some of these networks possess fractal, scale-free, and small-world scaling properties. others exhibit different types of scaling relationships, often involving power laws. the amount of information contained in a network can be found by calculating its shannon entropy. we discussed the properties of colloidal networks arising from small granular and colloidal particles and from droplet clusters due to pairwise interactions between the particles. many networks found in colloidal science possess self-organizing properties due to the effect of percolation. the force networks in packed granular material leading to the jamming transition are a typical example. other systems may exhibit self-organized criticality. they are brought to a critical point by a combination of slow motion and a dissipation mechanism, which can balance out. these systems have critical states, which have distinct signatures of fractal dimensions and power laws (often a one-over-frequency spectrum), although, of course, not every system with a one-over-frequency spectrum is an soc system. colloidal systems exhibit various scaling relationships, including the fractal (scale-free), zipf, lewis, desch, and aboav scaling laws. branching vascular systems demonstrate the allometric power exponent of 3/4. this power exponent in such systems is explained by a fractal model of branching with the simultaneous conservation of the volume served by blood vessels at different levels and the flow rate. then, we discussed the much more complex networks of neurons, which are organized in the neocortex in a hierarchical manner, forming micro- and macro-columns. the scaling relationships in these networks suggest that the characteristic time constant is independent of brain size when an interspecies comparison is conducted. this is because the increased diameter of the network is compensated by the increasing velocity of the signal due to myelination (the insulation of neurons by the white matter). the characteristic time constant can be defined in terms of the frequency of different types of brain waves or as the cff threshold.
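the remark above that the amount of information contained in a network can be found by calculating its shannon entropy can be illustrated directly. the sketch below computes the shannon entropy of the degree distribution of a graph; this is only one of several network entropies used in the literature, and the two example graphs are arbitrary random graphs rather than any of the systems reviewed here.

```python
import numpy as np
import networkx as nx

def degree_distribution_entropy(graph):
    """Shannon entropy (in bits) of the degree distribution of a graph."""
    degrees = np.array([deg for _, deg in graph.degree()])
    values, counts = np.unique(degrees, return_counts=True)
    p = counts / counts.sum()                 # empirical probability of each degree value
    return float(-np.sum(p * np.log2(p)))     # H = -sum_k p_k log2 p_k

if __name__ == "__main__":
    g_random = nx.erdos_renyi_graph(n=1000, p=0.01, seed=1)        # narrow degree distribution
    g_scale_free = nx.barabasi_albert_graph(n=1000, m=5, seed=1)   # broad degree distribution
    print("ER graph entropy:", round(degree_distribution_entropy(g_random), 3))
    print("BA graph entropy:", round(degree_distribution_entropy(g_scale_free), 3))
```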
the brain networks possess many characteristics typical of other networks, including the one-over-frequency and power-law activities, avalanches, and small-world, scale-free, and fractal topography. it is particularly interesting to look for the correlation between the spatial distribution (for example, hubs) and temporal organization (frequency spectrum) of human brain cognitive activities. such research is being conducted by many groups, for example, the study of the dmn during such activities as the comprehension of a text in a natural language versus contemplating it (the "language of thought"). the information content of the neural networks can be studied using the standard characteristics of information theory, such as the shannon entropy. it may provide ways to distinguish between dna-encoded information and information generated during the embryonal and post-embryonal development, which may be driven by the self-organizing process. for engineers and computer scientists, neural networks serve as a source of inspiration for artificial neural networks, which can serve as a means of machine learning and of establishing correlations in data-rich areas, such as surface engineering. cortical networks also served as a source of inspiration for the concept of the immune network, which, in turn, became an inspiration for artificial immune systems algorithms in computer science. both network science and brain physiology are dynamic, rapidly expanding fields. new approaches are likely to emerge. for example, in the study of small droplet clusters, unusual properties, such as the applicability of dynkin diagrams to colloidal studies, have been suggested. a huge amount of information has been obtained in recent studies about both the structure and properties of biological and artificial networks. there are still many questions and problems remaining, such as obtaining connectomes of various species, understanding the relation between the genomic information and the organization of cortical networks, and the internal organization of the latter. body size and metabolism a general model for the origin of allometric scaling laws in biology is west, brown and enquist's model of allometric scaling mathematically correct and biologically relevant? beyond the "3/4-power law": variation in the intra-and interspecific scaling of metabolic rate in animals demystifying the west, brown & enquist model of the allometry of metabolism a general basis for quarter-power scaling in animals allometric scaling law and ergodicity breaking in the vascular system self-similarity, and intermediate asymptotics clustering and self-organization in small-scale natural and artificial systems statistical mechanics of complex networks emergence of scaling in random networks on random graphs the physics of networks do hierarchical mechanisms of superhydrophobicity lead to self-organized criticality? friction-induced vibrations and self-organization: mechanics and non-equilibrium thermodynamics of sliding contact resilience of the internet to random breakdowns collective dynamics of 'small-world' networks network robustness and fragility: percolation on random graphs contact force measurements and stress-induced anisotropy in granular materials what determines the static force chains in stressed granular media?
apollonian networks: simultaneously scale-free, small world, euclidean, space filling, and with matching graphs small levitating ordered droplet clusters: stability, symmetry, and voronoi entropy langevin approach to modeling of small levitating ordered droplet clusters droplet clusters: nature-inspired biological reactors and aerosols characterization of self-assembled 2d patterns with voronoi entropy symmetry of small clusters of levitating water droplets two-dimensional clusters of colloidal spheres: ground states, excited states, and structural rearrangements cluster formation by acoustic forces and active fluctuations in levitated granular matter non-crystalline colloidal clusters in two dimensions: size distributions and shapes logical and information aspects in surface science: friction, capillarity, and superhydrophobicity ternary logic of motion to resolve kinematic frictional paradoxes machine learning methods to predict wetting properties of iron-based composites the structure of the nervous system of the nematode caenorhabditis elegans: the mind of a worm three-dimensional simulation of the caenorhabditis elegans body and muscle cells in liquid and gel environments for behavioural analysis the columnar organization of the neocortex the minicolumn hypothesis in neuroscience the cortical column: a structure without a function ome sweet ome: what can the genome tell us about the connectome? the diameter of the scale-free random graph scale-free neocortical dynamics resolving structural variability in network models and the brain specification of cerebral cortical areas a small step for the cell, a giant leap for mankind: a hypothesis of neocortical expansion during evolution evolution of the neocortex: a perspective from developmental biology how to direct the edges of the connectomes: dynamics of the consensus connectomes and the development of the connections in the human brain the robustness and the doubly preferential attachment simulation of the consensus connectome dynamics of the human brain a genetic model of the connectome neuro percolation: a random cellular automata approach to spatio-temporal neuro dynamics phase transitions in the neuro percolation model of neural populations with mixed local and non-local interactions mesophasic organization of gabaa receptors in hippocampal inhibitory synapse the small world of the cerebral cortex fine spatiotemporal structure of phase in human intracranial eeg neuronal avalanches in neocortical circuits neuronal avalanche phase transitions in scale-free neural networks: departure from the standard mean-field universality class using expression profiles of caenorhabditis elegans neurons to identify genes that mediate synaptic connectivity dynamics of a neural system with a multiscale architecture statistics and geometry of neuronal connectivity functional connectivity in the motor cortex of resting human brain using echoplanar mri shaped by the past: the default mode network supports cognition that is independent of immediate perceptual input topographic mapping of a hierarchy of temporal receptive windows using a narrated story not lost in translation: neural responses shared across languages dynamic reconfiguration of the default mode network during narrative comprehension metabolic rate and body size are linked with perception of temporal information scaling brain size, keeping timing: evolutionary preservation of brain rhythms mean mass-specific metabolic rates are strikingly similar across life's major domains: evidence for 
life's metabolic optimum photomechanical responses in drosophila photoreceptors functional trade-offs in white matter axonal scaling the generative grammar of the immune system nobel lecture the fractal immune network fractal immunology and immune patterning: potential tools for immune protection and optimization fractal proteins studying the human immunome: the complexity of comprehensive leukocyte immunophenotyping hydropathic self-organized criticality: a magic wand for protein physics anti-icing superhydrophobic surfaces: controlling entropic molecular interactions to design novel icephobic concrete a neural network model based on the analogy with the immune system an introduction to artificial immune systems: a new computational intelligence paradigm key: cord-011400-zyjd9rmp authors: peixoto, tiago p. title: network reconstruction and community detection from dynamics date: 2019-09-18 journal: nan doi: 10.1103/physrevlett.123.128301 sha: doc_id: 11400 cord_uid: zyjd9rmp we present a scalable nonparametric bayesian method to perform network reconstruction from observed functional behavior that at the same time infers the communities present in the network. we show that the joint reconstruction with community detection has a synergistic effect, where the edge correlations used to inform the existence of communities are also inherently used to improve the accuracy of the reconstruction which, in turn, can better inform the uncovering of communities. we illustrate the use of our method with observations arising from epidemic models and the ising model, both on synthetic and empirical networks, as well as on data containing only functional information. the observed functional behavior of a wide variety largescale system is often the result of a network of pairwise interactions. however, in many cases, these interactions are hidden from us, either because they are impossible to measure directly, or because their measurement can be done only at significant experimental cost. examples include the mechanisms of gene and metabolic regulation [1] , brain connectivity [2] , the spread of epidemics [3] , systemic risk in financial institutions [4] , and influence in social media [5] . in such situations, we are required to infer the network of interactions from the observed functional behavior. researchers have approached this reconstruction task from a variety of angles, resulting in many different methods, including thresholding the correlation between time series [6] , inversion of deterministic dynamics [7] [8] [9] , statistical inference of graphical models [10] [11] [12] [13] [14] and of models of epidemic spreading [15] [16] [17] [18] [19] [20] , as well as approaches that avoid explicit modeling, such as those based on transfer entropy [21] , granger causality [22] , compressed sensing [23] [24] [25] , generalized linearization [26] , and matching of pairwise correlations [27, 28] . in this letter, we approach the problem of network reconstruction in a manner that is different from the aforementioned methods in two important ways. first, we employ a nonparametric bayesian formulation of the problem, which yields a full posterior distribution of possible networks that are compatible with the observed dynamical behavior. second, we perform network reconstruction jointly with community detection [29] , where, at the same time as we infer the edges of the underlying network, we also infer its modular structure [30] . 
as we will show, while network reconstruction and community detection are desirable goals on their own, joining these two tasks has a synergistic effect, whereby the detection of communities significantly increases the accuracy of the reconstruction, which in turn improves the discovery of the communities, when compared to performing these tasks in isolation. some other approaches combine community detection with functional observation. berthet et al. [31] derived necessary conditions for the exact recovery of group assignments for dense weighted networks generated with community structure given observed microstates of an ising model. hoffmann et al. [32] proposed a method to infer community structure from time-series data that bypasses network reconstruction by employing a direct modeling of the dynamics given the group assignments, instead. however, neither of these approaches attempts to perform network reconstruction together with community detection. furthermore, they are tied down to one particular inverse problem, and as we will show, our general approach can be easily extended to an open-ended variety of functional models. bayesian network reconstruction.-we approach the network reconstruction task similarly to the situation where the network edges are measured directly, but via an uncertain process [33, 34]: if d is the measurement of some process that takes place on a network, we can define a posterior distribution for the underlying adjacency matrix a via bayes' rule, p(a|d) = p(d|a) p(a) / p(d), where p(d|a) is an arbitrary forward model for the dynamics given the network, p(a) is the prior information on the network structure, and p(d) = Σ_a p(d|a) p(a) is a normalization constant comprising the total evidence for the data d. we can unite reconstruction with community detection via an, at first, seemingly minor, but ultimately consequential modification of the above equation, where we introduce a structured prior p(a|b), where b represents the partition of the network into communities, i.e., b = {b_i}, where b_i ∈ {1, …, b} is the group membership of node i. this partition is unknown and is inferred together with the network itself, via the joint posterior distribution p(a, b|d) = p(d|a) p(a|b) p(b) / p(d). the prior p(a|b) is an assumed generative model for the network structure. in our work, we will use the degree-corrected stochastic block model (dc-sbm) [35], which assumes that, besides differences in degree, nodes belonging to the same group have statistically equivalent connection patterns, according to a joint probability in which λ_rs determines the average number of edges between groups r and s and κ_i the average degree of node i. the marginal prior is obtained by integrating over all remaining parameters weighted by their respective prior distributions, which can be computed exactly for standard prior choices, although it can be modified to include hierarchical priors that have an improved explanatory power [36] (see supplemental material [37] for a concise summary). the use of the dc-sbm as a prior probability in eq. (2) is motivated by its ability to inform link prediction in networks where some fraction of edges have not been observed or have been observed erroneously [34, 39]. the latent conditional probabilities of edges existing between groups of nodes are learned by the collective observation of many similar edges, and these correlations are leveraged to extrapolate the existence of missing or spurious ones.
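the dc-sbm prior described above can be made concrete with a small generative sketch. the snippet below samples a network in which the expected number of edges between nodes i and j is κ_i κ_j λ_{b_i b_j}, a common poisson parametrization of the degree-corrected sbm; the group sizes, the propensities κ, and the matrix λ used here are arbitrary illustrative choices, not quantities inferred in the paper.

```python
import numpy as np

def sample_dcsbm(b, kappa, lam, seed=0):
    """Sample a symmetric adjacency matrix from a Poisson DC-SBM.

    b     : array of group labels, one per node
    kappa : array of degree propensities, one per node
    lam   : matrix lam[r, s] controlling edge density between groups r and s
    """
    rng = np.random.default_rng(seed)
    n = len(b)
    a = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            mean = kappa[i] * kappa[j] * lam[b[i], b[j]]  # expected edge count for the pair
            a[i, j] = a[j, i] = rng.poisson(mean)
    return a

if __name__ == "__main__":
    n, groups = 60, 2
    b = np.repeat(np.arange(groups), n // groups)           # two equal-sized groups
    kappa = np.random.default_rng(1).uniform(0.5, 1.5, n)   # heterogeneous degree propensities
    lam = np.array([[0.30, 0.02],
                    [0.02, 0.30]])                           # assortative group structure
    a = sample_dcsbm(b, kappa, lam)
    print("edges within group 0:", a[:30, :30].sum() // 2)
    print("edges between groups:", a[:30, 30:].sum())
```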
the same mechanism is expected to aid the reconstruction task, where edges are not observed directly, but the observed functional behavior yields a posterior distribution on them, allowing the same kind of correlations to be used as an additional source of evidence for the reconstruction, going beyond what the dynamics alone says. our reconstruction approach is finalized by defining an appropriate model for the functional behavior, determining p(d|a). here, we will consider two kinds of indirect data. the first comes from a susceptible-infected-susceptible (sis) epidemic spreading model [40], where σ_i(t) = 1 means node i is infected at time t, and 0 otherwise. the likelihood for this model is a product over nodes and time steps of the transition probabilities for each node i at time t, written in terms of f(p, σ) = (1 − p)^σ p^(1−σ), where m_i(t) = Σ_j a_ij ln(1 − τ_ij) σ_j(t) is the contribution from all neighbors of node i to its infection probability at time t. in the equations above, the value τ_ij is the probability of an infection via an existing edge (i, j), and γ is the 1 → 0 recovery probability. with these additional parameters, the full posterior distribution for the reconstruction becomes p(a, b, τ|σ) ∝ p(σ|a, τ) p(a|b) p(b) p(τ). since τ_ij ∈ [0, 1], we use the uniform prior p(τ) = 1. note also that the recovery probability γ plays no role in the reconstruction algorithm, since its term in the likelihood does not involve a [and, hence, gets cancelled out in the denominator p(σ|γ) = p(γ|σ)p(σ)/p(γ)]. this means that the above posterior only depends on the infection events 0 → 1 and, thus, is also valid without any modifications for all epidemic variants, susceptible-infected (si), susceptible-infected-recovered (sir), susceptible-exposed-infected-recovered (seir), etc. [40], since the infection events occur with the same probability for all these models. the second functional model we consider is the ising model, where spin variables on the nodes s ∈ {−1, 1}^n are sampled according to the joint distribution p(s|a, β, j, h) = exp(β Σ_{i<j} a_ij j_ij s_i s_j + β Σ_i h_i s_i) / z(a, β, j, h), where β is the inverse temperature, j_ij is the coupling on edge (i, j), h_i is a local field on node i, and z(a, β, j, h) is the corresponding partition function (the sum of the same exponential over all spin configurations s). a simple baseline is to threshold the matrix of pairwise correlations, keeping an edge (i, j) only if c_ij > c*. the value of c* was chosen to maximize the posterior similarity, which represents the best possible reconstruction achievable with this method. nevertheless, the network thus obtained is severely distorted. the inverse correlation method comes much closer to the true network, but is superseded by the joint inference with community detection. empirical dynamics.-we turn to the reconstruction from observed empirical dynamics with unknown underlying interactions. the first example is the sequence of m = 619 votes of n = 575 deputies in the 2007 to 2011 session of the lower chamber of the brazilian congress. each deputy voted yes, no, or abstained for each piece of legislation, which we represent as {1, −1, 0}, respectively. since the temporal ordering of the voting sessions is likely to be of secondary importance to the voting outcomes, we assume the votes are sampled from an ising model [the addition of zero-valued spins changes eq. (9) only slightly, by replacing 2 cosh(x) → 1 + 2 cosh(x)]. figure 4 shows the result of the reconstruction, where the division of the nodes uncovers a cohesive government and a split opposition, as well as a marginal center group, which correlates very well with the known party memberships and can be used to predict unseen voting behavior (see supplemental material [37] for more details).
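the sis likelihood above can be written down directly in code. the sketch below evaluates the log-probability of the observed infection events given a candidate adjacency matrix and infection probabilities τ, using the escape probability exp(m_i(t)) = Π_j (1 − τ_ij)^(a_ij σ_j(t)) for susceptible nodes; recovery events are omitted since, as noted above, they do not depend on a. the network, the τ values, and the simulated time series are illustrative stand-ins, not the data used in the paper.

```python
import numpy as np

def sis_infection_loglik(a, tau, sigma):
    """Log-likelihood of the 0 -> 1 infection events in a SIS time series.

    a     : (n, n) adjacency matrix (0/1)
    tau   : (n, n) matrix of per-edge infection probabilities
    sigma : (T, n) array of node states, sigma[t, i] in {0, 1}
    """
    log_escape = a * np.log(1.0 - tau + 1e-12)     # a_ij * ln(1 - tau_ij)
    loglik = 0.0
    for t in range(sigma.shape[0] - 1):
        m = log_escape @ sigma[t]                  # m_i(t) = sum_j a_ij ln(1 - tau_ij) sigma_j(t)
        p_stay = np.exp(m)                         # prob. a susceptible node stays susceptible
        susceptible = sigma[t] == 0
        infected_next = sigma[t + 1] == 1
        p = np.where(infected_next, 1.0 - p_stay, p_stay)
        loglik += np.sum(np.log(p[susceptible] + 1e-12))
    return loglik

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, T = 50, 200
    a = (rng.random((n, n)) < 0.08).astype(int)
    a = np.triu(a, 1); a = a + a.T                 # symmetric illustrative network
    tau = 0.3 * a                                  # uniform infection probability on edges
    sigma = np.zeros((T, n), dtype=int); sigma[0, :5] = 1
    for t in range(T - 1):                         # simulate SIS forward (recovery prob. 0.2)
        m = (a * np.log(1 - tau + 1e-12)) @ sigma[t]
        infect = (sigma[t] == 0) & (rng.random(n) > np.exp(m))
        recover = (sigma[t] == 1) & (rng.random(n) < 0.2)
        sigma[t + 1] = np.where(infect, 1, np.where(recover, 0, sigma[t]))
    print("log-likelihood under the true network:", round(sis_infection_loglik(a, tau, sigma), 2))
```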
in fig. 5, we show the result of the reconstruction of the directed network of influence between n = 1833 twitter users from 58224 retweets [50], using an si epidemic model (the act of "retweeting" is modeled as an infection event, using eqs. (5) and (6) with γ = 0) and the nested dc-sbm. the reconstruction uncovers isolated groups with varying propensities to retweet, as well as groups that tend to influence a large fraction of users. by inspecting the geolocation metadata on the users, we see that the inferred groups amount, to a large extent, to different countries, although clear subdivisions indicate that this is not the only factor governing the influence among users (see supplemental material [37] for more details). conclusion.-we have presented a scalable bayesian method to reconstruct networks from functional observations that uses the sbm as a structured prior and, hence, performs community detection together with reconstruction. the method is nonparametric and, hence, requires no prior stipulation of aspects of the network and size of the model, such as the number of groups. by leveraging inferred correlations between edges, the sbm includes an additional source of evidence and, thereby, improves the reconstruction accuracy, which in turn also increases the accuracy of the inferred communities. the overall approach is general, requiring only appropriate functional model specifications, and can be coupled with an open-ended variety of such models other than those considered here.
figure 5 caption (fragment): see [51, 52] for details on the layout algorithm; the edge colors indicate the infection probabilities τ_ij as shown in the legend, and the text labels show the dominating country membership for the users in each group.
inferring gene regulatory networks from multiple microarray datasets dynamic models of large-scale brain activity estimating spatial coupling in epidemiological systems: a mechanistic approach bootstrapping topological properties and systemic risk of complex networks using the fitness model the role of social networks in information diffusion network inference with confidence from multivariate time series revealing network connectivity from response dynamics inferring network topology from complex dynamics revealing physical interaction networks from statistics of collective dynamics learning factor graphs in polynomial time and sample complexity reconstruction of markov random fields from samples: some observations and algorithms, in approximation, randomization and combinatorial optimization.
algorithms and techniques which graphical models are difficult to learn estimation of sparse binary pairwise markov networks using pseudo-likelihoods inverse statistical problems: from the inverse ising problem to data science inferring networks of diffusion and influence on the convexity of latent social network inference learning the graph of epidemic cascades statistical inference approach to structural reconstruction of complex networks from binary time series maximum-likelihood network reconstruction for sis processes is np-hard network reconstruction from infection cascades escaping the curse of dimensionality in estimating multivariate transfer entropy causal network inference by optimal causation entropy reconstructing propagation networks with natural diversity and identifying hidden sources efficient reconstruction of heterogeneous networks from time series via compressed sensing robust reconstruction of complex networks from sparse data universal data-based method for reconstructing complex networks with binary-state dynamics reconstructing weighted networks from dynamics reconstructing network topology and coupling strengths in directed networks of discrete-time dynamics community detection in networks: a user guide bayesian stochastic blockmodeling exact recovery in the ising blockmodel community detection in networks with unobserved edges network structure from rich but noisy data reconstructing networks with unknown and heterogeneous errors stochastic blockmodels and community structure in networks nonparametric bayesian inference of the microcanonical stochastic block model for summary of the full generative model used, details of the inference algorithm and more information on the analysis of empirical data efficient monte carlo and greedy heuristic for the inference of stochastic block models missing and spurious interactions and the reconstruction of complex networks epidemic processes in complex networks spatial interaction and the statistical analysis of lattice systems equation of state calculations by fast computing machines monte carlo sampling methods using markov chains and their applications asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications artifacts or attributes? effects of resolution on the little rock lake food web note that, in this case, our method also exploits the heterogeneous degrees in the network via the dc-sbm, which can refinements of this approach including thouless-anderson-palmer (tap) and bethe-peierls (bp) corrections [14] yield the same performance for this example pseudolikelihood decimation algorithm improving the inference of the interaction network in a general class of ising models the simple rules of social contagion hierarchical block structures and high-resolution model selection in large networks hierarchical edge bundles: visualization of adjacency relations in hierarchical data key: cord-027286-mckqp89v authors: ksieniewicz, paweł; goścień, róża; klinkowski, mirosław; walkowiak, krzysztof title: pattern recognition model to aid the optimization of dynamic spectrally-spatially flexible optical networks date: 2020-05-23 journal: computational science iccs 2020 doi: 10.1007/978-3-030-50423-6_16 sha: doc_id: 27286 cord_uid: mckqp89v the following paper considers pattern recognition-aided optimization of complex and relevant problem related to optical networks. 
for that problem, we propose a four-step dedicated optimization approach that makes use, among others, of a regression method. the main focus of that study is put on the construction of efficient regression model and its application for the initial optimization problem. we therefore perform extensive experiments using realistic network assumptions and then draw conclusions regarding efficient approach configuration. according to the results, the approach performs best using multi-layer perceptron regressor, whose prediction ability was the highest among all tested methods. according to cisco forecasts, the global consumer traffic in the internet will grow on average with annual compound growth rate (cagr) of 26% in years 2017-2022 [3] . the increase in the network traffic is a result of two main trends. firstly, the number of devices connected to the internet is growing due to the increasing popularity of new services including internet of things (iot ). the second important trend influencing the traffic in the internet is popularity of bandwidth demanding services such as video streaming (e.g., netflix ) and cloud computing. the internet consists of many single networks connected together, however, the backbone connecting these various networks are optical networks based on fiber connections. currently, the most popular technology in optical networks is wdm (wavelength division multiplexing), which is expected to be not efficient enough to support increasing traffic in the nearest future. in last few years, a new concept for optical networks has been deployed, i.e., architecture of elastic optical networks (eons). however, in the perspective on the next decade some new approaches must be developed to overcome the predicted "capacity crunch" of the internet. one of the most promising proposals is spectrally-spatially flexible optical network (ss-fon) that combines space division multiplexing (sdm) technology [14] , enabling parallel transmission of co-propagating spatial modes in suitably designed optical fibers such as multi-core fibers (mcfs) [1] , with flexible-grid eons [4] that enable better utilization of the optical spectrum and distanceadaptive transmissions [15] . in mcf-based ss-fons, a challenging issue is the inter-core crosstalk (xt) effect that impairs the quality of transmission (qot ) of optical signals and has a negative impact on overall network performance. in more detail, mcfs are susceptible to signal degradation as a result of the xt that happens between adjacent cores whenever optical signals are transmitted in an overlapping spectrum segment. addressing the xt constraints significantly complicates the optimization of ss-fons [8] . besides numerous advantages, new network technologies bring also challenging optimization problems, which require efficient solution methods. since the technologies and related problems are new, there are no benchmark solution methods to be directly applied and hence many studies propose some dedicated optimization approaches. however, due to the problems high complexity, their performance still needs a lot of effort to be put [6, 8] . we therefore observe a trend to use artificial intelligence techniques (with the high emphasis on pattern recognition tools) in the field of optimization of communication networks. according to the literature surveys in this field [2, 10, 11, 13] , the researchers mostly focus on discrete labelled supervised and unsupervised learning problems, such as traffic classification. 
regression methods, which are in the scope of this paper, are mostly applied for traffic prediction and the estimation of quality of transmission (qot) parameters such as delay or bit error rate. this paper extends our study initiated in [7]. we make use of pattern recognition models to aid the optimization of dynamic mcf-based ss-fons in order to improve the performance of the network in terms of minimizing the bandwidth blocking probability (bbp), or, in other words, to maximize the amount of traffic that can be allocated in the network. in particular, an important topic in the considered optimization problem is the selection of a modulation format (mf) for a particular demand, due to the fact that each mf provides a different tradeoff between required spectrum width and transmission distance. to solve that problem, we define applicable distances for each mf (i.e., the minimum and maximum length of a routing path that is supported by each mf). to find the values of these distances that provide the best allocation results, we construct a regression model and then combine it with monte carlo search. it is worth noting that this work does not address dynamic problems in the context of changing the concept over time, as is often the case with processing large sets, and assumes a static distribution of the concept [9]. the main novelty and contribution of the following work is an in-depth analysis of the basic regression methods stabilized by the structure of the estimator ensemble [16] and an assessment of their usefulness in the task of predicting the objective function for optimization purposes. in one of the previous works [7], we confirmed the effectiveness of this type of solution using a weighted nearest-neighbors regression algorithm, focusing, however, much more on the network aspect of the problem being analyzed. in the present work, the main emphasis is on the construction of the prediction model. its main purpose is:
- a proposal to interpret the optimization problem in the context of pattern recognition tasks.
the rest of the paper is organized as follows. in sect. 2, we introduce the studied network optimization problem. in sect. 3, we discuss our optimization approach for that problem. next, in sect. 4 we evaluate the efficiency of the proposed approach. finally, sect. 5 concludes the work. the optimization problem is known in the literature as dynamic routing, space and spectrum allocation (rssa) in ss-fons [5]. we are given an ss-fon topology realized using mcfs. the topology consists of nodes and physical links. each physical link comprises a number of spatial cores. the spectrum width available on each core is divided into narrow and same-sized segments called slices. the network is in its operational state -we observe it in a particular time perspective given by a number of iterations. in each iteration (i.e., a time point), a set of demands arrives. each demand is given by a source node, destination node, duration (measured in the number of iterations) and bitrate (in gbps). to realize a demand, it is required to assign it a light-path and reserve its resources for the time of the demand duration. when a demand expires, its resources are released. a light-path consists of a routing path (a set of links connecting the demand source and destination nodes) and a channel (a set of adjacent slices selected on one core) allocated on the path links.
the channel width (number of slices) required for a particular demand on a particular routing path depends on the demand bitrate, the path length (in kilometres) and the selected modulation format. each incoming demand has to be realized unless there are not enough free resources when it arrives. in such a case, the demand is rejected. please note that the light-paths selected in the i-th iteration affect the network state and allocation possibilities in the next iterations. the objective function is defined here as the bandwidth blocking probability (bbp), calculated as the summed bitrate of all rejected demands divided by the summed bitrate of all offered demands. since we aim to support as much traffic as possible, the objective criterion should be minimized [5, 8]. the light-paths' allocation process has to satisfy three basic rssa constraints. first, each channel has to consist of adjacent slices. second, the same channel (i.e., the same slices and the same core) has to be allocated on each link included in a light-path. third, at each time point each slice on a particular physical link and a particular core can be used by at most one demand [8]. there are four modulation formats available for transmissions-8-qam, 16-qam, qpsk and bpsk. each format is described by its spectral efficiency, which determines the number of slices required to realize a particular bitrate using that modulation. however, each modulation format is also characterized by the maximum transmission distance (mtd) which provides an acceptable value of the optical signal to noise ratio (osnr) at the receiver side. more spectrally-efficient formats consume less spectrum, however, at the cost of shorter mtds. moreover, more spectrally-efficient formats are also vulnerable to xt effects, which can additionally degrade qot and lead to demands' rejection [7, 8]. therefore, the selection of the modulation format for each demand is a compromise between spectrum efficiency and qot. to answer that problem, we use the procedure introduced in [7] to select a modulation format for a particular demand and routing path. let m = 1, 2, 3, 4 denote the modulation formats ordered by increasing mtds (and, at the same time, by decreasing spectral efficiency). it means that m = 1 denotes 8-qam and m = 4 denotes bpsk. let mtd = [mtd_1, mtd_2, mtd_3, mtd_4] be the vector of mtds for modulations 8-qam, 16-qam, qpsk, bpsk, respectively. moreover, let atd = [atd_1, atd_2, atd_3, atd_4] (where atd_i <= mtd_i, i = 1, 2, 3, 4) be the vector of applicable transmission distances. for a particular demand and a routing path, we select the most spectrally-efficient modulation format i for which atd_i is greater than or equal to the selected path length and the xt effect is at an acceptable level. for each candidate modulation format, we assess the xt level based on the adjacent resources' (i.e., slices and cores) availability, using the procedure proposed in [7]. it is important to note that we do not indicate atd_4 (for bpsk) since we assume that this modulation is able to support transmission on all candidate routing paths regardless of their length. please also note that when the xt level is too high for all modulation formats, the demand is rejected regardless of the light-paths' availability. in sect. 2 we have studied the rssa problem and emphasised the importance of the efficient modulation selection task. for that task we have proposed a solution method whose efficiency strongly depends on the applied atd vector. therefore, we aim to find the atd* vector that provides the best results.
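the modulation selection rule described above is simple enough to state as code. the sketch below picks, for a given candidate path length, the most spectrally efficient format whose applicable transmission distance (atd) covers the path, falling back to bpsk for any length; the xt-admissibility check is represented only by a boolean placeholder argument, and the example atd values are arbitrary.

```python
# Modulation formats ordered from most to least spectrally efficient,
# i.e., m = 1 (8-QAM) ... m = 4 (BPSK), as in the text.
FORMATS = ["8-QAM", "16-QAM", "QPSK", "BPSK"]

def select_modulation(path_length_km, atd, xt_acceptable=True):
    """Return the most spectrally efficient format usable on a path.

    path_length_km : length of the candidate routing path
    atd            : [atd_1, atd_2, atd_3] applicable distances for 8-QAM,
                     16-QAM and QPSK; BPSK is assumed to support any length
    xt_acceptable  : placeholder for the crosstalk admissibility check
    """
    if not xt_acceptable:
        return None                     # demand rejected due to excessive XT
    for name, limit in zip(FORMATS[:3], atd):
        if path_length_km <= limit:
            return name
    return "BPSK"                       # always assumed feasible

if __name__ == "__main__":
    atd = [400, 800, 1600]              # illustrative ATD values in km
    for length in (250, 700, 1200, 2500):
        print(length, "km ->", select_modulation(length, atd))
```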
the vector elements have to be positive and have upper bounds given by vector mtd. moreover, the following condition have to be satisfied: atd i < atd i+1 , i = 1, 2. since solving rssa instances is a time consuming process, it is impossible to evaluate all possible atd vectors in a reasonable time. we therefore make use of regression methods and propose a scheme to find atd * depicted in fig. 1 . a representative set of 1000 different atd vectors is generated. then, for each of them we simulate allocation of demands in ss-fon (i.e., we solve dynamic rssa). for the purpose of demands allocation (i.e., selection of light-paths), we use a dedicated algorithm proposed in [7] . for each considered atd vector we save obtained bbp. based on that data, we construct a regression model, which predicts bbp based on an atd vector. having that model, we use monte carlo method to find atd * vector, which is recommended for further experiments. to solve an rssa instance for a particular atd vector, we use heuristic algorithm proposed in [7] . we work under the assumption that there are 30 candidate routing paths for each traffic demand (generated using dijkstra algorithm). since the paths are generated in advance and their lengths are known, we can use an atd vector and preselect for these paths modulation formats based on the procedure discussed in sect. 2. therefore, rssa is reduced to the selection of one of the candidate routing paths and a communication channel with respect to the resource availability and assessed xt levels. from the perspective of pattern recognition methods, the abstraction of the problem is not the key element of processing. the main focus here is the representation available to construct a proper decision model. for the purposes of considerations, we assume that both input parameters and the objective function take only quantitative and not qualitative values, so we may use probabilistic pattern recognition models to process them. if we interpret the optimization task as searching for the extreme function of many input parameters, each simulation performed for their combination may also be described as a label for the training set of supervised learning model. in this case, the set of parameters considered in a single simulation becomes a vector of object features (x n ), and the value of the objective function acquired around it may be interpreted as a continuous object label (y n ). repeated simulation for randomly generated parameters allows to generate a data set (x) supplemented with a label vector (y). a supervised machine learning algorithm can therefore gain, based on such a set, a generalization abilities that allows for precise estimation of the simulation result based on its earlier runs on the random input values. a typical pattern recognition experiment is based on the appropriate division of the dataset into training and testing sets, in a way that guarantees their separability (most often using cross-validation), avoiding the problem of data peeking and a sufficient number of repetitions of the validation process to allow proper statistical testing of mutual model dependencies hypotheses. for the needs of the proposal contained in this paper, the usual 5-fold cross validation was adopted, which calculates the value of the r 2 metric for each loop of the experiment. having constructed regression model, we are able to predict bbp value for a sample atd vector. 
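the four-step scheme described above (sample atd vectors, simulate their bbp, fit a regressor, and search the fitted model with monte carlo guesses) can be sketched end to end as follows. the MLPRegressor surrogate, the synthetic simulate_bbp function standing in for the dynamic rssa simulator, and the mtd values are all illustrative assumptions of this sketch, not elements of the actual experimental setup.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

MTD = np.array([400.0, 800.0, 1600.0])   # illustrative MTDs for 8-QAM, 16-QAM, QPSK (km)

def random_atd(rng):
    """Draw a random ATD vector with atd_1 < atd_2 < atd_3 and atd_i <= mtd_i."""
    return np.sort(rng.uniform(0.0, MTD))  # sorting preserves the element-wise bounds here

def simulate_bbp(atd):
    """Placeholder for the dynamic RSSA simulator: returns a synthetic BBP value."""
    target = np.array([300.0, 650.0, 1200.0])
    return float(np.clip(0.02 + 1e-7 * np.sum((atd - target) ** 2), 0.0, 1.0))

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # steps 1-2: sample ATD vectors and "simulate" their BBP
    X = np.array([random_atd(rng) for _ in range(1000)])
    y = np.array([simulate_bbp(atd) for atd in X])
    # step 3: fit a regression model predicting BBP from the ATD vector
    model = MLPRegressor(hidden_layer_sizes=(100,), max_iter=2000, random_state=0).fit(X, y)
    # step 4: Monte Carlo search over the (cheap) surrogate model
    candidates = np.array([random_atd(rng) for _ in range(10_000)])
    best = candidates[np.argmin(model.predict(candidates))]
    print("recommended ATD vector:", np.round(best, 1))
```

the point of the surrogate is that step 4 becomes cheap: evaluating the regressor on thousands of candidate atd vectors takes a small fraction of the time needed for a single dynamic rssa simulation.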
please note that the time required for a single prediction is significantly shorter than the time required to simulate a dynamic rssa. the last step of our optimization procedure is to find the atd* vector providing the lowest estimated bbp values. to this end, we use the monte carlo method with a number of guesses provided by the user. the rssa problem was solved for two network topologies-dt12 (12 nodes, 36 links) and euro28 (28 nodes, 82 links). they model deutsche telecom (german national network) and a european network, respectively. each network physical link comprises 7 cores, wherein each of the cores offers 320 frequency slices of 12.5 ghz width. we use the same network physical assumptions and xt levels and assessments as in [7]. traffic demands have randomly generated end nodes and bitrates uniformly distributed between 50 gbps and 1 tbps, with a granularity of 50 gbps. their arrivals follow a poisson process with an average arrival rate of λ demands per time unit. the demand duration is generated according to a negative exponential distribution with an average of 1/μ. the traffic load offered is λ/μ normalized traffic units (ntus). for each testing scenario, we simulate the arrival of 10^6 demands. four modulations are available (8-qam, 16-qam, qpsk, bpsk), wherein we use the same modulation parameters as in [7]. for each topology we have generated 9 different datasets, each consisting of 1000 samples of the atd vector and the corresponding bbp. the datasets differ in the xt coefficient (μ = 1·10^−9, indicated as "xt1", and μ = 2·10^−9, indicated as "xt2"; for more details we refer to [7]) and in the network link scaling factor (the multiplier used to scale the lengths of links in order to evaluate whether different lengths of routing paths influence the performance of the proposed approach). for dt12 we use the following scaling factors: 0.4, 0.6, 0.8, . . . , 2.0. for euro28 the values are as follows: 0.104, 0.156, 0.208, 0.260, 0.312, 0.364, 0.416, 0.468, 0.520. we indicate them as "sx.xxx", where x.xxx refers to the scaling factor value. using these datasets we can evaluate whether the xt coefficient (i.e., the level of vulnerability to xt effects) and/or the average link length influence the optimization approach performance. the experimental environment for the construction of predictive models, including the implementation of the proposed processing method, was implemented in python, following the guidelines of the state-of-the-art programming interface of the scikit-learn library [12]. statistical dependency assessment metrics for paired tests were calculated using the wilcoxon test, following the implementation contained in the scipy module. each of the individual experiments was evaluated by the r^2 score -a typical quality assessment metric for regression problems. the full source code, supplemented with the employed datasets, is publicly available in a git repository 1 . five simple recognition models were selected as the base experimental estimators (a scikit-learn sketch of these configurations is shown below):
- knr-k-nearest neighbors regressor with five neighbors, leaf size of 30 and euclidean metric approximated by minkowski distance,
- dknr-knr regressor weighted by distance from closest patterns,
- mlp-a multilayer perceptron with one hidden layer of one hundred neurons, with the relu activation function and adam optimizer,
- dtr-cart tree with mse split criterion,
- lin-linear regression algorithm.
in this section we evaluate the performance of the proposed optimization approach. to this end, we conduct three experiments.
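the five base estimators and the evaluation protocol described above map almost directly onto scikit-learn objects. the snippet below builds them with the stated settings and scores each with 5-fold cross-validated r^2 on a synthetic dataset standing in for the simulation data; note that recent scikit-learn versions call the cart split criterion "squared_error", which is the same criterion the text refers to as mse.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

models = {
    "KNR":  KNeighborsRegressor(n_neighbors=5, leaf_size=30, metric="minkowski"),
    "dKNR": KNeighborsRegressor(n_neighbors=5, leaf_size=30, metric="minkowski",
                                weights="distance"),
    "MLP":  MLPRegressor(hidden_layer_sizes=(100,), activation="relu", solver="adam",
                         max_iter=2000, random_state=0),
    "DTR":  DecisionTreeRegressor(criterion="squared_error", random_state=0),
    "LIN":  LinearRegression(),
}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(1000, 3))             # stand-in for ATD vectors
    y = 0.1 + 0.5 * np.sum((X - 0.5) ** 2, axis=1)    # stand-in for simulated BBP values
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
        print(f"{name:>4}: R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```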
experiment 1 focuses on the number of patterns required to construct a reliable prediction model. experiment 2 assesses the statistical dependence of the built models. finally, experiment 3 verifies the efficiency of the proposed approach as a function of the number of guesses in the monte carlo search. the first experiment carried out as part of the approach evaluation is designed to verify how many patterns -and thus how many repetitions of simulations -must be passed to individual regression algorithms to allow the construction of a reliable prediction model. the tests were carried out on all five considered regressors in two stages. first, the range from 10 to 100 patterns was analyzed, and in the second stage, from 100 to 1000 patterns per processing. it is important to note that due to the chosen approach to cross-validation, in each case the model is built on 80% of the available objects. the analysis was carried out independently on all available data sets, and due to the non-deterministic nature of the sampling of available patterns, its results were additionally stabilized by repeating the choice of the objects subset five times. in order to allow proper observations, the results were averaged for both topologies. plots for the range from 100 to 1000 patterns were additionally supplemented by marking the ranges of the standard deviation of the r^2 metric acquired within the topology and presented in the range from the .8 value. the results achieved by averaging over individual topologies are presented in figs. 2 and 3. for the dt12 topology, the mlp and dtr algorithms are competitively the best models, both in terms of the dynamics of the relationship between the number of patterns and the overall regression quality. the linear regression clearly lags behind the rest. a clear observation is also the saturation of the models, understood as approaching the maximum predictive ability, with as few as around 100 patterns in the data set. the best algorithms already achieve quality within .8, and with 600 patterns they stabilize around .95. the relationship between each of the recognition algorithms and the number of patterns takes the form of a logarithmic curve in which, after fast initial growth, each subsequent object gives less and less potential for improving the quality of prediction. this suggests that it is not necessary to carry out further simulations to extend the training set, because it will not significantly affect the predictive quality of the developed model. very similar observations may be made for the euro28 topology, noting, however, that it seems to be a simpler problem, allowing faster achievement of the maximum model predictive capacity. it is also worth noting the fact that the standard deviation of the results obtained by mlp is smaller, which may be equated with the potentially greater stability of the model achieved by such a solution. the second experiment extends the research contained in experiment 1 by assessing the statistical dependence of models built on the full datasets consisting of a thousand samples for each case. the results achieved are summarized in tables 1a and b. as may be seen, for the dt12 topology, the lin algorithm clearly deviates negatively from the other methods, in absolutely every case being a worse solution than any of the others, which leads to the conclusion that we should completely reject it from consideration as a base for a stable recognition model.
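the pairwise statistical comparison in experiment 2 can be reproduced with scipy's wilcoxon signed-rank test, applied to the per-fold r^2 scores of two models; the scores below are made-up numbers used only to show the call, not the results reported in tables 1a and b.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-fold R^2 scores for two models over repeated cross-validation.
mlp_scores = np.array([0.95, 0.94, 0.96, 0.93, 0.95, 0.94, 0.96, 0.95, 0.93, 0.94])
dtr_scores = np.array([0.93, 0.935, 0.92, 0.925, 0.94, 0.92, 0.95, 0.93, 0.92, 0.93])

# Paired, one-sided test: is MLP better than DTR on the same folds?
stat, p_value = wilcoxon(mlp_scores, dtr_scores, alternative="greater")
print(f"Wilcoxon statistic = {stat}, p-value = {p_value:.4f}")
```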
algorithms based on neighborhood (knr and dknr) are in the middle of the field, in most cases statistically giving way to mlp and dtr, which would also suggest departing from them in the construction of the final model. the statistically best solutions in this case are, almost equally, mlp and dtr. for the euro28 topology, the results are similar when it comes to the lin, knr and dknr approaches. a significant difference, however, may be seen in the achievements of dtr, which in one case turns out to be the worst in the field, and in many cases is significantly worse than mlp. these observations suggest that the final model for the purposes of optimization should lean towards the application of neural networks. what is important, the highest quality prediction does not exactly mean the best optimization. it is one of the very important factors, but not the only one. it is also necessary to be aware of the shape of the decision function. for this purpose, the research was supplemented with the visualizations contained in fig. 4. algorithms based on neighborhood (knn, dknn) and decision trees (dtr) are characterized by a discrete decision boundary, which in the case of visualization resembles a picture with a low level of quantization. in the case of an ensemble model, stabilized by cross-validation, actions are taken to reduce this property in order to develop as continuous a border as possible. as may be seen in the illustrations, compensation occurs, although in the case of knn and dknn it leads to some disturbances in the decision boundary (interpreted as thresholding the predicted label value), and in the dtr case, despite the general correctness of the performed decisions, it generates image artifacts. such a model may still retain high predictive ability, but it has too strong a tendency to overfit and leads to insufficient continuity of the optimized function to perform effective optimization. clear decision boundaries are implemented by both the lin and mlp approaches. however, it is necessary to reject lin from processing due to the linear nature of the prediction, which (i) in each optimization will lead to the selection of the extreme value of the analyzed range and (ii) is not compatible with the distribution of the explained variable and must have the largest error in each of the optima. summing up the observations of experiments 1 and 2, the mlp algorithm was chosen as the base model for the optimization task. it is characterized by (i) the statistically best predictive ability among the methods analyzed and (ii) the clearest decision function from the perspective of the optimization task. the last experiment focuses on finding the best atd vector based on the constructed regression model. to this end, we use the monte carlo method with different numbers of guesses. tables 2 and 3 present the obtained results as a function of the number of guesses, which changes from 10^1 up to 10^9. the results quality increases with the number of guesses up to some threshold value. then, the results do not change at all or change only a little. according to the presented values, the monte carlo method applied with 10^3 guesses provides satisfactory results. we therefore recommend that value for further experiments. the following work has considered the topic of employing pattern recognition methods to support the ss-fon optimization process.
for a wide pool of generated cases, analyzing two real network topologies, the effectiveness of solutions implemented by five different, typical regression methods was analyzed, starting from logistic regression and ending with neural networks. conducted experimental analysis shows, with high probability obtained by conducting proper statistical validation, that mlp is characterized by the greatest potential in this type of solutions. even with a relatively small pool of input simulations, constructing a data set for learning purpouses, interpretable in both the space of optimization and machine learning problems, simple networks of this type achieve both high quality prediction measured by the r 2 metric, and continuous decision space creating the potential for conducting optimization. basing the model on the stabilization realized by using ensemble of estimators additionally allows to reduce the influence of noise on optimization, whichin a state-of-art optimization methods -could show a tendency to select invalid optimas, burdened by the nondeterministic character of the simulator. further research, developing ideas presented in this article, will focus on the generalization of the presented model for a wider pool of network optimization problems. high-capacity transmission over multi-core fibers a comprehensive survey on machine learning for networking: evolution, applications and research opportunities visual networking index: forecast and trends elastic optical networking: a new dawn for the optical layer on the efficient dynamic routing in spectrally-spatially flexible optical networks on the complexity of rssa of any cast demands in spectrally-spatially flexible optical networks machine learning assisted optimization of dynamic crosstalk-aware spectrallyspatially flexible optical networks survey of resource allocation schemes and algorithms in spectrally-spatially flexible optical networking data stream classification using active learned neural networks artificial intelligence (ai) methods in optical networks: a comprehensive survey an overview on application of machine learning techniques in optical networks scikit-learn: machine learning in python machine learning for network automation: overview, architecture, and applications survey and evaluation of space division multiplexing: from technologies to optical networks modeling and optimization of cloud-ready and content-oriented networks. ssdc classifier selection for highly imbalanced data streams with minority driven ensemble key: cord-024571-vlklgd3x authors: kim, yushim; kim, jihong; oh, seong soo; kim, sang-wook; ku, minyoung; cha, jaehyuk title: community analysis of a crisis response network date: 2019-07-28 journal: soc sci comput rev doi: 10.1177/0894439319858679 sha: doc_id: 24571 cord_uid: vlklgd3x this article distinguishes between clique family subgroups and communities in a crisis response network. then, we examine the way organizations interacted to achieve a common goal by employing community analysis of an epidemic response network in korea in 2015. the results indicate that the network split into two groups: core response communities in one group and supportive functional communities in the other. the core response communities include organizations across government jurisdictions, sectors, and geographic locations. other communities are confined geographically, homogenous functionally, or both. 
we also find that whenever intergovernmental relations were present in communities, the member connectivity was low, even if intersectoral relations appeared together within them. in its everyday sense, a clique refers to a set of people who associate with each other, are friends, know each other, etc., which generally refers to a social circle (mokken, 1979, p. 161), while a community is formed through concrete social relationships (e.g., high school friends) or sets of people perceived to be similar, such as the italian community and twitter community (gruzd, wellman, & takhteyev, 2011; hagen, keller, neely, depaula, & robert-cooperman, 2018). in social network analysis, a clique is operationalized as " . . . a subset of actors in which every actor is adjacent to every other actor in the subset" (borgatti, everett, & johnson, 2013, p. 183), while communities refer to " . . . groups within which the network connections are dense, but between which they are sparser" (newman & girvan, 2004, p. 69). the clique and its variant definitions (e.g., n-cliques and k-cores) focus on internal edges, while the community is a concept based on the distinction between internal edges and the outside. we argue that community analysis can provide useful insights about the interrelations among diverse organizations in the ern. we have not yet found any studies that have investigated cohesive subgroups in large multilevel, multisectoral erns through a community lens. with limited guidance from the literature on erns, we lack specific expectations or hypotheses about what the community structure in the network may look like. therefore, our study focuses on identifying and analyzing communities in the 2015 middle east respiratory syndrome coronavirus (mers) response in south korea as a case study. we address the following research questions: (1) in what way were distinctive communities divided in the ern? and (2) how did the interorganizational relations relate to the internal characteristics of the communities? by detecting and analyzing the community structure in an ern, we offer insights for future empirical studies on erns. the interrelations in erns have been examined occasionally by analyzing the entire network's structure. for example, the katrina case exhibited a large and sparse network, 1 in which a small number of nodes had a large number of edges and a large number of nodes had a small number of edges (butts, acton, & marcum, 2012). the katrina response network can be thought of as " . . . a loosely connected set of highly cohesive clusters, surrounded by an extensive 'halo' of pendant trees, small independent components, and isolates" (butts et al., 2012, p. 23). the network was sparse and showed a tree-like structure but also included cohesive substructures. other studies on the katrina response network have largely concurred with these observations (comfort & haase, 2006; kapucu, arslan, & collins, 2010). in identifying cohesive subgroups in the katrina response network, these studies rely on the analysis of cliques: "a maximal complete subgraph of three or more nodes" (wasserman & faust, 1994, p. 254) or clique-like structures (n-cliques or k-cores). the n-cliques can include nodes that are not in the clique but are accessible. similarly, k-cores refer to maximal subgraphs with a minimum degree of at least k. many cliques were identified in the katrina response network, in which federal and state agencies appeared frequently (comfort & haase, 2006; kapucu, 2005).
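the contrast drawn above between clique-family subgroups and communities can be made concrete with networkx. the snippet below builds a small synthetic two-group network, then lists maximal cliques and, separately, modularity-based communities; the graph is purely illustrative and is not the mers or katrina response network.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# A small synthetic network with two planted groups and a few bridging ties.
g = nx.planted_partition_graph(l=2, k=15, p_in=0.4, p_out=0.02, seed=3)

# Clique view: maximal complete subgraphs (every member tied to every other member).
cliques = [c for c in nx.find_cliques(g) if len(c) >= 3]
print("number of maximal cliques of size >= 3:", len(cliques))
print("largest clique size:", max(len(c) for c in cliques))

# Community view: groups densely connected inside and sparsely connected outside.
communities = greedy_modularity_communities(g)
print("number of communities:", len(communities))
print("community sizes:", sorted(len(c) for c in communities))
```

typically, such a graph contains many small overlapping cliques but only a handful of communities, which is the distinction the surrounding discussion relies on.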
using k-cores analysis, butts, acton, and marcum (2012) suggest that the katrina response network's inner structure was built around a small set of cohesive subgroups that was divided along institutional lines corresponding to five state clusters (alabama, colorado, florida, georgia, and virginia), a cluster of u.s. federal organizations, and one of nongovernmental organizations. while these studies suggest the presence of cohesive subgroups in erns, we have not found any research that thoroughly discussed subsets of organizations' significance in erns. from the limited literature, we identify two different, albeit related, reasons that cohesive subgroups have interested ern researchers. in their analysis of cohesive subgroups using cliques, comfort and haase (2006) assume that a cohesive subgroup can facilitate achieving shared tasks as a group, but it can be less adept at managing the full flow of information and resources across groups and thus decreasing the entire network's coherence. kapucu and colleagues (2010) indicate that the recurrent patterns of interaction among the sets of selected organizations may be the result of excluding other organizations in decision-making, which may be a deterrent to all organizations' harmonious concerted efforts in disaster responses. comfort and haase (2006) view cliques as an indicator of " . . . the difficulty of enabling collective action across the network" (p. 339), 2 and others have adhered closely to this perspective (celik & corbacioglu, 2016; hossain & kuti, 2010; kapucu, 2005) . cohesive subgroups such as cliques are assumed to be a potential hindrance to the entire network's performance. the problem with this perspective is that one set of eyes can perceive cohesive subgroups in erns as a barrier, while another can regard them as a facilitator of an effective response. while disaster and emergency response plans are inherently limited and not implemented in practice as intended (clarke, 1999) , stakeholder organizations' responses may be performed together with presumed structures, particularly in a setting in which government entities are predominant. for example, the incident command system (ics) 3 was designed to improve response work's efficiency by constructing a standard operating procedure (moynihan, 2009 ). structurally, one person serves as the incident commander who is responsible for directing all other responders (kapucu & garayev, 2016) . ics is a somewhat hierarchical command-and-control system with functional arrangements in five key resources and capabilities-that is, command, operations, planning, logistics, and finance (kapucu & garayev, 2016) . in an environment in which such an emergency response model is implemented, it is realistic to expect clusters and subgroups to reflect the model's structural designs and arrangements, and they may be intentionally designed to facilitate coordination, communication, and collaboration with other parts or subgroups efficiently in a large response network. others are interested in identifying cohesive subgroups because they may indicate a lack of cross-jurisdictional and cross-sectoral collaboration in erns. during these responses, public organizations in different jurisdictions participate, and a sizable number of organizations from nongovernmental sectors also become involved (celik & corbacioglu, 2016; comfort & haase, 2006; kapucu et al., 2010; spiro, acton, & butts, 2013) . 
organizational participation by multiple government levels and sectors is often necessary because knowledge, expertise, and resources are distributed in society. participating organizations must collaborate and coordinate their efforts. however, studies have suggested that interactions in erns are limited and primarily occur among similar organizations, particularly within the same jurisdiction. that is, public organizations tend to interact more frequently with other public organizations in specific geographic locations (butts et al., 2012; hossain & kuti, 2010; kapucu, 2005; tang, deng, shao, & shen, 2017) . these studies indicate that organizations have been insufficiently integrated across government jurisdictions (tang et al., 2017) or sectors (butts et al., 2012; hossain & kuti, 2010) , and the identification of cliques composed of similar organizations reinforces such a concern. in our view, there is a greater, or perhaps more interesting, question related to the crossjurisdictional and cross-sectoral integration in interorganizational response networks: how are intergovernmental relations mixed with intersectoral relations in erns? here, we use the term interorganizational relations to refer to both intergovernmental and intersectoral relations. intergovernmental relations refer to the interaction among organizations across different government levels (local, provincial, and national) , and intersectoral relations involve the interaction among organizations across different sectors (public, private, nonprofit, and civic sectors). recent studies have suggested that both intergovernmental and intersectoral relations shape erns (kapucu et al., 2010; kapucu & garayev, 2011; tang et al., 2017) , but few have analyzed the way the two interorganizational relations intertwine. if the relation interdependencies in the entire network are of interest to ern researchers, as is the case in this article, focusing on cliques may not necessarily be the best approach to the question because clique analysis may continue to find sets of selected organizations that are tightly linked for various reasons. the analysis of cliques is a very strict way of operationalizing cohesive subgroups from a social network perspective (moody & coleman, 2015) , and there are two issues with using it to identify cohesive subgroups in erns. first, clique analysis assumes complete connections of three or more subgroup members, while real-world networks tend to have many small overlapping cliques that do not represent distinct groups (moody & coleman, 2015) . even if substantively meaningful cliques appear, they may not necessarily imply a lack of information flow across subgroups or other organizations' exclusion, as previous ern studies have assumed (comfort & haase, 2006; kapucu et al., 2010) . second, clique analysis assumes no internal differentiation in members' structural position within the subgroup (wasserman & faust, 1994) . in a task-oriented network such as an ern, organizations within a subgroup may look similar (e.g., all fire organizations). however, this does not imply that they are identical in their structural positions. when these assumptions in clique analysis do not hold, identifying cohesive subgroups as cliques is inappropriate (wasserman & faust, 1994) . similarly, other clique-like approaches (n-cliques and k-cores) demand an answer to the question: "what is the n-or k-?" 
the clique and clique-like approaches have a limited ability to define and identify cohesive subgroups in a task-oriented network because they do not clearly explain why the subgroups need to be defined and identified in such a manner. we proposed a different way of thinking about and finding subsets of organizations in erns: community. when a network consists of subsets of nodes with many edges that connect nodes of the same subset, but few that lie between subsets, the network is said to have a community structure (wilkinson & huberman, 2004). network researchers have developed methods with which to detect communities (fortunato, latora, & marchiori, 2004; latora & marchiori, 2001; lim, kim, & lee, 2016; newman & girvan, 2004; yang & leskovec, 2014). optimization approaches, such as the louvain and leiden methods, which we use in this article, sort nodes into communities by maximizing a clustering objective function (e.g., modularity). beginning with each node in its own group, the algorithm joins groups together in pairs, choosing the pairs that maximize the increase in modularity (moody & coleman, 2015). this method performs an iterative process of node assignments until modularity is maximized and leads to a hierarchical nesting of nodes (blondel, guillaume, lambiotte, & lefebvre, 2008). recently, the louvain algorithm was upgraded and improved as the leiden algorithm, which addresses some issues in the louvain algorithm (traag, waltman, & van eck, 2018). modularity (q), which shows the quality of partitions, is measured and assessed quantitatively as q = \sum_{i} ( e_{ii} - a_{i}^{2} ), with a_{i} = \sum_{j} e_{ij}, in which e_{ii} is the fraction of the intra-edges of community i over all edges, and e_{ij} is the fraction of the inter-edges between community i and community j over all edges. modularity scores are used to compare assignments of nodes into different communities and also the final partitions. it is calculated as a normalized index value: if there is only one group in a network, q takes the value of zero; if all ties are within separate groups, q takes the maximum value of one. thus, a higher q indicates a greater portion of intra- than inter-edges, implying a network with a strong community structure (fortunato et al., 2004). currently, there are two challenges in community detection studies. first, the modular structure in complex networks usually is not known beforehand (traag et al., 2018). we know the community structure only after it is identified. second, there is no formal definition of community in a graph (reichardt & bornholdt, 2006; wilkinson & huberman, 2004); it is simply a concept of relative density (moody & coleman, 2015). a high modularity score ensures only that " . . . the groups as observed are distinct, not that they are internally cohesive" (moody & coleman, 2015, p. 909) and does not guarantee any formal limit on the subgroup's internal structure. thus, internal structure must be examined, especially in such situations as erns. despite these limitations, efforts to reveal underlying community structures have been undertaken with a wide range of systems, including online and off-line social systems, such as an e-mail corpus of a million messages in organizations (tyler, wilkinson, & huberman, 2005), zika virus conversation communities on twitter (hagen et al., 2018), and jazz musician networks (gleiser & danon, 2003). further, one can exploit complex networks by identifying their community structure.
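as a concrete check of the modularity formula above, this illustrative python sketch computes q directly from the fractions e_ii and a_i and compares it with networkx's built-in modularity function; the graph and partition are stand-ins, not the mers data.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

g = nx.karate_club_graph()                          # stand-in undirected graph
part = list(greedy_modularity_communities(g))       # partition to be scored

m = g.number_of_edges()
node2comm = {n: i for i, c in enumerate(part) for n in c}
k = len(part)

# e[i][j]: fraction of edges linking community i to community j
# (an inter-community edge is split half-and-half so the matrix stays symmetric)
e = [[0.0] * k for _ in range(k)]
for u, v in g.edges():
    i, j = node2comm[u], node2comm[v]
    if i == j:
        e[i][i] += 1.0 / m
    else:
        e[i][j] += 0.5 / m
        e[j][i] += 0.5 / m

a = [sum(row) for row in e]                          # a_i = sum_j e_ij
q = sum(e[i][i] - a[i] ** 2 for i in range(k))       # q = sum_i (e_ii - a_i^2)

print("q (direct formula):", round(q, 4))
print("q (networkx):      ", round(modularity(g, part), 4))
```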
for example, salathé and jones (2010) showed that community structures in human contact networks significantly influence infectious disease dynamics. their findings suggest that, in a network with a community structure, targeting individuals who bridge communities for immunization is better than intervening with highly connected individuals. we exploit the community detection and analysis to understand an ern's substructure in the context of an infectious disease outbreak. it is difficult to know the way communities in erns will form beforehand without examining clusters and their compositions and connectivity in the network. we may expect to observe communities that consist of diverse organizations because organizations' shared goal in erns is to respond to a crisis by performing necessary tasks (e.g., providing mortuary and medical services as well as delivering materials) through concerted efforts on the part of those with different capabilities (moynihan, 2009; waugh, 2003) . organizations that have different information, skills, and resources may frequently interact in a disruptive situation because one type alone, such as the government or organizations in an affected area, cannot cope effectively with the event (waugh, 2003) . on the other hand, we also cannot rule out the possibility shown in previous studies (butts et al., 2012; comfort & haase, 2006; kapucu, 2005) . organizations that work closely in normal situations because of their task similarity, geographic locations, or jurisdictions may interact more frequently and easily, even in disruptive situations (hossain & kuti, 2010) , and communities may be identified that correspond to those factors. a case could be made that communities in erns consist of heterogeneous organizations, but a case could also be made that communities are made up of homogeneous organizations with certain characteristics. it is equally difficult to set expectations about communities' internal structure in erns. we can expect that, regardless of their types, sectors, and locations, some organizations work and interact closely-perhaps even more so in such a disruptive situation. emergent needs for coordination, communication, and collaboration also can trigger organizational interactions that extend beyond the usual or planned structure. thus, the relations among organizations become dense and evolve into the community in which every member is connected. on the other hand, a community in the task network may not require all of the organizations within it to interact. for example, if a presumed structure is strongly established, organizations are more likely to interact with others within the planned structure following the chain of command and control. even without such a structure, government organizations may coordinate their responses following the existing chain of command and control in their routine. we may expect to observe communities with a sparse connection among organizations. thus, the way communities emerge in erns is an open empirical question that can be answered by examining the entire network. several countries have experienced novel infectious disease outbreaks over the past decade (silk, 2018; swaan et al., 2018; williams et al., 2015) and efforts to control such events have been more or less successful, depending upon the instances and countries. 
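the point about bridging nodes can be illustrated with a small python sketch (a hypothetical planted-partition graph, not the cited study's data): it ranks nodes by how many of their edges cross community boundaries and compares that ranking with plain degree.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# stand-in contact/response network with a planted community structure
g = nx.planted_partition_graph(3, 20, p_in=0.3, p_out=0.02, seed=1)
comms = greedy_modularity_communities(g)
node2comm = {n: i for i, c in enumerate(comms) for n in c}

# bridging score: how many of a node's edges cross community boundaries
bridging = {n: sum(node2comm[n] != node2comm[nb] for nb in g[n]) for n in g}
deg = dict(g.degree())

print("top bridging nodes:", sorted(bridging, key=bridging.get, reverse=True)[:5])
print("top degree hubs:   ", sorted(deg, key=deg.get, reverse=True)[:5])
```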
in low probability, high-consequence infectious diseases such as the 2015 mers outbreak in south korea, a concerted response among individuals and organizations is virtually the only way to respond because countermeasures-such as vaccines-are not readily available. thus, to achieve an effective response, it is imperative to understand the way individuals and organizations mobilize and respond in public health emergencies. however, the response system for a national or global epidemic is highly complex (hodge, 2015; sell et al., 2018; williams et al., 2015) because of several factors: (1) the large number of organizations across multiple government levels and sectors, (2) the diversity of and interactions among organizations for the necessary (e.g., laboratory testing) or emergent (e.g., hospital closure) tasks, and (3) concurrent outbreaks or treatments at multiple locations attributable to the virus's rapid spread. all of these factors create challenges when responding to public health emergencies. we broadly define a response network as the relations among organizations that can act as critical channels for information, resources, and support. when two organizations engage in any mers-specific response interactions, they are considered to be related in the response. examples of interactions include taking joint actions, communicating with each other, or sharing crucial information and resources (i.e., exchanging patient information, workforce, equipment, or financial support) related to performing the mers tasks, as well as having meetings among organizations to establish a collaborative network. we collected response network data from the following two archival sources: (1) news articles from south korea's four major newspapers 4 published between may 20, 2015, and december 31, 2015 (the outbreak period), and (2) a postevent white paper that the ministry of health and welfare published in december 2016. in august 2016, hanyang university's research center in south korea provided an online tagging tool for every news article in the country's news articles database that included the term "mers (http://naver.com)." a group of researchers at the korea institute for health and social affairs wrote the white paper (488 pages, plus appendices) based on their comprehensive research using multiple data sources and collection methods. the authors of this article and graduate research assistants, all of whom are fluent in korean, were involved in the data collection process from august 2016 to september 2017. because of the literature's lack of specific guidance on the data to collect from archival materials to construct interorganizational network data, we collected the data through trial and error. we collected data from news articles through two separate trials (a total of 6,187 articles from the four newspapers). the authors and a graduate assistant then ran a test trial between august 2016 and april 2017. in july 2017, the authors developed a data collection protocol based on the test trial experience collecting the data from the news articles and white paper. then, we recollected the data from the news articles between august 2017 and september 2017 using the protocol. 5 when we collected data by reviewing archival sources, we first tagged all apparent references within the source text to organizations' relational activities. 
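as a rough illustration of how tagged relational activities of this kind could be assembled into an organization-by-organization sociomatrix, the sketch below uses a few hypothetical records and column names; it is not the authors' coding pipeline.

```python
import pandas as pd
import networkx as nx

# hypothetical tagged records: source organization, target organization, purpose
records = pd.DataFrame([
    {"org_a": "mohw", "org_b": "kcdc", "purpose": "information sharing"},
    {"org_a": "kcdc", "org_b": "hospital_a", "purpose": "joint action"},
    {"org_a": "mohw", "org_b": "seoul_metro_gov", "purpose": "meeting"},
    {"org_a": "mohw", "org_b": "kcdc", "purpose": "resource support"},
])

# undirected relation: an edge exists if any interaction was tagged between two organizations
g = nx.from_pandas_edgelist(records, source="org_a", target="org_b")
sociomatrix = nx.to_pandas_adjacency(g, dtype=int)
print(sociomatrix)
```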
organizations are defined as "any named entity that represents (directly or indirectly) multiple persons or other entities, and that acts as a de facto decision making unit within the context of the response" (butts et al., 2012, p. 6). if we found an individual's name on behalf of the individual's organization (e.g., the secretary of the ministry of health and welfare), we coded the individual as the organization's representative. these organizational interactions were coded for a direct relation based on "whom" to "whom" and for "what purpose." then, these relational activity tags were rechecked. all explicit mentions of relations among organizations referred to in the tagged text were extracted into a sociomatrix of organizations. we also categorized individual organizations into different "groups" using the following criteria. first, we distinguished the entities in south korea from those outside the country (e.g., world health organization [who], centers for disease control and prevention [cdc]). second, we sorted governmental entities by jurisdiction (e.g., local, provincial/metropolitan, or national) and then also by the functions that each organization performs (e.g., health care, police, fire). for example, we categorized local fire stations differently from provincial fire headquarters because these organizations' scope and role differ within the governmental structure. we categorized nongovernmental entities in the private, nonprofit, or civil society sectors that provide primary services in different service areas (e.g., hospitals, medical waste treatment companies, professional associations). at the end of the data collection process, 69 organizational groups from 1,395 organizations were identified (see appendix). 6 we employed the leiden algorithm using python (traag et al., 2018), which we discussed in the previous section. the leiden algorithm is also available for gephi as a plugin (https://gephi.org/). after identifying communities, the network can be reduced to these communities. in generating the reduced graph, each community appears within a circle, the size of which varies according to the number of organizations in the community. the links between communities indicate the connections among community members. the thickness of the lines varies in proportion to the number of pairs of connected organizations. this process improves the ability to understand the network structure drastically and provides an opportunity to analyze the individual communities' internal characteristics such as the organizations' diversity and their connectivity for each community. shannon's diversity index (h) is used as a measure of diversity because uncertainty increases as species' diversity in a community increases (dejong, 1975). the h index accounts for both species' richness and evenness in a community (organizational groups in a community in our case). s indicates the total number of species. the fraction of the population that constitutes a species, i, is represented by p_i and multiplied by the natural logarithm of the proportion (ln p_i). the resulting product is then summed across species and multiplied by -1: h = -\sum_{i=1}^{s} p_{i} \ln p_{i}. high h values represent more diverse communities. shannon's evenness e is calculated by e = h / \ln s, which indicates various species' equality in a community. when all of the species are equally abundant, maximum evenness (i.e., 1) is obtained.
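the diversity measures just defined are easy to compute; the following sketch implements h and e exactly as above for one community with a hypothetical composition of organizational groups.

```python
import math
from collections import Counter

def shannon_diversity(group_labels):
    """shannon's h and evenness e for one community, given each member's organizational group."""
    counts = Counter(group_labels)
    n = sum(counts.values())
    s = len(counts)                                   # number of distinct groups ("species")
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    e = h / math.log(s) if s > 1 else 0.0
    return h, e

# hypothetical community composition (organizational group of each member)
community = ["local_gov", "local_gov", "hospital", "fire", "ngo", "hospital"]
h, e = shannon_diversity(community)
print(f"h = {h:.3f}, e = {e:.3f}")
```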
while limited, density and the average clustering coefficient can capture the basic idea of a subgraph's structural cohesion or "cliquishness" (moody & coleman, 2015). a graph's density (d) is the proportion of possible edges present in the graph, which is the ratio between the number of edges present and the maximum possible. it ranges from 0 (no edges) to 1 (if all possible lines are present). a graph's clustering coefficient (c) is the probability that two neighbors of a node are neighbors themselves. it essentially measures the extent to which a node's neighbors themselves form a clique. c is 1 in a fully connected graph. the mers response network in the data set consists of 1,395 organizations and 4,801 edges. table 1 shows that most of the organizations were government organizations (approximately 80%) and 20% were nongovernmental organizations from different sectors. local government organizations constituted the largest proportion of organizations (68%). further, one international organization (i.e., who) and foreign government agencies or foreign medical centers (i.e., cdc, erasmus university medical center) appeared in the response network. organizations coordinated with approximately three other organizations (average degree: 3.44). however, six organizations coordinated with more than 100 others. the country's health authorities, such as the ministry of health and welfare (mohw: 595 edges), central mers management headquarters (cmmh: 551 edges), and korea centers for disease control and prevention (kcdc: 253 edges), were found to have a large number of edges. the ministry of environment (304 edges) also coordinated with many other organizations in the response. the national medical center had 160 edges, and the seoul metropolitan city government had 129. the leiden algorithm detected 27 communities in the network, labeled as 0 through 26 in figures 1-3 and tables 2 and 3. the final modularity score (q) was 0.584, showing that the community detection algorithm partitioned and identified the communities in the network reasonably well. in real-world networks, modularity scores " . . . typically fall in the range from about 0.30 to 0.70. high values are rare" (newman & girvan, 2004, p. 7). the number of communities was also consistent in the leiden and louvain algorithms (26 communities in the louvain algorithm). the modularity score was slightly higher in the leiden algorithm than the q = 0.577 in the louvain. figure 1 presents the mers response network with communities in different colors to show the organizations' clustering using forceatlas2 layout in gephi. in figure 2, the network's community structure is clear to the human eye. from the figures (and the community analysis in table 2), we find that the mers response network was divided into two sets of communities according to which communities were at the center of the network and their nature of activity in the response, core response communities in one group and supportive functional communities in the other. the two core communities (1 and 2) at the center of the response network included a large number of organizations, with a knot of intergroup coordination among the groups surrounding those two. these communities included organizations across government jurisdictions, sectors, and geographic locations (table 2, description) and were actively involved in the response during the mers outbreak.
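per-community density and clustering of the kind reported above can be obtained with networkx; a brief illustrative sketch on a stand-in graph with a planted community structure:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

g = nx.planted_partition_graph(4, 15, p_in=0.25, p_out=0.01, seed=7)  # stand-in network
for idx, comm in enumerate(greedy_modularity_communities(g)):
    sub = g.subgraph(comm)
    d = nx.density(sub)                  # fraction of possible edges that are present
    c = nx.average_clustering(sub)       # mean probability that two neighbours are linked
    print(f"community {idx}: size={len(sub)}, density={d:.2f}, clustering={c:.2f}")
```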
while not absolute, we observe that the network of a dominating organization had a "mushroom" shape of interactions with other organizations within the communities (also see figure 3a ). the dominant organizations were the central government authorities such as the mohw, the cmmh, and kcdc. the national health authorities led the mers response. other remaining communities were (1) confined geographically, (2) oriented functionally, or (3) both. first, some communities consisted of diverse organizations in the areas where two mers hospitals are located-seoul metropolitan city and gyeonggi province (communities 3 and 5). organizations in these communities span government levels and sectors within the areas affected. second, two communities consisted of organizations with different functions and performed supportive activities (community 4, also see figure 3b ). other supportive functional communities that focus on health (community 11, see figure 3c ) or foreign affairs (community 15) had a "spiderweb" shape of interactions among organizations within the communities. third, several communities consisted of a relatively small number of organizations connected to one in the center (communities 16, 17, 18, and 19) . these consisted of local fire organizations in separate jurisdictions (see figure 3d ) that were both confined geographically and oriented functionally. table 2 summarizes the characteristics of the 27 communities in the response network. in table 2 , we also note distinct interorganizational relations present within the communities. the two core response communities include both intergovernmental and intersectoral relations. 7 that is, organizations across government jurisdictions or sectors were actively involved in response to the epidemic in the communities. while diverse organizations participated in these core communities, the central government agencies led and directed other organizations, which reduced member connectivity. among the supportive functional communities, those that are confined geographically showed relatively high diversity but low connectivity (communities 3, 5, and 6 through 10). these communities included intergovernmental relations within geographic locations. secondly, communities of organizations with a specialized function showed relatively high diversity or connectivity. these included organizations from governmental and nongovernmental sectors and had no leading or dominating organizations. for example, communities 11 and 12 had intersectoral relations but no intergovernmental relations. thirdly, within each community of fire organizations in different geographic locations, one provincial or metropolitan fire headquarters was linked to multiple local fire stations in a star network. these communities, labeled igf, had low member diversity and member connectivity, while they were organizationally and functionally coherent. table 3 summarizes the results elaborated above. in addition to the division of communities along the lines of the nature of their response activities, we observe that the structural characteristics of communities with only intersectional or international relations showed high diversity and high connectivity. whenever intergovernmental relations were present in communities, however, the member connectivity was low, even if intersectoral relations appeared together within them. 
we use the community detection method to gain a better understanding of the patterns of associations among diverse response organizations in an epidemic response network. the large data sets available and increased computational power significantly transform the study of social networks and can shed light on topics such as cohesive subgroups in large networks. network studies today involve mining enormous digital data sets such as collective behavior online (hagen et al., 2018) , an e-mail corpus of a million messages (tyler, wilkinson, & buberman, 2005) , or scholars' massive citation data (kim & zhang, 2016) . the scale of erns in large disasters and emergencies is noteworthy (moynihan, 2009; waugh, 2003) , and over 1,000 organizations appeared in butts et al. (2012) study as well as in this research. their connections reflect both existing structural forms by design and by emergent needs. the computational power needed to analyze such large relational data is ever higher and the methods simpler now, which allows us to learn about the entire network. we find two important results. first, the national public health ern in korea split largely into two groups. the core response communities' characteristics were that (1) they were not confined geographically, (2) organizations were heterogeneous across jurisdictional lines as well as sectors, and (3) the community's internal structure was sparse even if intersectoral relations were present. on the other hand, supportive functional communities' characteristics were that (1) they were communities of heterogeneous organizations in the areas affected that were confined geographically; (2) the communities of intersectoral, professional organizations were heterogeneous, densely connected, and not confined geographically; and (3) the communities of traditional emergency response organizations (e.g., fire) were confined geographically, homogeneous, and connected sparsely in a centralized fashion. these findings show distinct features of the response to emerging infectious diseases. the core response communities suggest that diverse organizations across jurisdictions, sectors, and functions actually performed active and crucial mers response activities. however, these organizations' interaction and coordination inside the communities were found to be top down from the key national health authorities to all other organizations. this observation does not speak to the quality of interactions in the centralized top-down structure, but one can also ask how effective such a structure can be in a setting where diverse organizations must share authority, responsibilities, and resources. second, infectious diseases spread rapidly and can break out in multiple locations simultaneously. the subgroup patterns in response networks to infectious diseases can differ from those of location-bound natural disasters such as hurricanes and earthquakes. while some organizations may not be actively or directly involved in the response, communities of these organizations can be formed to prepare for potential outbreaks or provide support to the core response communities during the event. second, we also find that the communities' internal characteristics (diversity and connectivity) differed depending upon the types of interorganizational relations that appeared within the communities. 
based on these analytical results, two propositions about the community structure in the ern can be developed: (1) if intergovernmental relations operate in a community, the community's member connectivity may be low, regardless of member diversity. (2) if community members are functionally similar, (a) professional organization communities' (e.g., health or foreign affairs) member connectivity may be dense and (b) emergency response organization communities' (e.g., fire) member connectivity may be sparse. the results suggest that the presence of intergovernmental relations within the communities in erns may be associated with low member connectivity. however, this finding does not imply that those communities with intergovernmental relations are not organizationally or functionally cohesive. instead, we may expect a different correlation between members' functional similarity and their member connectivity depending upon the types of professions, as seen in 2(a) and (b). organizations' concerted efforts during a response to an epidemic is a prevalent issue in many countries (go & park, 2018; hodge, gostin, & vernick, 2007; seo, lee, kim, & lee, 2015; swaan et al., 2018) . the 2015 mers outbreak in south korea led to 16,693 suspected cases, 186 infected cases, and 38 deaths in the country (korea centers for disease control and prevention, 2015) . the south korean government's response to it was severely criticized for communication breakdowns, lack of leadership, and information secrecy (korea ministry of health and welfare, 2016). the findings of this study offer a practical implication for public health emergency preparedness and response in the country studied. erns' effective structure has been a fundamental question and a source of continued debate (kapucu et al., 2010; nowell, steelman, velez, & yang, 2018 ). the answer remains unclear, but the recent opinion leans toward a less centralized and hierarchical structure, given the complexity of making decisions in disruptive situations (brooks, bodeau, & fedorowicz, 2013; comfort, 2007; hart, rosenthal, & kouzmin, 1993) . our analysis shows clearly that the community structure and structures within communities in the network were highly centralized (several mushrooms) and led by central government organizations. given that the response to the outbreak was severely criticized for its poor communication and lack of coordination, it might be beneficial to include more flexibility and openness in the response system in future events. we suggest taking advice from the literature above conservatively because of the contextual differences in the event and setting. this study's limitations also deserve mention. several community detection methods have been developed with different assumptions for network partition. some algorithms take deterministic group finding approaches that partition the network based on betweenness centrality edges (girvan & newman, 2002) or information centrality edges (fortunato et al., 2004) . other algorithms take the optimization approaches we use in this article. in our side analyses, we tested three algorithms with the same data set: g-n, louvain, and leiden. the modularity scores were consistent, as reported in this article, but the number of communities in g-n and the other two algorithms differed. the deterministic group finding approach (g-n) found a substantively high number of communities. 
the modularity score can help make sense of the partition initially, but the approach is limited (reichardt & bornholdt, 2006). thus, two questions remain: which algorithm do we choose and how do we know whether the community structure is robust (karrer, levina, & newman, 2008)? in their nature, these questions do not differ from which statistical model to use given the assumptions and types of data in hand. the algorithms also require further examination and tests. while we reviewed the data sources carefully multiple times to capture the response coordination, communication, and collaboration, the process of collecting and cleaning data can never be free from human error. it was a time-consuming, labor-intensive process that required trial and error. further, the original written materials can have their own biases that reflect the source's perspective. government documents may provide richer information about the government's actions but less so about other social endeavors. media data, such as newspapers, also have their limitations as information sources to capture rich social networks. accordingly, our results must be interpreted in the context of these limitations. in conclusion, this article examines the community structure in a large ern, which is a quite new, but potentially fruitful, approach to the field. we tested a rapidly developing analytical approach to the ern to generate theoretical insights and find paths to exploit such insights for better public health emergency preparedness and response in the future. much work remains to build and refine the theoretical propositions on crisis response networks drawn from this rich case study.
notes: (1) the katrina response network consisted of 1,577 organizations and 857 connections with a mean degree (2) except for the quote, comfort and haase (2006) do not provide further explanation. (3) the incident command system was established originally for the response to fire and has been expanded to other disaster areas. (4) in the end, we found that the process was not helpful because of the volume and redundancy of content in news articles different newspapers published, which is not an issue in analysis because it can be filtered and handled easily using a network analysis tool. (5) because we had not confronted previous disaster response studies that collected network data from text materials, such as news articles and situation reports, and reported their reliability. (6) we also classified organizations based on specialty, such as quarantine, economy, police, tourism, and so on, regardless of jurisdictions. twenty-seven specialty areas were classified. we note that the result of diversity analysis using the 27 specialty areas did not differ from that using the 69 organizational groups. the correlation of the diversity indices based on the two different classification criteria was r = .967.
we report the result based on organization groups because the classification criterion can indicate better the different types of we did not measure the frequency, intensity, or quality of interorganizational relations but only the presence of either or both relations within the communities fast unfolding of communities in large networks organising for effective emergency management: lessons from research analyzing social networks network management in emergency response: articulation practices of state-level managers-interweaving up, down, and sideways interorganizational collaboration in the hurricane katrina response from linearity to complexity: emergent characteristics of the 2006 avian influenza response system in turkey comparing coordination structures for crisis management in six countries mission improbable: using fantasy documents to tame disaster crisis management in hindsight: cognition, coordination, communication communication, coherence, and collective action a comparison of three diversity indices based on their components of richness and evenness method to find community structures based on information centrality community structure in social and biological networks community structure in jazz a comparative study of infectious disease government in korea: what we can learn from the 2003 sars and the 2015 mers outbreak imagining twitter as an imagined community crisis communications in the age of social media: a network analysis of zika-related tweets crisis decision making: the centralization revisited global and domestic legal preparedness and response: 2014 ebola outbreak pandemic and all-hazards preparedness act disaster response preparedness coordination through social networks interorganizational coordination in dynamic context: networks in emergency response management examining intergovernmental and interorganizational response to catastrophic disasters: toward a network-centered approach collaborative decision-making in emergency and disaster management structure and network performance: horizontal and vertical networks in emergency management robustness of community structure in networks digital government and wicked problems subgroup analysis of an epidemic response network of organizations: 2015 mers outbreak in korea middle east respiratory syndrome coronavirus outbreak in the republic of korea the 2015 mers white paper. seoul, south korea: ministry of health and welfare efficient behavior of small-world networks blackhole: robust community detection inspired by graph drawing cliques, clubs and clans clustering and cohesion in networks: concepts and measures the network governance of crisis response: case studies of incident command systems finding and evaluating community structure in networks the structure of effective governance of disaster response networks: insights from the field when are networks truly modular? 
dynamics and control of diseases in networks with community structure public health resilience checklist for high-consequence infectious diseases-informed by the domestic ebola response in the united states epidemics crisis management systems in south korea infectious disease threats and opportunities for prevention extended structures of mediation: re-examining brokerage in dynamic networks ebola preparedness in the netherlands: the need for coordination between the public health and the curative sector leveraging intergovernmental and cross-sectoral networks to manage nuclear power plant accidents: a case study from from louvain to leiden: guaranteeing well-connected communities e-mail as spectroscopy: automated discovery of community structure within organizations social network analysis: methods and applications terrorism, homeland security and the national emergency management network a method for finding communities of related genes cdc's early response to a novel viral disease, middle east respiratory syndrome coronavirus structure and overlaps of communities in networks author biographies yushim kim is an associate professor at the school of public affairs at arizona state university and a coeditor of journal of policy analysis and management. her research examines environmental and urban policy issues and public health emergencies from a systems perspective jihong kim is a graduate student at the department of seong soo oh is an associate professor of public administration at hanyang university, korea. his research interests include public management and public sector human resource management he is an associate editor of information sciences and comsis journal. his research interests include data mining and databases her research focuses on information and knowledge management in the public sector and its impact on society, including organizational learning, the adoption of technology in the public sector, public sector data management, and data-driven decision-making in government jaehyuk cha is a professor at the department of computer and software, hanyang university, korea. his research interests include dbms, flash storage system the authors appreciate research assistance from jihyun byeon and useful comments from chan wang, haneul choi, and young jae won. the early idea of this article using partial data from news articles was presented at the 2019 dg.o research conference and published as conference proceeding (kim, kim, oh, kim, & ku, 2019) . data are available from the author at ykim@asu.edu upon request. we used python to employ the leiden community detection algorithm (see the source code: https://github.com/ vtraag/leidenalg). network measures, such as density and clustering coefficient, as well as the diversity index were calculated using python libraries (networkx, math, pandas, nump). we used gephi 0.9.2 for figures and mendeley for references. the authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. the authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: this work was supported by the national research foundation of korea grant funded by the korean government (ministry of science and ict; no. 2018r1a5a7059549). supplemental material for this article is available online. 
key: cord-103150-e9q8e62v authors: mishra, shreya; srivastava, divyanshu; kumar, vibhor title: improving gene-network inference with graph-wavelets and making insights about ageing associated regulatory changes in lungs date: 2020-11-04 journal: biorxiv doi: 10.1101/2020.07.24.219196 sha: doc_id: 103150 cord_uid: e9q8e62v using gene-regulatory-networks based approach for single-cell expression profiles can reveal un-precedented details about the effects of external and internal factors. however, noise and batch effect in sparse single-cell expression profiles can hamper correct estimation of dependencies among genes and regulatory changes. here we devise a conceptually different method using graph-wavelet filters for improving gene-network (gwnet) based analysis of the transcriptome. our approach improved the performance of several gene-network inference methods. most importantly, gwnet improved consistency in the prediction of generegulatory-network using single-cell transcriptome even in presence of batch effect. consistency of predicted gene-network enabled reliable estimates of changes in the influence of genes not highlighted by differential-expression analysis. applying gwnet on the single-cell transcriptome profile of lung cells, revealed biologically-relevant changes in the influence of pathways and master-regulators due to ageing. surprisingly, the regulatory influence of ageing on pneumocytes type ii cells showed noticeable similarity with patterns due to effect of novel coronavirus infection in human lung. inferring gene-regulatory-networks and using them for system-level modelling is being widely used for understanding the regulatory mechanism involved in disease and development. the interdependencies among variables in the network is often represented as weighted edges between pairs of nodes, where edge weights could represent regulatory interactions among genes. gene-networks can be used for inferring causal models [1] , designing and understanding perturbation experiments, comparative analysis [2] and drug discovery [3] . due to wide applicability of network inference, many methods have been proposed to estimate interdependencies among nodes. most of the methods are based on pairwise correlation, mutual information or other similarity metrics among gene expression values, provided in a different condition or time point. however, resulting edges are often influenced by indirect dependencies owing to low but effective background similarity in patterns. in many cases, even if there is some true interaction among a pair of nodes, its effect and strength is not estimated properly due to noise, background-pattern similarity and other indirect dependencies. hence recent methods have started using alternative approaches to infer more confident interactions. such alternative approach could be based on partial correlations [4] or aracne's method of statistical threshold of mutual information [5] . 1 single-cell expression profiles often show heterogeneity in expression values even in a homogeneous cell population. such heterogeneity can be exploited to infer regulatory networks among genes and identify dominant pathways in a celltype. however, due to the sparsity and ambiguity about the distribution of gene expression from single-cell rna-seq profiles, the optimal measures of gene-gene interaction remain unclear. hence recently, sknnider et al. [6] evaluated 17 measures of association to infer gene co-expression based network. 
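as a baseline for the pairwise association measures discussed above, a minimal python sketch of correlation-based co-expression network inference (illustrative only: the expression matrix is simulated and the threshold is arbitrary; this is not the gwnet implementation):

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# hypothetical expression matrix: rows = genes, columns = cells
rng = np.random.default_rng(0)
expr = pd.DataFrame(rng.poisson(2.0, size=(50, 200)),
                    index=[f"gene{i}" for i in range(50)])

# gene-by-gene spearman correlation (cells are observations)
rho, _ = spearmanr(expr.T.values)
np.fill_diagonal(rho, 0.0)

# keep only the strongest associations as candidate edges
threshold = np.quantile(np.abs(rho), 0.99)
edges = [(expr.index[i], expr.index[j], round(rho[i, j], 3))
         for i in range(len(expr)) for j in range(i + 1, len(expr))
         if abs(rho[i, j]) >= threshold]
print(f"{len(edges)} candidate edges with |rho| >= {threshold:.2f}")
```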
in their analysis, they found two measures of association, namely phi and rho as having the best performance in predicting co-expression based gene-gene interaction using scrna-seq profiles. in another study, chen et al. [7] performed independent evaluation of a few methods proposed for genenetwork inference using scrna-seq profiles such as scenic [8] , scode [9] , pidc [10] . chen et al. found that for single-cell transcriptome profiles either generated from experiments or simulations, these methods had a poor performance in reconstructing the network. performance of such methods can be improved if gene-expression profiles are denoised. thus the major challenge of handling noise and dropout in scrna-seq profile is an open problem. the noise in single-cell expression profiles could be due to biological and technical reasons. the biological source of noise could include thermal fluctuations and a few stochastic processes involved in transcription and translation such as allele specific expression [11] and irregular binding of transcription factors to dna. whereas technical noise could be due to amplification bias and stochastic detection due to low amount of rna. raser and o'shea [12] used the term noise in gene expression as measured level of its variation among cells supposed to be identical. raser and o'shea categorised potential sources of variation in geneexpression in four types : (i) the inherent stochasticity of biochemical processes due to small numbers of molecules; (ii) heterogeneity among cells due to cell-cycle progression or a random process such as partitioning of mitochondria (iii) subtle micro-environmental differences within a tissue (iv) genetic mutation. overall noise in gene-expression profiles hinders in achieving reliable inference about regulation of gene activity in a cell-type. thus, there is demand for pre-processing methods which can handle noise and sparsity in scrna-seq profiles such that inference of regulation can be reliable. the predicted gene-network can be analyzed further to infer salient regulatory mechanisms in a celltype using methods borrowed from graph theory. calculating gene-importance in term of centrality, finding communities and modules of genes are common downstream analysis procedures [2] . just like gene-expression profile, inferred gene network could also be used to find differences in two groups of cells(sample) [13] to reveal changes in the regulatory pattern caused due to disease, environmental exposure or ageing. in particular, a comparison of regulatory changes due to ageing has gained attention recently due to a high incidence of metabolic disorder and infection based mortality in the older population. especially in the current situation of pandemics due to novel coronavirus (sars-cov-2), when older individuals have a higher risk of mortality, a question is haunting researchers. that question is: why old lung cells have a higher risk of developing severity due to sars-cov-2 infection. however, understanding regulatory changes due to ageing using gene-network inference with noisy single-cell scrna-seq profiles of lung cells is not trivial. thus there is a need of a noise and batch effect suppression method for investigation of the scrna-seq profile of ageing lung cells [14] using a network biology approach. here we have developed a method to handle noise in gene-expression profiles for improving genenetwork inference. our method is based on graphwavelet based filtering of gene-expression. 
our approach is not meant to overlap or compete with existing network inference methods but its purpose is to improve their performance. hence, we compared other output of network inference methods with and without graph-wavelet based pre-processing. we have evaluated our approach using several bulk sample and single-cell expression profiles. we further investigated how our denoising approach influences the estimation of graph-theoretic properties of gene-network. we also asked a crucial question: how the gene regulatory-network differs between young and old individual lung cells. further, we compared the pattern in changes in the influence of genes due to ageing with differential expression in covid infected lung. our method uses a logic that cells (samples) which are similar to each other, would have a more similar expression profile for a gene. hence, we first make a network such that two cells are connected by an edge if one of them is among the top k nearest neighbours (knn) of the other. after building knn-based network among cells (samples), we use graph-wavelet based approach to filter expression of one gene at a time (see fig. 1 ). for a gene, we use its expression as a signal on the nodes of the graph of cells. we apply a graph-wavelet transform to perform spectral decomposition of graph-signal. after graph-wavelet transformation, we choose the threshold for wavelet coefficients using sureshrink and bayesshrink or a default percentile value determined after thorough testing on multiple data-sets. we use the retained values of the coefficient for inverse graph-wavelet transformation to reconstruct a filtered expression matrix of the gene. the filtered gene-expression is used for gene-network inference and other down-stream process of analysis of regulatory differences. for evaluation purpose, we have calculated inter-dependencies among genes using 5 different co-expression measurements, namely pearson and spearman correlations, φ and ρ scores and aracne. the biological and technical noise can both exist in a bulk sample expression profile ( [12] ). in order to test the hypothesis that graph-based denoising could improve gene-network inference, we first evaluated the performance of our method on bulk expression data-set. we used 4 data-sets made available by dream5 challenge consortium [15] . three data-sets were based on the original expression profile of bacterium escherichia coli and the single-celled eukaryotes saccharomyces cerevisiae and s aureus. while the fourth data-set was simulated using in silico network with the help of genenetweaver, which models molecular noise in transcription and translation using chemical langevin equation [16] . the true positive interactions for all the four data-sets are also available. we compared graph fourier based low passfiltering with graph-wavelet based denoising using three different approaches to threshold the waveletcoefficients. we achieved 5 -25 % improvement in score over raw data based on dream5 criteria [15] with correlation, aracne and rho based network prediction. with φ s based gene-network prediction, there was an improvement in 3 out of 4 dream5 data-sets ( fig. 2a) . all the 5 network inference methods showed improvement after graphwavelet based denoising of simulated data (in silico) from dream5 consortium ( fig. 2a) . moreover, graph-wavelet based filtering had better performance than chebyshev filter-based low pass filtering in graph fourier domain. 
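a simplified sketch of the knn-graph plus graph-wavelet filtering workflow described above is given below. it makes several assumptions for brevity: a dense eigendecomposition of the normalized laplacian of the cell graph, a simple band-pass wavelet kernel plus one low-pass kernel, and a plain hard threshold at a fixed percentile instead of sureshrink or bayesshrink. it is an illustration of the idea, not the authors' gwnet code.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def graph_wavelet_denoise(expr, k=10, scales=(0.5, 1.5, 4.0), pct=70):
    """denoise a cells x genes matrix by filtering each gene on a knn cell graph."""
    # 1. knn graph among cells and its symmetric normalized laplacian
    a = kneighbors_graph(expr, n_neighbors=k, mode="connectivity").toarray()
    a = np.maximum(a, a.T)                                  # symmetrize
    deg = a.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(len(a)) - d_inv_sqrt @ a @ d_inv_sqrt

    # 2. graph fourier basis: eigenvectors of the laplacian
    lam, u = np.linalg.eigh(lap)

    # 3. wavelet kernels g(s * lambda) (band-pass) plus a low-pass scaling kernel
    kernels = [np.exp(-(2.0 * lam) ** 2)]
    kernels += [(s * lam) * np.exp(1.0 - s * lam) for s in scales]
    g_sum = sum(g ** 2 for g in kernels) + 1e-12            # frame normalization

    denoised = np.zeros_like(expr, dtype=float)
    for j in range(expr.shape[1]):                          # filter one gene at a time
        x = expr[:, j].astype(float)
        x_hat = u.T @ x                                     # forward transform
        recon_hat = np.zeros_like(x_hat)
        for g in kernels:
            coeff = u @ (g * x_hat)                         # wavelet coefficients (vertex domain)
            thr = np.percentile(np.abs(coeff), pct)         # hard threshold at a percentile
            coeff = np.where(np.abs(coeff) >= thr, coeff, 0.0)
            recon_hat += (g / g_sum) * (u.T @ coeff)        # dual-frame synthesis
        denoised[:, j] = u @ recon_hat                      # back to the vertex domain
    return denoised

# hypothetical usage on a small random cells x genes matrix
rng = np.random.default_rng(1)
expr = rng.poisson(1.0, size=(100, 30)).astype(float)
filtered = graph_wavelet_denoise(expr, k=10)
print(filtered.shape)
```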
it highlights the fact that even bulk sample data of gene-expression can have noise and denoising it with graph-wavelet after making knn based graph among samples has the potential to improve gene-network inference. moreover, it also highlights another fact, well known in the signal processing field, that wavelet-based filtering is more adaptive than low pass-filtering. in comparison to bulk samples, there is a higher level of noise and dropout in single-cell expression profiles. dropouts are caused by non-detection of true expression due to technical issues. using low-pass filtering after graph-fourier transform seems to be an obvious choice as it fills in a background signal at missing values and suppresses high-frequency outlier-signal [17] . however, in the absence of information about cell-type and cellstates, a blind smoothing of a signal may not prove to be fruitful. hence we applied graph-wavelet based filtering for processing gene-expression dataset from the scrna-seq profile. we first used scrna-seq data-set of mouse embryonic stem cells (mescs) [18] . in order to evaluate network inference in an unbiased manner, we used gene regulatory interactions compiled by another research group [19] . our approach of graph-wavelet based pre-processing of mesc scrna-seq data-set improved the performance of gene-network inference methods by 8-10 percentage (fig. 2b) . however, most often, the gold-set of interaction used for evaluation of gene-network inference is incomplete, which hinders the true assessment of improvement. figure 1 : the flowchart of gwnet pipeline. first, a knn based network is made between samples/cell. a filter for graph wavelet is learned for the knn based network of samples/cells. gene-expression of one gene at a time is filtered using graph-wavelet transform. filtered gene-expression data is used for network inference. the inferred network is used to calculate centrality and differential centrality among groups of cells. figure 2 : improvement in gene-network inference by graph-wavelet based denoising of gene-expression (a) performance of network inference methods using bulk gene-expression data-sets of dream5 challenge. three different ways of shrinkage of graph-wavelet coefficients were compared to graph-fourier based low pass filtering. the y-axis shows fold change in area under curve(auc) for receiver operating characteristic curve (roc) for overlap of predicted network with golden-set of interactions. for hard threshold, the default value of 70% percentile was used. (b) performance evaluation using single-cell rna-seq (scrna-seq) of mouse embryonic stem cells (mescs) based network inference after filtering the gene-expression. the gold-set of interactions was adapted from [19] (c) comparison of graph wavelet-based denoising with other related smoothing and imputing methods in terms of consistency in the prediction of the gene-interaction network. here, phi (φ s ) score was used to predict network among genes. for results based on other types of scores see supplementary figure s1 . predicted networks from two scrna-seq profile of mesc were compared to check robustness towards the batch effect. hence we also used another approach to validate our method. for this purpose, we used a measure of overlap among network inferred from two scrna-seq data-sets of the same cell-type but having different technical biases and batch effects. if the inferred networks from both data-sets are closer to true gene-interaction model, they will show high overlap. 
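the comparison of predicted edges against a gold-standard set of interactions, summarized as an auroc, can be sketched as follows; the edge scores and the gold set here are randomly generated placeholders.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import roc_auc_score

genes = [f"g{i}" for i in range(30)]
pairs = list(combinations(genes, 2))
rng = np.random.default_rng(2)

# hypothetical predicted association score for every gene pair
y_score = rng.random(len(pairs))

# hypothetical gold-standard interactions (indices into the pair list)
gold = set(rng.choice(len(pairs), size=40, replace=False))
y_true = [1 if i in gold else 0 for i in range(len(pairs))]

print("auroc against the gold standard:", round(roc_auc_score(y_true, y_score), 3))
```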
for this purpose, we used two scrna-seq data-sets of mesc generated using two different protocols (smart-seq and drop-seq). for comparison of consistency and performance, we also used a few other imputation and denoising methods proposed to filter and predict missing expression values in scrna-seq profiles. we evaluated 7 other such methods: graph-fourier based filtering [17], magic [20], scimpute [21], dca [22], saver [23], randomly [24] and knn-impute [25]. graph-wavelet based denoising provided a larger improvement in auc for overlap of the predicted network with known interactions than the other 7 methods meant for imputing and filtering scrna-seq profiles (supplementary figure s1a). similarly, in comparison to graph-wavelet based denoising, the other 7 methods did not provide substantial improvement in auc for overlap among gene-networks inferred from the two mesc data-sets (fig. 2c, supplementary figure s1b). in contrast, graph-wavelet based filtering improved the overlap between networks inferred from different batches of mesc scrna-seq profiles even though they were denoised separately (fig. 2c, supplementary figure s1b). with φs based edge scores, the overlap among predicted gene-networks increased by 80% due to graph-wavelet based denoising (fig. 2c). the improvement in overlap among networks inferred from two batches hints that graph-wavelet denoising is different from imputation methods and has the potential to substantially improve gene-network inference from single-cell expression profiles.

improved gene-network inference from single-cell profiles reveals age-based regulatory differences

the improvement in overlap among gene-networks inferred from two expression data-sets for a cell-type also hints that, after denoising, predicted networks are closer to true gene-interaction profiles. hence, using our denoising approach before estimating the differences in inferred gene-networks due to age or external stimuli could better reflect true changes in the regulatory pattern. such a notion inspired us to compare gene-networks inferred for young and old pancreatic cells using their scrna-seq profiles filtered by our tool [26]. martin et al. defined three age groups, namely juvenile (1 month-6 years), young adult (21-22 years) and aged (38-54 years) [26]. we applied graph-wavelet based denoising to pancreatic cells from the three age groups separately; in other words, we did not mix cells from different age groups while denoising. graph-wavelet based denoising of the single-cell profiles of pancreatic cells led to better performance in terms of overlap with protein-protein interactions (ppi) (fig. 3a, supplementary figure s2a). even though, like chen et al. [7], we have used ppi to measure improvement in gene-network inference, it may not be reflective of all gene-interactions. hence we also used the criterion of increased overlap among predicted networks for the same cell-type to evaluate our method on scrna-seq profiles of pancreatic cells. denoising scrna-seq profiles also increased the overlap between gene-networks inferred for pancreatic cells of old and young individuals (fig. 3b, supplementary figure s2b). we performed quantile normalization of the original and denoised expression matrices, taking all 3 age groups together to bring them onto the same scale, and calculated the variance of expression across cells for every gene. old and young-adult pancreatic alpha cells had a higher median variance of gene expression than juvenile cells.
however, after graph-wavelet based denoising, the variance levels of genes across all 3 age groups became almost equal, with similar median values (fig. 3c). notice that it is not trivial to estimate the fraction of variance due to transcriptional or technical noise. nonetheless, graph-wavelet based denoising seemed to have reduced the noise level in single-cell expression profiles of old and young adults. differential centrality in the co-expression network has been used to study changes in the influence of genes. however, noise in single-cell expression profiles can cause spurious differences in centrality. hence we visualized the differential degree of genes in networks inferred using scrna-seq profiles of young and old cells. the networks inferred from non-filtered expression had a much higher number of non-zero differential-degree values than the denoised version (fig. 3d, supplementary figure s2c). thus denoising seems to reduce differences in centrality that could be due to the randomness of noise. next, we analyzed the properties of genes whose variance dropped the most due to graph-wavelet based denoising. surprisingly, we found that the top 500 genes with the highest drop in variance due to denoising in old pancreatic beta cells were significantly associated with diabetes mellitus and hyperinsulinism, whereas the top 500 genes with the highest drop in variance in young pancreatic beta cells had no or insignificant association with diabetes (fig. 3e). a similar trend was observed with pancreatic alpha cells (supplementary figure s2d). such a result hints that ageing increases the stochasticity of the expression of genes associated with pancreas function, and that denoising could help in properly elucidating their dependencies with other genes.

improvement in gene-network inference for studying regulatory differences among young and old lung cells

studying cell-type-specific changes in regulatory networks due to ageing has the potential to provide better insight into predisposition to disease in the older population. hence we inferred gene-networks for different cell-types using scrna-seq profiles of young and old mouse lung cells published by kimmel et al. [14]. the lower lung epithelium, where a few viruses seem to have their most deteriorating effect, consists of multiple types of cells such as bronchial and alveolar epithelial cells, fibroblasts, alveolar macrophages, endothelial cells and other immune cells. the alveolar epithelial cells, also called pneumocytes, are of two major types. type 1 alveolar (at1) epithelial cells form the major gas-exchange surface of the lung alveolus and have an important role in the permeability barrier function of the alveolar membrane. type 2 alveolar cells (at2) are the progenitors of type 1 cells and have the crucial role of surfactant production. at2 cells (or type ii pneumocytes) are a prime target of many viruses; hence it is important to understand the regulatory patterns in at2 cells, especially in the context of ageing. we applied our denoising method to scrna-seq profiles of cells derived from old and young mouse lungs [14]. graph-wavelet based denoising led to an increase in consistency among inferred gene-networks for young and old mouse lung for multiple cell-types (fig. 4a). it also led to an increase in consistency of predicted gene-networks from data-sets published by two different groups (fig. 4b).
the increase in overlap of gene-networks predicted from old and young cell scrna-seq profiles, despite being denoised separately, hints at a higher likelihood of predicting true interactions. hence gene-network based differences found between old and young cells were less likely to be dominated by noise. we studied ageing-related changes in the pagerank centrality of nodes (genes). since pagerank centrality provides a measure of the "popularity" of nodes, studying its change has the potential to highlight changes in the influence of genes. first, we calculated the differential pagerank of genes between young and old at2 cells (supporting file-1) and performed gene-set enrichment analysis using enrichr [27]. the top 500 genes with higher pagerank in young at2 cells had enriched terms related to integrin signalling, 5ht2-type receptor-mediated signalling, the h1 histamine receptor-mediated signalling pathway, vegf signalling, cytoskeletal regulation by rho gtpase and thyrotropin-releasing hormone receptor signalling (fig. 4c). we ignored the oxytocin and thyrotropin-releasing hormone receptor-mediated signalling pathways as artefacts, as the expression of oxytocin and trh receptors in at2 cells was low. moreover, genes appearing under the terms "oxytocin receptor-mediated signalling" and "thyrotropin-releasing hormone receptor-mediated signalling" were also present in the gene-set for the 5ht2-type receptor-mediated signalling pathway. we found literature support for the activity of most of the enriched pathways in at2 cells, although very few studies have shown their differential importance in old and young cells. for example, bayer et al. demonstrated mrna expression of several 5-ht receptors, including 5-ht2, 5-ht3 and 5-ht4, in alveolar epithelial type ii (at2) cells and their role in calcium ion mobilization. similarly, chen et al. [28] showed that a histamine 1 receptor antagonist reduced pulmonary surfactant secretion from adult rat at2 cells in primary culture. the vegf pathway is active in at2 cells, and it is known that ageing affects vegf-mediated angiogenesis in the lung. moreover, vegf-based angiogenesis is known to decline with age [29].

figure 3 (partial caption, panels b-e): (b) for comparing two networks it is important to reduce differences due to noise; hence the plot shows the similarity of predicted networks before and after graph-wavelet based denoising. the results shown are for the correlation-based co-expression network, while similar results using the ρ score are shown in supplementary figure s2. (c) variances of expression of genes across single cells before and after denoising (filtering). variances of genes in a cell-type were calculated separately for the 3 stages of ageing (young, adult and old). the variance (an estimate of noise) is higher in older alpha and beta cells than in young cells; however, after denoising the variance of genes in all ageing stages becomes equal. (d) effect of noise on estimated differential centrality: the difference in the degree of genes in networks estimated for old and young pancreatic beta cells. the number of non-zero differential degrees estimated using denoised expression is lower than with unfiltered expression based networks. (e) enriched panther pathway terms for the top 500 genes with the highest drop in variance after denoising in old and young pancreatic beta cells.

we further performed gene-set enrichment analysis for genes with increased pagerank in older mouse at2 cells.
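as a side note, the differential pagerank/degree step used above can be sketched with standard graph tooling. this is a hypothetical illustration, not the gwnet code; the edge lists and gene names are placeholders, and the choice of an undirected weighted graph is an assumption.

```python
# Sketch: differential PageRank and degree between networks inferred for old and young cells.
import networkx as nx

def centrality_table(edges):
    # edges: iterable of (gene_i, gene_j, weight) for one inferred co-expression network
    G = nx.Graph()
    G.add_weighted_edges_from(edges)
    return nx.pagerank(G, weight='weight'), dict(G.degree())

pr_old, deg_old = centrality_table(edges_old_at2)        # placeholder edge lists
pr_young, deg_young = centrality_table(edges_young_at2)

genes = set(pr_old) | set(pr_young)
diff_pagerank = {g: pr_old.get(g, 0) - pr_young.get(g, 0) for g in genes}
diff_degree   = {g: deg_old.get(g, 0) - deg_young.get(g, 0) for g in genes}
# genes with the largest gain in influence in old cells, e.g. for enrichment analysis
top_old = sorted(diff_pagerank, key=diff_pagerank.get, reverse=True)[:500]
```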
for the top 500 genes with higher pagerank in old at2 cells, the terms which appeared among the 10 most enriched in both the kimmel et al. and angelidis et al. data-sets were t cell activation, b cell activation, cholesterol biosynthesis, the fgf signalling pathway, angiogenesis and cytoskeletal regulation by rho gtpase (fig. 4d). thus, there was a 60% overlap between the kimmel et al. and angelidis et al. results in terms of enriched pathway terms for genes with higher pagerank in older at2 cells (supplementary figure s3a, supporting file-2, supporting file-3). overall in our analysis, inflammatory response genes showed higher importance in older at2 cells. the increase in the importance of cholesterol biosynthesis genes, hand in hand with a higher inflammatory response, points towards an influence of ageing on the quality of pulmonary surfactant released by at2 cells. al saedy et al. recently showed that a high level of cholesterol amplifies defects in surface activity caused by oxidation of pulmonary surfactant [30]. we also performed enrichr based analysis of differentially expressed genes in old at2 cells (supporting file-4). for genes up-regulated in old at2 cells compared to young, terms which reappeared were cholesterol biosynthesis, t cell and b cell activation pathways, angiogenesis and inflammation mediated by chemokine and cytokine signalling. however, a few terms like the ras pathway, jak/stat signalling and cytoskeletal signalling by rho gtpase did not appear as enriched for genes up-regulated in old at2 cells (figure 3b, supporting file-4). however, it has previously been shown that increasing age changes the balance of the pulmonary renin-angiotensin system (ras), which is correlated with aggravated inflammation and more lung injury [31]. the jak/stat pathway is known to be involved in the oxidative-stress-induced decrease in the expression of surfactant protein genes in at2 cells [32]. overall, these results indicate that even though the expression of genes involved in relevant pathways may not show significant differences due to ageing, their regulatory influence could be changing substantially. to gain further insight, we analyzed changes in the importance of transcription factors in ageing at2 cells. among the top 500 genes with higher pagerank in old at2 cells, we found several relevant tfs. however, to make a stringent list, we considered only those tfs which had a non-zero change in degree between the gene-networks for old and young at2 cells. overall, with the kimmel et al. data-set, we found 46 tfs with a change in pagerank and degree due to ageing in at2 cells (supplementary table-1, fig. 4e). the changes in centrality (pagerank and degree) of tfs with ageing were coherent with the pathway enrichment results. for example, etv5, which has higher degree and pagerank in older cells, is known to be stabilized by ras signalling in at2 cells [33]. in the absence of etv5, at2 cells differentiate into at1 cells [33]. another tf with stronger influence in old at2 cells, jun (c-jun), is known to regulate inflammation in lung alveolar cells [34]. we also found jun to be co-expressed with jund and etv5 in old at2 cells (supplementary figure s4). jund, whose influence seems to increase in aged at2 cells, is known to be involved in cytokine-mediated inflammation. among the tfs stat1-4, which are involved in jak/stat signalling, stat4 showed higher degree and pagerank in old at2 cells. the androgen receptor (ar) also seems to have a higher influence in older at2 cells (fig. 4e).
the androgen receptor has been shown to be expressed in at2 cells [35]. we further performed a similar analysis for the scrna-seq profiles of interstitial macrophages (ims) in the lung and found literature support for the activity of the enriched pathways (supporting file-5). the gene-set enrichment output for important genes in older ims had some similarity with the results from at2 cells, as both show higher pro-inflammatory response pathways such as t cell activation and jak/stat signalling. however, unlike in at2 cells, ageing in ims seems to cause an increase in glycolysis and the pentose phosphate pathway. higher glycolysis and pentose phosphate pathway activity levels have previously been reported to be involved in the pro-inflammatory response in macrophages by viola et al. [36]. in our results, the ras pathway was not significantly enriched for genes with higher importance in older macrophages. such results show that the pro-inflammatory pathways activated due to ageing could vary among different cell-types in the lung.

figure 4 (partial caption): for the same type of cells, the predicted networks for old and young cells show higher overlap after graph-wavelet based filtering. the label "raw" means that both networks (for old and young) were inferred using unfiltered scrna-seq profiles, whereas the same result from denoised scrna-seq profiles is shown as "filtered". networks were inferred using correlation-based co-expression.

in the current pandemic due to sars-cov-2, a trend has emerged that older individuals have a higher risk of developing severe disease and lung fibrosis than the younger population. since our analysis revealed changes in the influence of genes in lung cells due to ageing, we compared our results with expression profiles of lungs infected with sars-cov-2 published by blanco-melo et al. [37]. recently it has been shown that at2 cells predominantly express ace2, the host cell surface receptor for sars-cov-2 attachment and infection [38]. thus covid infection could have its most dominant effect on at2 cells. we found that genes with significant up-regulation in sars-cov-2 infected lung also had higher pagerank in the gene-network inferred for older at2 cells (fig. 5a). we also repeated the process of network inference and differential-centrality calculation between old and young using all types of lung cells together (supporting file-6). we performed gene-set enrichment for genes up-regulated in sars-cov-2 infected lung. the majority of the 7 panther pathway terms enriched for these genes also showed enrichment for genes with higher pagerank in old lung cells (all cell-types combined). in total, 6 out of 7 significantly enriched panther pathways for genes up-regulated in covid-19 infected lung were also enriched for genes with higher pagerank in older at2 cells in either of the two data-sets used here (5 in the angelidis et al. and 3 in the kimmel et al. based results). among the top 10 enriched wikipathway terms for genes up-regulated in covid-infected lung, 7 had significant enrichment for genes with higher pagerank in old at2 cells (supporting file-7). however, the term type-ii interferon signalling did not have significant enrichment for genes with higher pagerank in old at2 cells. we further investigated enriched transcription factor motifs in the promoters of genes up-regulated in covid-infected lungs (supplementary methods). for these promoters, the top two enriched motifs belonged to the irf (interferon regulatory factor) and ets families of tfs.
notice that etv5 belongs to a sub-family of the ets group of tfs. further analysis also revealed that most of the genes whose expression is positively correlated with etv5 in old at2 cells are up-regulated in covid-infected lung. in contrast, genes negatively correlated with etv5 in old at2 cells were mostly down-regulated in covid-infected lung. a similar trend was found for the stat4 gene. however, for erg, a gene with higher pagerank in young at2 cells, the trend was the opposite: in comparison to negatively correlated genes, genes positively correlated with erg in old at2 cells showed more down-regulation in covid-infected lung. such a trend shows that a few tfs like etv5 and stat4 with higher pagerank in old at2 cells could play a role in poising or activating genes that gain higher expression upon covid infection.

inferring regulatory changes in pure primary cells due to ageing and other conditions using single-cell expression profiles has tremendous potential for various applications. such applications include understanding the cause of development of a disorder or revealing signalling pathways and master regulators as potential drug targets. hence, to support such studies, we developed gwnet to assist biologists with a workflow for graph-theory based analysis of single-cell transcriptomes. gwnet improves the inference of regulatory interactions among genes using a graph-wavelet based approach to reduce noise due to technical issues or cellular biochemical stochasticity in gene-expression profiles. we demonstrated the improvement in gene-network inference using our filtering approach with 4 benchmark data-sets from the dream5 consortium and several single-cell expression profiles. using 5 different ways of inferring networks, we showed how our approach for filtering gene-expression can help gene-network inference methods. our comparisons with other imputation and smoothing methods and with graph-fourier based filtering showed that graph-wavelets are more adaptive to changes in the expression level of genes across changing neighbourhoods of cells. thus graph-wavelet based denoising is a conceptually different approach for pre-processing of gene-expression profiles. there is a huge body of literature on inferring gene-networks from bulk gene-expression profiles and utilizing them to find differences between two groups of samples. however, applying classical procedures to single-cell transcriptome profiles has not proved to be effective.

figure 5 (partial caption): … shown for erg, which has higher pagerank in young at2 cells. most of the genes which had a positive correlation with etv5 and stat4 expression in old murine at2 cells were up-regulated in covid-infected lung, whereas for erg the trend is the opposite: genes positively correlated with erg in old at2 cells showed more down-regulation than negatively correlated genes. such results hint that tfs whose influence (pagerank) increases during ageing could be involved in activating or poising the genes up-regulated in covid infection.

our method seems to resolve this issue by increasing consistency and overlap among gene-networks inferred using expression data from different sources (batches) for the same cell-type, even if each data-set was filtered independently. such an increase in overlap among predicted networks from independently processed data-sets from different sources hints that the estimated dependencies among genes come closer to their true values after graph-wavelet based denoising of expression profiles.
having network predictions closer to true values increases the reliability of comparing regulatory patterns between two groups of cells. moreover, chow and chen [39] recently showed that age-associated genes identified using bulk expression profiles of the lung are enriched among those induced or suppressed by sars-cov-2 infection. however, they did not perform their analysis with a systems-level approach. our analysis highlighted the ras and jak/stat pathways as enriched for genes with stronger influence in old at2 cells and for genes up-regulated in covid-infected lung. ras/mapk signalling is considered essential for the self-renewal of at2 cells [33]. similarly, the jak/stat pathway is known to be activated in the lung during injury [40] and to influence surfactant quality [32]. we have used murine ageing-lung scrna-seq profiles; nevertheless, our analysis provides the important insight that regulatory patterns and master regulators in old at2 cells are in a configuration that could predispose them to higher levels of ras and jak/stat signalling. the androgen receptor (ar), which has been implicated in male pattern baldness and in the increased risk of males towards covid infection [41], had higher pagerank and degree in old at2 cells. however, further investigation is needed to associate ar with the severity of covid infection due to ageing. on the other hand, in young at2 cells we find a high influence of genes involved in histamine h1 receptor-mediated signalling, which is known to regulate allergic reactions in the lung [42]. another benefit of our approach is that it can highlight a few specific targets for further therapeutic study. for example, jnk, a kinase that binds and phosphorylates c-jun, is being tested in clinical trials for pulmonary fibrosis [43]. androgen deprivation therapy has been shown to provide partial protection against sars-cov-2 infection [44]. along the same lines, our analysis hints that etv5 could also be considered as a drug target to reduce the effect of ageing-induced ras pathway activity in the lung. we used the term noise in gene-expression according to its definition by several researchers, such as raser and o'shea [12]: the measured level of variation in gene-expression among cells supposed to be identical. hence we first made a base-graph (network) where supposedly identical cells are connected by edges. for every gene we use this base-graph and apply the graph-wavelet transform to get an estimate of the variation of its expression in every sample (cell) with respect to other connected samples at different levels of graph-spectral resolution. for this purpose, we first calculated distances among samples (cells). to get a better estimate of these distances, one can perform dimension reduction of the expression matrix using t-sne [45] or principal component analysis. we considered every sample (cell) as a node in the graph and connected two nodes with an edge only when one of them was among the k nearest neighbours of the other. here we set the value of k in the range 10-50, based on the number of samples (cells) in the expression data-set. thus we calculated a preliminary adjacency matrix using k nearest neighbours (knn) based on the euclidean distance between samples of the expression matrix, and used this adjacency matrix to build a base-graph. each vertex in the base-graph corresponds to a sample, and edge weights correspond to the euclidean distance between them.
the weighted graph g built using the knn-based adjacency matrix comprises a finite set of vertices v corresponding to cells (samples), a set of edges e denoting connections between samples (where they exist) and a weight function which gives non-negative weighted connections between cells (samples). this can also be defined as an n × n (n being the number of cells) weighted adjacency matrix a, where a_ij = 0 if there is no edge between cells i and j, and a_ij = weight(i, j) otherwise. the degree of a cell in the graph is the sum of the weights of the edges incident on that cell, and the diagonal degree matrix d of the graph has entries d_ii = d(i) and zero elsewhere. the non-normalized graph laplacian operator l for a graph is defined as l = d − a, and its normalized form is defined as l_norm = d^(-1/2) l d^(-1/2) = i − d^(-1/2) a d^(-1/2). the two laplacian operators produce different eigenvectors [46]; here, we have used the normalized form of the laplacian operator for the graph between cells. the graph laplacian is further used for the graph fourier transformation of signals on nodes (see supplementary methods) [46, 47]. for filtering in the fourier domain, we used a chebyshev filter on the gene-expression profile. we took the expression of one gene at a time, considered it as a signal and projected it onto the raw graph (where each vertex corresponds to a sample) [17]. we computed the forward graph fourier transform of the signal, filtered it using the chebyshev filter in the fourier domain and then inverse-transformed the signal to obtain the filtered expression. the same procedure was repeated for every gene, finally giving the filtered gene-expression matrix. the spectral graph-wavelet transform entails choosing a non-negative real-valued kernel function g which behaves as a band-pass filter, analogous to the fourier transform. the re-scaled kernel function of the graph laplacian gives a wavelet operator which produces graph-wavelet coefficients at each scale. using continuous functional calculus, one can define a function of a self-adjoint operator on the basis of the spectral representation of the graph; for a graph with a finite-dimensional laplacian, this can be achieved through the eigenvalues λ_l and eigenvectors u_l of the laplacian l [47]. the wavelet operator is given by t_g = g(l); t_g f gives the wavelet coefficients of a signal f at scale 1. this operator acts on the eigenvectors as t_g u_l = g(λ_l) u_l. hence, for any graph signal f, the operator t_g acts by adjusting each graph fourier coefficient: the fourier coefficient of t_g f at frequency λ_l equals g(λ_l) times the fourier coefficient f̂(λ_l) of f, with the inverse fourier transform given as f(n) = Σ_l f̂(λ_l) u_l(n). the wavelet operator at scale s is given as t_g^s = g(s l). these wavelet operators are localized to obtain individual wavelets by applying them to δ_n, with δ_n being a signal with value 1 on vertex n and zero otherwise [47]. considering the coefficients at every scale (together with the scaling-function coefficients), the inverse transform is obtained by combining them through the corresponding synthesis operator [47]. here, instead of filtering in the fourier domain, we took the wavelet coefficients of each gene-expression signal at different scales. thresholding was applied at each scale to filter the wavelet coefficients. we applied both hard and soft thresholding. for soft thresholding, we implemented the well-known sureshrink and bayesshrink methods. finding an optimal threshold for wavelet coefficients when denoising linear signals and images has remained a subject of intensive research.
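for clarity, the two thresholding rules discussed here and in the next paragraph can be written as small helper functions. this is a generic sketch: only the default percentile-based hard threshold and the standard soft-thresholding rule are shown; the sureshrink, bayesshrink and mdl threshold-selection procedures are not reproduced.

```python
# Sketch: hard and soft thresholding of graph-wavelet coefficients.
import numpy as np

def hard_threshold(coeffs, q=70):
    # zero out coefficients whose absolute value falls below the q-th percentile
    out = coeffs.copy()
    out[np.abs(out) < np.percentile(np.abs(out), q)] = 0.0
    return out

def soft_threshold(coeffs, t):
    # shrink every coefficient towards zero by t (standard soft-thresholding rule)
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)
```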
we evaluated both soft and hard thresholding approaches and tested an information-theoretic criterion known as the minimum description length principle (mdl). using our tool gwnet, users can choose from multiple options for finding the threshold, such as visushrink, sureshrink and mdl. here, we have used hard thresholding for most of the data-sets, as proper soft thresholding of graph-wavelet coefficients is itself a topic of intensive research and may need further fine-tuning. one can also choose the hard-threshold value based on the best overlap between the predicted gene-network and protein-protein interactions (ppi). while applying it to multiple data-sets, we realized that the threshold cutoffs estimated by the mdl criterion and by the best overlap of the predicted network with known interactions and ppi were in the range of the 60th-70th percentile. for comparing predicted networks from multiple data-sets, we needed a uniform percentile cutoff to threshold the graph-wavelet coefficients. hence, for uniform analysis of several data-sets, we set the default threshold value to the 70th percentile; in default mode, wavelet coefficients with absolute value below the 70th percentile are set to zero. the gwnet tool is flexible, and any network inference method can be plugged into it for making regulatory inferences using a graph-theoretic approach. here, for single-cell rna-seq data, we have used gene-expression values in the form of fpkm (fragments per kilobase of exon model per million reads mapped). we pre-processed single-cell gene expression by quantile normalization and log transformation. to start with, we used spearman and pearson correlations as simple estimates of the inter-dependencies among genes. we also used aracne (algorithm for the reconstruction of accurate cellular networks) to infer the network among genes. aracne first computes the mutual information for each gene pair, then considers all possible triplets of genes and applies the data processing inequality (dpi) to remove indirect interactions. according to the dpi, if gene i and gene j do not interact directly with each other but show dependency via gene k, the following inequality holds: i(g_i, g_j) ≤ min(i(g_i, g_k), i(g_k, g_j)), where i(g_i, g_j) represents the mutual information between gene i and gene j. aracne also removes interactions with mutual information less than a particular threshold eps, and we used a fixed eps value for this purpose. recently, skinnider et al. [6] showed the superiority of two measures of proportionality, rho (ρ) and phi (φs) [48], for estimating gene co-expression networks from single-cell transcriptome profiles. hence we also evaluated the benefit of graph-wavelet based denoising of gene-expression with the measures of proportionality ρ and φs. the measure of proportionality φ can be defined as φ(g_i, g_j) = var(g_i − g_j) / var(g_i), where g_i is the vector containing the log-transformed expression values of gene i across multiple samples (cells) and var() is the variance. its symmetric version can be written as φs(g_i, g_j) = var(g_i − g_j) / var(g_i + g_j), whereas rho can be defined as ρ(g_i, g_j) = 1 − var(g_i − g_j) / (var(g_i) + var(g_j)). to estimate both measures of proportionality, ρ and φ, we used the 'propr' package 2.0 [49]. the networks inferred from filtered and unfiltered gene-expression were compared to the ground truth. the ground truth for the dream5 challenge data-sets was already available, while for single-cell expression we assembled the ground truth from the hippie (human integrated protein-protein interaction reference) database [50]. we considered all possible edges in the network and sorted them based on the significance of their edge weights.
we calculated the area under the receiver operating characteristic curve for both raw and filtered networks by comparing against the edges in the ground truth. the receiver operating characteristic is a standard performance evaluation metric from the field of machine learning, which has been used in the dream5 evaluation method with some modifications. the modification here is that, for the x-axis, instead of the false-positive rate we used the number of edges sorted according to their weights. for evaluation, all possible edges, sorted by their weights, are taken from the gene-networks inferred from the filtered and raw expression. we calculated the improvement by measuring the fold change between the raw and filtered scores. we compared the results of our graph-wavelet based denoising approach with other methods meant for imputation or noise reduction in scrna-seq profiles. for comparison we used graph-fourier based filtering [17], magic [20], scimpute [21], dca [22], saver [23], randomly [24] and knn-impute [25]. brief descriptions and the corresponding parameters used for the other methods are given in the supplementary methods. the bulk gene-expression data used here for evaluation was downloaded from the dream5 portal (http://dreamchallenges.org/project/dream-5network-inference-challenge/). the single-cell expression profiles of mesc generated using different protocols [18] were downloaded from the geo database (geo id: gse75790). the single-cell expression profile of pancreatic cells from individuals of different age groups was downloaded from the geo database (geo id: gse81547). the scrna-seq profile of murine ageing lung published by kimmel et al. [14] is available with geo id: gse132901, while the ageing lung scrna-seq data published by angelidis et al. [51] is available with geo id: gse132901. the code for graph-wavelet based filtering of gene-expression is available at http://reggen.iiitd.edu.in:1207/graphwavelet/index.html. the code is also present at https://github.com/reggenlab/gwnet/ and supporting files are present at https://github.com/reggenlab/gwnet/tree/master/supporting_files.
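for readers who want to reproduce the spirit of the ranked-edge evaluation described above, the following is a rough, hypothetical sketch (it is not the gwnet or dream5 scoring code, which differ in detail); the gold_edges set, gene names and weight matrices are placeholders.

```python
# Sketch: ROC-style AUC of ranked edges against a gold-standard interaction list.
import numpy as np
from sklearn.metrics import roc_auc_score

def edge_auc(weight_matrix, gold_edges, genes):
    # weight_matrix: genes x genes edge scores; gold_edges: set of (geneA, geneB) pairs
    iu = np.triu_indices_from(weight_matrix, k=1)
    scores = np.abs(weight_matrix[iu])
    labels = np.array([(genes[i], genes[j]) in gold_edges or
                       (genes[j], genes[i]) in gold_edges
                       for i, j in zip(*iu)], dtype=int)
    return roc_auc_score(labels, scores)

# fold-change improvement of filtered over raw expression based inference
# improvement = edge_auc(W_filtered, gold, genes) / edge_auc(W_raw, gold, genes)
```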
1. an integrative approach for causal gene identification and gene regulatory pathway inference
2. single-cell transcriptomics unveils gene regulatory network plasticity
3. chemogenomic profiling of plasmodium falciparum as a tool to aid antimalarial drug discovery
4. supervised, semi-supervised and unsupervised inference of gene regulatory networks
5. reverse engineering cellular networks
6. evaluating measures of association for single-cell transcriptomics
7. evaluating methods of inferring gene regulatory networks highlights their lack of performance for single cell gene expression data
8. scenic: single-cell regulatory network inference and clustering
9. scode: an efficient regulatory network inference algorithm from single-cell rna-seq during differentiation
10. gene regulatory network inference from single-cell data using multivariate information measures
11. characterizing noise structure in single-cell rna-seq distinguishes genuine from technical stochastic allelic expression
12. noise in gene expression: origins, consequences, and control, science
13. comparative assessment of differential network analysis methods
14. murine single-cell rna-seq reveals cell-identity- and tissue-specific trajectories of aging
15. wisdom of crowds for robust gene network inference
16. genenetweaver: in silico benchmark generation and performance profiling of network inference methods
17. enhancing experimental signals in single-cell rna-sequencing data using graph signal processing
18. comparative analysis of single-cell rna sequencing methods
19. a gene regulatory network in mouse embryonic stem cells
20. recovering gene interactions from single-cell data using data diffusion
21. an accurate and robust imputation method scimpute for single-cell rna-seq data
22. single-cell rna-seq denoising using a deep count autoencoder
23. saver: gene expression recovery for single-cell rna sequencing
24. a random matrix theory approach to denoise single-cell data
25. missing value estimation methods for dna microarrays
26. single-cell analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns
27. enrichr: interactive and collaborative html5 gene list enrichment analysis tool
28. histamine stimulation of surfactant secretion from rat type ii pneumocytes
29. aging impairs vegf-mediated, androgen-dependent regulation of angiogenesis
30. dysfunction of pulmonary surfactant mediated by phospholipid oxidation is cholesterol-dependent
31. age-dependent changes in the pulmonary renin-angiotensin system are associated with severity of lung injury in a model of acute lung injury in rats
32. mapk and jak-stat signaling pathways are involved in the oxidative stress-induced decrease in expression of surfactant protein genes
33. transcription factor etv5 is essential for the maintenance of alveolar type ii cells, proceedings of the national academy of sciences of the united states of
34. targeted deletion of jun/ap-1 in alveolar epithelial cells causes progressive emphysema and worsens cigarette smoke-induced lung inflammation
35. androgen receptor and androgen-dependent gene expression in lung
36. the metabolic signature of macrophage responses
37. imbalanced host response to sars-cov-2 drives development of covid-19
38. single cell rna sequencing of 13 human tissues identify cell types and receptors of human coronaviruses
39. the aging transcriptome and cellular landscape of the human lung in relation to sars-cov-2
40. jak-stat pathway activation in copd, the european
41. androgen hazards with covid-19
42. the h1 histamine receptor regulates allergic lung responses
43. late breaking abstract - evaluation of the jnk inhibitor, cc-90001, in a phase 1b pulmonary fibrosis trial
44. androgen-deprivation therapies for prostate cancer and risk of infection by sars-cov-2: a population-based study (n = 4532)
45. visualizing data using t-sne
46. discrete signal processing on graphs: frequency analysis
47. wavelets on graphs via spectral graph theory
48. how should we measure proportionality on relative gene expression data?
49. propr: an r-package for identifying proportionally abundant features using compositional data analysis
50. hippie v2.0: enhancing meaningfulness and reliability of protein-protein interaction networks
51. an atlas of the aging lung mapped by single cell transcriptomics and deep tissue proteomics

we thank dr gaurav ahuja for providing us valuable advice on the analysis of single-cell expression profiles of ageing cells. none declared.

vibhor kumar is an assistant professor at iiit delhi, india. he is also an adjunct scientist at the genome institute of singapore. his interests include genomics and signal processing. divyanshu srivastava completed his masters thesis on graph signal processing at the computational biology department in iiit delhi, india. he has applied graph signal processing to protein structures and gene-expression data-sets. shreya mishra is a phd student at the computational biology department in iiit delhi, india. her interests include data science and genomics.

• we found that graph-wavelet based denoising of gene-expression profiles of bulk samples and single cells can substantially improve gene-regulatory network inference.
• more consistent prediction of gene-networks due to denoising leads to reliable comparison of predicted networks from old and young cells to study the effect of ageing using single-cell transcriptomes.
• our analysis revealed biologically relevant changes in regulation due to ageing in lung pneumocyte type ii cells, which showed similarity with the effects of covid infection in human lung.
• our analysis highlighted influential pathways and master regulators which could be topics of further study for reducing severity due to ageing.

key: cord-002929-oqe3gjcs authors: strano, emanuele; viana, matheus p.; sorichetta, alessandro; tatem, andrew j. title: mapping road network communities for guiding disease surveillance and control strategies date: 2018-03-16 journal: sci rep doi: 10.1038/s41598-018-22969-4 sha: doc_id: 2929 cord_uid: oqe3gjcs

human mobility is increasing in its volume, speed and reach, leading to the movement and introduction of pathogens through infected travellers. an understanding of how areas are connected, the strength of these connections and how this translates into disease spread is valuable for planning surveillance and designing control and elimination strategies. while analyses have been undertaken to identify and map connectivity in global air, shipping and migration networks, such analyses have yet to be undertaken on the road networks that carry the vast majority of travellers in low and middle income settings. here we present methods for identifying road connectivity communities, as well as mapping bridge areas between communities and key linkage routes. we apply these to africa, and show how many highly-connected communities straddle national borders and how, when integrating malaria prevalence and population data as an example, the communities change, highlighting regions most strongly connected to areas of high burden.
the approaches and results presented provide a flexible tool for supporting the design of disease surveillance and control strategies through mapping areas of high connectivity that form coherent units of intervention and key link routes between communities for targeting surveillance. unlike many other types of networks, the regular and planar nature of road networks precludes the formation of clear communities, i.e. roads that cluster together, shaping areas that are more connected within their boundaries than with external roads. highly connected regional communities can promote rapid disease spread within them, but can be afforded protection from recolonization by surrounding regions of reduced connectivity, making them potentially useful intervention or surveillance units 6,26,27. a focused control or elimination program in an isolated area is likely to stand a better chance of success than one in an area highly connected to high-transmission or outbreak regions. for example, reaching a required childhood vaccination coverage target in one district is substantially more likely to result in disease control and elimination success if that district is not strongly connected to neighbouring districts where the target has not been met. the identification of 'bridge' routes between highly connected regions could also be of value in targeting limited resources for surveillance 28. moreover, progressive elimination of malaria from a region needs to ensure that parasites are not reintroduced into areas that have been successfully cleared, necessitating a planned strategy for phasing that should be informed by connectivity and mobility patterns 26. here we develop methods for identifying and mapping road connectivity communities in a flexible, hierarchical way. moreover, we map 'bridge' areas of low connectivity between communities and apply these new methods to the african continent. finally, we show how these can be weighted by data on disease prevalence to better understand pathogen connectivity, using p. falciparum malaria as an example.

african road network data

data on the african road network (arn) were obtained from gps navigation and cartography data, as described in a previous study 24. the dataset maps primary and secondary roads across the continent, and while it does have commercial restrictions, it is a more complete and consistent dataset than alternative open road datasets (e.g. openstreetmap 29, groads 30). visual inspection and comparison between the arn and other spatial road inventories validated the improved accuracy and consistency of the arn; however, a quantitative validation analysis was not possible due to the lack of consistent ground-truth data at continental scales. figure 1a shows the african road network data used in this analysis. the road network dataset is a commercially restricted product and requests for it can be addressed directly to garmin 31.

plasmodium falciparum malaria prevalence and population maps

to demonstrate how geographically referenced data on disease occurrence or prevalence can be integrated into the approaches outlined, gridded data on plasmodium falciparum malaria prevalence were obtained from the malaria atlas project (http://www.map.ox.ac.uk/). these represent modelled estimates of the prevalence of p. falciparum parasites in 2015 per 5 × 5 km grid square across africa 32. additionally, gridded data on estimated population totals per 1 × 1 km grid square across africa in 2015 were obtained from the worldpop program (http://www.worldpop.org/).
the population data were aggregated to the same 5 × 5 km gridding as the malaria data, and the two were then multiplied together to obtain estimates of the total number of p. falciparum infections per 5 × 5 km grid square.

detecting communities in the african road network

we modeled the arn as a 'primal' road network, where roads are links and road junctions are nodes 33. spatial road networks have, like any network embedded in two dimensions, physical spatial constraints that impose on them a grid-like structure. in fact, the arn primal network is composed of 300,306 road segments that account for a total length of 2,304,700 km, with an average road length of 7.6 km ± 13.2 km. such a large standard deviation, as already observed elsewhere 23,24,34, is due to the long-tailed distribution of road lengths, as illustrated in fig. 1c. another property of road network structure is the frequency distribution of the degree of nodes, defined as the number of links connected to each node. most networks in nature and society have a long-tailed distribution of node degree, implying the existence of hubs (nodes that connect to a large number of other nodes) 21, with the majority of nodes connecting to very few others. for road networks, however, the degree distribution strongly peaks around 3, indicating that at most junctions a road connects with two other roads. the long-tailed distribution of road segment lengths, coupled with the peaked degree distribution, indicates the presence of a translationally invariant grid-like structure, in which road density varies smoothly among regions while their connectivity and structure do not. within such grid-like structures it is very difficult to identify clustered communities, i.e. groups of roads that are more connected within themselves than to other groups. this observation is confirmed by the spatial distribution of betweenness centrality (bc), which measures how often the shortest paths between each pair of nodes pass through a given road. the probability distribution of bc is long-tailed (fig. 1d), while its spatial distribution spreads across the entire network, forming a structural backbone, as shown in fig. 1b. again, under such conditions and because of the absence of bottlenecks, any strategy to detect communities that employs pruning on bc values 35 will be minimally effective. to detect communities in road networks we follow the observation that human displacement in urban networks is guided by straight lines 36. therefore, geometry can be used to detect communities of roads by assuming that people tend to move more along streets than between streets. we developed a community detection pipeline that converts a primal road network, where roads are links and road junctions are nodes 33, into a dual network representation, where roads become nodes and street junctions become links between nodes 37, by means of the straightness and contiguity of roads. community detection in the dual network is then performed using a modularity optimization algorithm 38. the communities found in the dual network are then mapped back to the original primal road network. these communities encode information about the geometry of the road pattern but can also incorporate weights associated with a particular disease to guide the process of community detection. it is important to note here that the units of analysis are road segments, which here are typically short and straight between intersections, making the straightness assumption valid.
nodes in the dual network represent lines in the primal network. the conversion from primal to dual is done by using a modified version of the algorithm known as continuity negotiation 37. in brief, we assume that a pair of adjacent edges belongs to the same street if the angle θ between these edges is smaller than θc = 30°. the angle between two adjacent edges (i, j) and (j, p) is given by the dot product cos(θ) = (r_{i,j} · r_{j,p}) / (|r_{i,j}| |r_{j,p}|), where r_{i,j} = r_j − r_i. under these assumptions, the angle between two edges belonging to a perfect straight line is zero, while it assumes a value of 90° for perpendicular edges. our algorithm starts by searching for the edge that generates the longest road in the primal space, as can be seen in fig. 2a. a node is then created in the dual space and assigned to this road. next, we search for the edge that generates the second longest road, and a new node is created in the dual space and assigned to this road. if there is at least one intersection between the new road and the previous one, we connect the respective nodes in the dual space. the algorithm continues until all the edges in the primal space are assigned to a node in the dual space, as shown in fig. 2b. note that the conversion from the primal to the dual road network has been used extensively to estimate human perception and movement along road networks (space syntax, see 36), which also supports our use of road geometry to detect communities. despite the regular structure of the network in the primal space, the topology of these networks in the dual space is very rich. for instance, the degree distribution in dual space follows the power law p(k) ∼ k^(−γ). this property has been previously identified in urban networks 33 and is strongly related to the long-tailed distribution of road lengths in these networks (see fig. 1c). since most roads are short, most nodes in dual space will have a small number of connections. on the other hand, there are a few long roads (fig. 2a) that give rise to hubs in the dual space (fig. 2b). our approach for detecting communities in road networks thus consists of performing classical community detection in the dual representation (fig. 2c) and then bringing the result back to the primal representation, as shown in fig. 2d. the algorithm used to detect the communities is the modularity-based algorithm by clauset and newman 35. the hierarchical mapping of communities on the african road network, with outputs for 10, 20, 30 and 40 sets of communities, is shown in fig. 3. the maps highlight how connectivity rarely aligns with national borders, with the areas most strongly connected through dense road networks typically straddling two or more countries. the hierarchical nature of the approach is illustrated through the breakdown of the 10 large regions in fig. 3a into further sub-regions in b, c and d, emphasizing the main structural divides within each region mapped in fig. 3a. some large regions appear consistently in each map: for example, a single community spans the entire north african coast, extending south into the sahara. south africa appears wholly contained within a single community, while the horn of africa, containing somalia and much of ethiopia and kenya, is consistently mapped as one community. the four maps shown are example outputs, but any number of communities can be identified. the clustering that maximises modularity produces 104 communities, and these are mapped in fig. 4.
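a simplified, hypothetical sketch of the primal-to-dual conversion and the community step described earlier in this section is given below. it greedily joins the best-aligned pair of segments at each junction rather than reproducing the longest-road-first negotiation of the original algorithm, and the node coordinates, segment lists and 30° threshold are the only inputs it assumes.

```python
# Sketch: angle-based primal-to-dual conversion and modularity communities.
import math
from collections import defaultdict
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def angle_between(p, q, r):
    # deflection angle (degrees) when travelling p -> q -> r; 0 for a straight line
    v1 = (q[0] - p[0], q[1] - p[1]); v2 = (r[0] - q[0], r[1] - q[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (n1 * n2)))))

def primal_to_dual(nodes, segments, theta_c=30.0):
    # nodes: junction id -> (x, y); segments: list of (u, v) junction pairs
    parent = list(range(len(segments)))            # union-find over segments
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]; i = parent[i]
        return i
    incident = defaultdict(list)
    for e, (u, v) in enumerate(segments):
        incident[u].append(e); incident[v].append(e)
    for j, edges in incident.items():              # join best-aligned pairs at each junction
        candidates = []
        for a in range(len(edges)):
            for b in range(a + 1, len(edges)):
                e1, e2 = edges[a], edges[b]
                p = nodes[[x for x in segments[e1] if x != j][0]]
                r = nodes[[x for x in segments[e2] if x != j][0]]
                ang = angle_between(p, nodes[j], r)
                if ang < theta_c:
                    candidates.append((ang, e1, e2))
        used = set()
        for ang, e1, e2 in sorted(candidates):     # each segment continues into at most one other
            if e1 not in used and e2 not in used:
                parent[find(e1)] = find(e2); used.update((e1, e2))
    dual = nx.Graph()                              # dual: merged roads as nodes, shared junctions as edges
    for j, edges in incident.items():
        roads = list({find(e) for e in edges})
        dual.add_nodes_from(roads)
        for a in range(len(roads)):
            for b in range(a + 1, len(roads)):
                dual.add_edge(roads[a], roads[b])
    return dual, find

# dual, road_of = primal_to_dual(nodes, segments)
# communities = greedy_modularity_communities(dual)   # Clauset-Newman-Moore modularity
```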
even with division into 104 communities, the north africa region remains a single community, strongly separated from sub-saharan africa by large bridge regions. south africa also remains almost wholly within its own community, with somalia and namibia showing similar patterns. the countries with the largest numbers of communities tend to be those with the least dense infrastructure, equating to poor connectivity, such as drc and angola, though west africa also shows many distinct clusters, especially within nigeria. apart from the sahara, the largest bridge regions of poor connectivity are located across the central belt of sub-saharan africa, where population densities are low and transport infrastructure is both sparse and often poor. the communities mapped in figs 3 and 4 align in many cases with recorded population and pathogen movements. for example, the broad southern and eastern community divides match well those seen in hiv-1 subtype analyses 12 and community detection analyses based on migration data 27. at more regional scales, there also exist similarities with prior analyses based on human and pathogen movement patterns. for example, the western, coastal and northern communities within kenya in fig. 4b match those identified previously through mobile phone and census-derived movement data 39,40. further, guinea, liberia and sierra leone typically remain mostly within a single community in fig. 3, with some divides evident in fig. 4c. this shows some strong similarities with the spread of ebola virus reconstructed through genome analysis 15, particularly the multiple links between rural guinea and sierra leone, though fig. 4c highlights a divide between the regions containing conakry and freetown when africa is broken into 104 communities. figure 3 highlights the connections between kinshasa in western drc and angola, with the recent yellow fever outbreak spreading within the communities mapped. figure 4d shows the 'best' communities map for an area of southern africa: the strong cross-border links between swaziland, southern mozambique and western south africa are mapped within a single community, as are the wider links highlighted in fig. 3, matching the travel patterns found from swaziland malaria surveillance data 41.

integrating p. falciparum malaria prevalence and population data with road networks for weighted community detection

the previous section outlined methods for community detection on unweighted road networks. to integrate disease occurrence, prevalence or incidence data for the identification of areas of likely elevated movement of infections or for guiding the identification of operational control units, an adaptation to weighted networks is required. we demonstrate this through the integration of the data on estimated numbers of p. falciparum infections per 5 × 5 km grid square into the community detection pipeline. the final pipeline for community detection calculates a trade-off between the form and function of roads in order to obtain a network partition. the form is related to the topology of the road network and is taken into account during the primal-dual conversion; this topological component guarantees that only neighbouring, well-connected locations can belong to the same community. the functional part, on the other hand, is calculated from the estimated p. falciparum malaria prevalence multiplied by population to obtain estimated numbers of infections, as outlined above.
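building on the primal-to-dual sketch above, the weighted variant formalised in the next paragraph can be illustrated as follows. this is a hedged sketch only: the max of prevalence × population along each segment follows the paper's stated choice, but the aggregation to dual-node weights and the combination into dual-edge weights are illustrative assumptions, not the paper's exact equations (2)-(3), and all input names are placeholders.

```python
# Sketch: attaching malaria-based weights to the dual network and running weighted modularity.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def weighted_dual(dual, road_of, segments, seg_samples, malaria, pop):
    # seg_samples[e]: grid cells sampled along primal segment e
    # malaria[c], pop[c]: prevalence and population of grid cell c
    w_primal = {e: max(malaria[c] * pop[c] for c in seg_samples[e])     # eq. (1)-style weight
                for e in range(len(segments))}
    w_node = {}                                                         # dual-node weights (assumed: max)
    for e, w in w_primal.items():
        r = road_of(e)
        w_node[r] = max(w_node.get(r, 0.0), w)
    G = dual.copy()
    for i, j in G.edges():                                              # dual-edge weights (assumed: mean)
        G[i][j]['weight'] = 0.5 * (w_node.get(i, 0.0) + w_node.get(j, 0.0))
    return G

# Gw = weighted_dual(dual, road_of, segments, seg_samples, malaria, pop)
# weighted_communities = greedy_modularity_communities(Gw, weight='weight')
```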
the two factors were combined to assign a weight to each edge of our primal network. the weight w_{i,j} of edge (i, j) is defined from m(r), the p. falciparum malaria prevalence, and p(r), the population count, both at coordinate r, combined as m(r)·p(r) and summarised along the edge (eq. (1)); as described below, the max function was used here. these values are obtained directly from the data. when the primal representation is converted into its dual version, the weights of the primal edges, given by eq. (1), are converted into weights of the dual nodes (eq. (2)), defined by aggregating over ω_i, where i represents the i-th dual node and ω_i represents the set of all primal edges that were combined together to form dual node i (see fig. 2a,b). finally, weights λ_{i,j} for the dual edges are created from the weights of the corresponding dual nodes (eq. (3)). the dual network weighted by the values λ_{i,j} was used as input for a weighted community detection algorithm. ultimately, when the communities detected in the dual space are translated back to the primal space, neighbouring locations with similar values of estimated p. falciparum infections belong to the same communities. for the example of p. falciparum malaria used here, the max function was used, representing the maximum number of infections on each road segment in 2015. this was chosen to identify connectivity to the highest burden areas. areas with large numbers of infections are often 'sources', with infected populations moving back and forth from them, spreading parasites elsewhere 6,42. therefore, mapping which regions are most strongly connected to them is of value. alternative metrics can be used however, depending on the aims of the analyses. the integration of p. falciparum malaria prevalence and population (fig. 5a) through weighting road links by the maximum values across them produces a different pattern of communities (fig. 5b) to those based solely on network structure (fig. 3). the mapping of 20 communities is shown here, as it identifies key regions of known malaria connectivity, as outlined below. the mapping shows areas of key interest in malaria elimination efforts connected across national borders, such as much of namibia linked to southern angola 43, with the zambezi region of namibia more strongly linked to the community encompassing neighbouring zambia, zimbabwe and botswana 44. in namibia, malaria movement communities identified through the integration of mobile phone-based movement data and case-based risk mapping 26 show correspondence in mapping a northeast community. moreover, swaziland is shown as being central to a community covering southern mozambique and the malaria-endemic regions of south africa, matching closely the origin locations of the majority of internationally imported cases to swaziland and south africa 41,45,46. the movements of people and malaria between the highlands and southern and western regions of uganda, and into rwanda 47, also align with the community patterns shown in fig. 5b. finally, though quantifying different factors, the analyses show a similar east-west split to that found in analyses of malaria drug resistance mutations 6,48 and malaria movement community mapping 27. the emergence of new disease epidemics is becoming a regular occurrence, and drug and insecticide resistance are continuing to spread around the world. as global, regional and local efforts to eliminate a range of infectious diseases continue and are initiated, an improved understanding of how regions are connected through human transport can therefore be valuable.
previous studies have shown how clusters of connectivity exist within the global air transport network 49, 50 and the shipping traffic network 50, but these represent primarily the sources of occasional long-distance disease or vector introductions 1, 8, rather than the mode of transport that the majority of the population uses regularly. the approaches presented here, focused on road networks, provide a tool for supporting the design of disease and resistance surveillance and control strategies through mapping (i) areas of high connectivity where pathogen circulation is likely to be high, forming coherent units of intervention; (ii) areas of low connectivity between communities that form likely natural borders of lower pathogen exchange; and (iii) key link routes between communities for targeting surveillance efforts. the outputs of the analyses presented here highlight how highly connected areas consistently span national borders. with infectious disease control, surveillance, funding and strategies principally implemented country by country, this emphasises a mismatch in scales and the need for cross-border collaboration. such collaborations are increasingly being seen, for example with countries focused on malaria elimination (e.g. 51, 52), but the outputs here show that the most efficient disease elimination strategies may need to reconsider units of intervention, moving beyond being constrained by national borders. results from the analysis of pathogen movements elsewhere confirm these international connections (e.g. 6, 12, 41, 48), building up additional evidence on how pathogen circulation can be substantially more prevalent in some regions than others. the approaches developed here provide a complement to other approaches for defining and mapping regional disease connectivity and mobility 9. previously, census-based migration data have been used to map blocks of countries of high and low connectivity 27, but these analyses are restricted to national scales and cover only longer-term human mobility. efforts are being made to extend these to subnational scales 53, 54, but they remain limited to large administrative unit scales and the same long timescales. mobile phone call detail records (cdrs) have also been used to estimate and map pathogen connectivity 26, 40, but the nature of the data means that they do not include cross-border movements, so they remain limited to national-level studies. an increasing number of studies are uncovering patterns in human and pathogen movements and connectivity through travel history questionnaires (e.g. 41, 47, 55, 56), resulting in valuable information, but typically limited to small areas and short time periods. there exist a number of limitations to the methods and outputs presented here that future work will aim to address. firstly, the hierarchies of road types are not currently taken into account in the network analyses, meaning that a major highway and small local roads contribute equally to community detection and epidemic spreading. the lack of reliable data on road typologies, and inconsistencies in classifications between countries, make this challenging to incorporate, however. moreover, the relative importance of a major road versus secondary and tertiary roads and tracks is exceptionally difficult to quantify within a country, let alone between countries and across africa. finally, data on seasonal variations in road access do not exist consistently across the continent.
our focus has therefore been on connectivity, in terms of how well regions are connected based on existing road networks, irrespective of the ease of travel. a broader point that deserves future research is that while intuition suggests a correspondence in most places, connectivity may not always translate into human or pathogen movement. future directions for the work presented here include quantitative comparison and integration with other connectivity data, the integration of different pathogen weightings, and the extension to other regions of the world. qualitative comparisons outlined above show some good correspondence with analyses of alternative sources of connectivity and disease data. a future step will be to compare these different connections and communities quantitatively to examine the weight of evidence for delineating areas of strong and weak connectivity. this could potentially follow similar studies looking at community structure on weighted networks, such as in the us based on commuting data 57, or the uk and belgium based on mobile network data 58, 59. here, p. falciparum malaria was used to provide an example of the potential for weighting analyses by pathogen occurrence, prevalence, incidence or transmission suitability. moreover, future work will examine the integration of alternative pathogen weightings. the maximum difference method was used here to pick out regions well connected to areas of high p. falciparum burden, but the potential exists to use different weighting methods depending on requirements, strategic needs, and the nature of the pathogen being studied. despite the rapid growth of air travel, shipping and rail in many parts of the world, roads continue to be the dominant route on which humans move at sub-national, national and regional scales. they form a powerful force in shaping the development of areas, facilitating trade and economic growth, but also bringing with them the exchange of pathogens. results here show that their connectivity is not equal, however, with strong clusters of high connectivity separated by bridge regions of low network density. these structures can have a significant impact on how pathogens spread, and by mapping them, a valuable evidence base to guide disease surveillance as well as control and elimination planning can be built.

results were produced through four main phases. phase 1: road network cleaning and weighted adjacency list production. the road cleaning operation aimed to produce a usable road network from the georeferenced vector network of road infrastructure. this phase was conducted using esri arcmap 10.4 (http://desktop.arcgis.com/en/arcmap/) through the use of the topological cleaning tool. the tool integrates contiguous roads, removes very short links and removes overlapping road segments. road junctions were created using the polyline to node conversion tool, while road-link association was computed using the spatial join tool. malaria prevalence values were assigned to each road using the spatial join tool. the adjacency matrix output, which also contains the coordinates of each road junction, was extracted as a text file. phase 2: conversion from the primal to the dual network. the primal network created in phase 1 was then used as input for a continuity negotiation-like algorithm. the goal of this algorithm was to translate the primal network into its dual representation (see fig. 2a,b).
the implementation of the negotiation-like algorithm used the igraph library in c++ (http://igraph.org/c/) on an octa-core imac; the conversion took around 20 hours for a primal network with ~200 k nodes. the algorithm works by first identifying roads composed of many contiguous edges in the primal space. two primal edges are assumed to be contiguous if the angle between them is not greater than 30°. because the dual representation generated by the algorithm strongly depends on the starting edge, we started by looking for the edge that produces the longest road. as soon as this edge was found, a dual node was created to represent that road. next we proceeded to look for the edge that produced the second longest road and created a dual node for that road. we continued this process until every primal edge had been assigned to a road. finally, dual nodes were connected to each other if their primal counterparts (roads) crossed each other in the primal space.

phase 3: community detection. we used a traditional modularity optimization-based algorithm to identify communities in the dual representation of the road network. the modularity metrics were computed in r using the igraph library (http://igraph.org/r/). to incorporate the prevalence of malaria, we used the malaria prevalence values as edge weights for community detection.

phase 4: mapping communities. detected communities were mapped back to the primal road network with the use of the spatial join tool in arcmap. all maps were produced in arcmap.
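a simplified python sketch of this conversion and the subsequent community detection is given below (an illustrative stand-in, not the published c++/r igraph implementation; the pairwise 30° merging via union-find and the greedy modularity optimiser are assumptions):

    import math
    import networkx as nx
    from networkx.algorithms import community

    def deflection(p, q, r):
        # turning angle in degrees at junction q when travelling p -> q -> r
        a1 = math.atan2(q[1] - p[1], q[0] - p[0])
        a2 = math.atan2(r[1] - q[1], r[0] - q[0])
        return abs((math.degrees(a2 - a1) + 180.0) % 360.0 - 180.0)

    def dual_from_primal(primal, pos, max_angle=30.0):
        # merge contiguous primal edges (deflection <= max_angle) into 'roads',
        # then build one dual node per road and link roads that share a junction
        parent = {e: e for e in primal.edges()}
        def find(e):
            while parent[e] != e:
                parent[e] = parent[parent[e]]
                e = parent[e]
            return e
        def as_key(a, b):
            return (a, b) if (a, b) in parent else (b, a)
        for q in primal.nodes():
            nbrs = list(primal.neighbors(q))
            for i in range(len(nbrs)):
                for j in range(i + 1, len(nbrs)):
                    p, r = nbrs[i], nbrs[j]
                    if deflection(pos[p], pos[q], pos[r]) <= max_angle:
                        parent[find(as_key(p, q))] = find(as_key(q, r))
        dual = nx.Graph()
        for e in parent:
            dual.add_node(find(e))
        for q in primal.nodes():
            roads = {find(as_key(p, q)) for p in primal.neighbors(q)}
            for a in roads:
                for b in roads:
                    if a != b:
                        dual.add_edge(a, b)
        return dual

    # community detection on the dual network via modularity optimisation;
    # greedy modularity stands in here for whichever optimiser was used:
    # comms = community.greedy_modularity_communities(dual, weight="weight")

unlike the published algorithm, this sketch does not grow roads greedily starting from the longest one, so it only approximates the dual representation described above.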
references
global transport networks and infectious disease spread
severe acute respiratory syndrome
h5n1 influenza-continuing evolution and spread
geographic dependence, surveillance, and origins of the 2009 influenza a (h1n1) virus
the global tuberculosis situation and the inexorable rise of drug-resistant disease
the transit phase of migration: circulation of malaria and its multidrug-resistant forms in africa
population genomics studies identify signatures of global dispersal and drug resistance in plasmodium vivax
air travel and vector-borne disease movement
mapping population and pathogen movements
unifying viral genetics and human transportation data to predict the global transmission dynamics of human influenza h3n2
the blood dna virome in 8,000 humans
spatial accessibility and the spread of hiv-1 subtypes and recombinants
the early spread and epidemic ignition of hiv-1 in human populations
spread of yellow fever virus outbreak in angola and the democratic republic of the congo 2015-16: a modelling study
virus genomes reveal factors that spread and sustained the ebola epidemic
commentary: containing the ebola outbreak-the potential and challenge of mobile network data
world development report 2009: reshaping economic geography
population distribution, settlement patterns and accessibility across africa in 2010
the structure of transportation networks
elementary processes governing the evolution of road networks
urban street networks, a comparative analysis of ten european cities
the scaling structure of the global road network
street centrality and densities of retail and services in bologna
integrating rapid risk mapping and mobile phone call record data for strategic malaria elimination planning
international population movements and regional plasmodium falciparum malaria elimination strategies
cross-border malaria: a major obstacle for malaria elimination
information technology outreach services (itos), university of georgia
global roads open access data set, version 1 (groadsv1)
the effect of malaria control on plasmodium falciparum in africa between
the network analysis of urban streets: a primal approach
random planar graphs and the london street network
finding community structure in very large networks
networks and cities: an information perspective
the network analysis of urban streets: a dual approach
modularity and community structure in networks
the use of census migration data to approximate human movement patterns across temporal scales
quantifying the impact of human mobility on malaria
travel patterns and demographic characteristics of malaria cases in swaziland
human movement data for malaria control and elimination strategic planning
malaria risk in young male travellers but local transmission persists: a case-control study in low transmission namibia
the path towards elimination
reviewing south africa's malaria elimination strategy (2012-2018): progress, challenges and priorities
targeting imported malaria through social networks: a potential strategy for malaria elimination in swaziland
association between recent internal travel and malaria in ugandan highland and highland fringe areas
multiple origins and regional dispersal of resistant dhps in african plasmodium falciparum malaria
the worldwide air transportation network: anomalous centrality, community structure, and cities' global roles
the complex network of global cargo ship movements
asian pacific malaria elimination network
mapping internal connectivity through human migration in malaria endemic countries
census-derived migration data as a tool for informing malaria elimination policy
key traveller groups of relevance to spatial malaria transmission: a survey of movement patterns in four subsaharan african countries
infection importation: a key challenge to malaria elimination on bioko island, equatorial guinea
an economic geography of the united states: from commutes to megaregions
redrawing the map of great britain from a network of human interactions
uncovering space-independent communities in spatial networks

e.s., m.p.v. and a.j.t. conceived and designed the analyses. e.s. and m.p.v. designed the road network community mapping methods and undertook the analyses. all authors contributed to writing and reviewing the manuscript. competing interests: the authors declare no competing interests.
key: cord-024830-cql4t0r5 authors: mcmillin, stephen edward title: quality improvement innovation in a maternal and child health network: negotiating course corrections in mid-implementation date: 2020-05-08 journal: j of pol practice & research doi: 10.1007/s42972-020-00004-z sha: doc_id: 24830 cord_uid: cql4t0r5 this article analyzes mid-implementation course corrections in a quality improvement innovation for a maternal and child health network working in a large midwestern metropolitan area. participating organizations received restrictive funding from this network to screen pregnant women and new mothers for depression, make appropriate referrals, and log screening and referral data into a project-wide data system over a one-year pilot program. this paper asked three research questions: (1) what problems emerged by mid-implementation of this program that required course correction? (2) how were advocacy targets developed to influence network and agency responses to these mid-course problems? (3) what specific course corrections were identified and implemented to get implementation back on track? this ethnographic case study employs qualitative methods including participant observation and interviews. data were analyzed using the analytic method of qualitative description, in which the goal of data analysis is to summarize and report an event using the ordinary, everyday terms for that event and the unique descriptions of those present. three key findings are noted. first, network participants quickly responded to the emerged problem of under-performing screening and referral completion statistics. second, they shifted advocacy targets away from executive appeals and toward the line staff actually providing screening. third, participants endorsed two specific course corrections, using “opt out, not opt in” choice architecture at intake and implementing visual incentives for workers to track progress. opt-out choice architecture and visual incentives served as useful means of focusing organizational collaboration and correcting mid-implementation problems. this study examines inter-organizational collaboration among human service organizations serving a specific population of pregnant women and mothers at risk for perinatal depression. these organizations received restrictive funding from a local community network to screen this population for risk for depression, make appropriate referrals as indicated, and log screening and referral data into a project-wide data system for a 1-year pilot program. this paper asked three specific research questions: (1) what problems emerged by mid-implementation of the screening and referral program that required course correction? (2) how were advocacy targets developed to influence network and agency responses to these mid-course problems? (3) what specific course corrections were identified and implemented to get implementation back on track? previous scholarship (mcmillin 2017) reported the background of how the maternal and child health organization studied here began as a community committee funded by the state legislature to address substance use by pregnant women and new mothers. ultimately this committee grew into a 501(c)3 nonprofit backbone organization (mcmillin 2017) that increasingly served as a pass-through entity for many grants it administered and dispersed to health and social service agencies who were members and partners of the network and primarily served families with young children. 
one important grant was shared with six network partner agencies to create a pilot program and data-sharing system for a universal screening and referral protocol for perinatal mood and anxiety disorders. this innovation used a network-wide shared data software system into which staff from all six partner agencies entered their screening results and the referrals they gave to clients. universal screening and referral for perinatal mood and anxiety disorders and cooccurring issues meant that every partner agency would do some kind of screening (virtually always an empirically validated instrument such as the edinburgh postnatal depression scale), and every partner would respond to high screens (scores over the designated clinical cutoff of whatever screening instrument was being used, indicating the presence of or high risk for clinical depression) with a referral to case managers in partner agencies that were also funded by the network. the funded partners faced a very tight timeline that anticipated regular screening and enrollment of an estimated number of clients in case management and depression treatment for every month of the fiscal program year. a slow start in screening and enrolling patients meant that funded partners would likely be in violation of their grant contract with the network while facing a rapidly closing window of time in which they would be able to catch up and provide enough contracted services to meet the contractual numbers for their catchment area, which could jeopardize funding for a second year. this paper covers the 4 months in the middle of the pilot program year when network staff realized that funded partners were seriously behind schedule in the amount of screens and referrals for perinatal mood and depression these agencies were contracted to make at this point in the fiscal year. although challenging and complex for many human service organizations, collaboration with competitors in the form of "co-opetive relationships" has been linked to greater innovation and efficiency (bunger et al. 2017, p. 13) . but grant cycle funding can add to this complexity in the form of the "capacity paradox," in which small human service organizations working with specific populations face funding restrictions because they are framed as too small or lacking capacity for larger scale grants and initiatives (terrana and wells 2018, p. 109) . finally, once new initiatives are implemented in a funded cycle, human service organizations are increasingly expected to engage in extensive, timely, and often very specific data collection to generate evidence of effectiveness for a particular program (benjamin et al. 2018) . mid-course corrections during implementation of prevention programs targeted to families with young children have long been seen as important ways to refine and modify the roles of program staff working with these populations and add formal and informal supports to ongoing implementation and service delivery prior to final evaluation (lynch et al. 1998; wandersman et al. 1998) . mid-course corrections can help implementers in current interventions or programs adopt a more facilitative and individualized approach to participants that can improve implementation fidelity and cohesion (lynch et al. 1998; sobeck et al. 2006) . 
comprehensive reviews of implementation of programs for families with young children have consistently found that well-measured implementation improves program outcomes significantly, especially when dose or service duration is also assessed (durlak and dupre 2008; fixsen et al. 2005) . numerous studies have emphasized capturing implementation data at a low enough level to be able to use it to improve service data quickly and hit the right balance of implementation fidelity and thoughtful, intentional implementation adaptation (durlak and dupre 2008; schoenwald et al. 2010; schoenwald et al. 2013; schoenwald and hoagwood 2001; tucker et al. 2006 ). inter-organizational networks serving families with young children face special challenges in making mid-course corrections while maintaining implementation fidelity across member organizations (aarons et al. 2011; hanf and o'toole 1992) . implementation through inter-organizational networks is never merely a result of clear success or clear failure; rather, it is an ongoing assessment of how organizational actors are cooperating or not across organizational boundaries (cline 2000) . frambach and schillewaert (2002) echo this point by noting that intra-organizational and individual cooperation, consistency, and variance also have strong effects on the eventual level of implementation cohesion and fidelity that a given project is able to reach. moreover, recent research suggests that while funders and networks may emphasize and prefer inter-organizational collaboration, individual agency managers in collaborating organizations may see risks and consequences of collaboration and may face dilemmas in complying with network or funder expectations (bunger et al. 2017) . similar organizations providing similar services with overlapping client bases may fear opportunism or poaching from collaborators, and interpersonal trust as well as contracts or memoranda of understanding might be needed to assuage these concerns (bunger 2013; bunger et al. 2017) . even successful collaboration may expedite mergers between collaborating organizations that are undesired or disruptive to stakeholders and sectors (bunger 2013) . while funders may often prefer to fund larger and more comprehensive merged organizations, smaller specialized community organizations founded by and for marginalized populations may struggle to maintain their community connection and focus as subordinate components of larger firms (bunger 2013) . organizational policy practice and advocacy for mid-course corrections in a pilot program likely looks different from the type of advocacy and persuasion efforts that might seek to gain buy-in for initial implementation of the program. fischhoff (1989) notes that in the public health workforce, individual workers rarely know how to organize their work to anticipate the possibility or likelihood of mid-course corrections because most work is habituated and routinized to the point that it is rarely intentionally changed, and when it is changed, it is due to larger issues on which workers expect extensive further guidance. when a need for even a relatively minor mid-course correction is identified, it can result in everyone concerned "looking bad," from the workers adapting their implementation to the organizations requesting the changes (fischhoff 1989, p. 112) . 
there is also some evidence that health workers hold surprisingly stable, pre-existing beliefs about their work situations and experiences, so requests to make mid-course corrections may have to contend with those beliefs about the program being implemented no matter how well-reasoned the proposed course corrections are (harris and daniels 2005). given a new emphasis in social work that organizational policy advocacy should be re-conceptualized as part of everyday organizational practice (mosley 2013), a special focus on strategies that contribute to the success of professional networks and organizations that can leverage influence beyond that of a single agency becomes increasingly important. given the above problems with inter-organizational collaboration, increased attention has turned to automated methods of implementation that reduce burden on practitioners without unduly reducing freedom of choice and action. behavioral economics and behavioral science approaches have been suggested as ways to assist direct practitioners to follow policies and procedures that they are unlikely to intend to violate. evidence suggests that behavior in many contexts is easy to predict with high accuracy, and behavioral economics seeks to alter people's behavior in predictable, honest, and ethical ways without forbidding any options or significantly adding too-costly incentives, so that the best or healthiest choice is the easiest choice (thaler and sunstein 2008). following mosley's (2013) recommendation, this paper examines in detail how a heavily advocated quality improvement pilot program for a maternal and child health network working in a large midwestern metropolitan area attempted to make mid-implementation course corrections for a universal screening and referral program for perinatal mood and anxiety disorders conducted by its member agencies. this paper answers the call of recent policy practice and advocacy research to examine how "openness, networking, sharing of tasks," and building and maintaining positive relationships are operative within organizational practice across multiple organizations (ruggiano et al. 2015, p. 227). additionally, this paper focuses on extending recent research to understand how mandated screening for perinatal mood and anxiety disorders can be implemented well (yawn et al. 2015). this study used an ethnographic case study method because treating the network and this pilot program as a case study makes it possible to examine unique local data while also locating and investigating counter-examples to what was expected locally (stake 1995). this method makes it possible to inform and modify grand generalizations about the case before such generalizations become widely accepted (stake 1995). this study also used ethnographic methods such as participant observation and informal, unstructured interview conversations at regularly scheduled meetings. adding these ethnographic approaches to a case study which is tightly time-limited can help answer research questions fully and efficiently (fusch et al. 2017). data were collected at regular network meetings, which were 2-3 hours long and held twice a month. one meeting was a large group of about 30 participants who supervised or performed screening and case management for perinatal mood and anxiety disorders, along with local practitioners in quality improvement and workforce training and development.
a second executive meeting was held with 8-12 participants, typically network staff and the two co-chairs of each of the three organized groups (a screening and referral group, a workforce training group, and a quality improvement group), to debrief and discuss major issues reported at the large group meeting. for this study, the author served as a consultant to the quality improvement group and attended and took notes on network meetings in the middle of the program year (november through february) to investigate how mid-course corrections in the middle of the contract year were unfolding. these network meetings generally used a world café focus group method, in which participants moved from a large group to several small groups discussing screening and referral, training, and quality improvement specifically, then moved back to report small group findings to the large group (fouché and light 2011). the author typed extensive notes on an ipad, and note-taking during small group breakouts could only intermittently capture content due to the roving nature of the world café model. note-taking was typically unobtrusive because virtually all participants in both small and large group meetings took notes on the discussion. note-taking and note-sharing were also a frequent and iterative process, in which the author commonly forwarded the notes taken at each meeting to network staff and group participants after each meeting to gain their insights and help construct the agenda of the next meeting. by the middle of the program year, network participants had gotten to know each other and the author quite well, so the author was typically able to easily arrange additional conversations for purposes of member checking. these informal meetings supplemented the two regular monthly meetings of the network and allowed for specific follow-up in which participants were asked about specific comments and reactions they had shared at previous meetings. brief powerpoint presentations were also used at the beginning of successive meetings during the program year to summarize announcements and ideas from the last meeting and encourage new discussion. often, powerpoints were used to remind participants of dates, deadlines, statistics, and refined and summarized concepts. because so many other meeting participants took their own notes and shared them during meetings, a large amount of feedback on meeting topics and their meaning was obtained. the author then coded these meeting notes in an iterative and sequenced process guided by principles of qualitative description, in which the goal of data analysis is to summarize and report an event using the ordinary, everyday terms for that event and the unique descriptions of those present (sandelowski and leeman 2012). this analytic method was chosen because it is especially useful when interviewing health professionals about a specific topic, in that interpretation stays very close to the data presented while leveraging all of the methodological strengths of qualitative research, such as multiple, iterative coding, member checking, and data triangulation (neergaard et al. 2009). in this way, qualitative organizational research remains rigorous, while the significance of findings is easily translated to wider audiences for rapid action in intervention and implementation (sandelowski and leeman 2012).
by the middle of the program year, network meeting participants explicitly recognized that mid-course corrections were needed in the implementation of the new quality improvement and data-sharing program for universal screening and referral of perinatal mood and anxiety disorders. after iterative analysis of shared meeting notes, three key challenges were salient as themes from network meetings in the middle of the program year. regarding the first research question, concerning what problems emerged by midimplementation that required course correction, data showed that the numbers of clients screened and referred were a fraction of what was contractually anticipated by midway through the program year. this problem was two-fold, in that fewer screenings than expected were reported, but also data showed that those clients who screened as at risk for a perinatal mood and anxiety disorder were not consistently being given the referrals to further treatment indicated by the protocol. this was the first time the network had seen "real numbers" from the collected data for the program year that could be compared with estimated and predicted numbers for each part of the program year, both in terms of numbers anticipated to be screened and especially in terms of numbers expected to be referred to the case management services being funded by the network. however, the numbers were starkly disappointing: only about half of those whose screening scores were high enough to trigger referrals were actually offered referrals, and only about 2/3 of those who received referrals actually accepted the referral and followed up for further care. by the middle of the program year, only 16% of expected, estimated referrals had been issued, and no network partner was at the 50% expected. in responding to this data presentation, participants offered several possible patientlevel explanations. first, several noted that patients commonly experience inconsistent providers during perinatal care and may have little incentive to follow up on referrals after such a fragmented experience. one participant noted a patient who had been diagnosed with preeclampsia (a pregnancy complication marked by high blood pressure) by her first provider, but the diagnostician gave no further information beyond stating the diagnosis, and then the numerous other providers this patient saw never mentioned it again. this patient moved through care knowing nothing about her diagnosis and with little incentive to accept or follow up with other referrals. other participants noted that the typical approach to discharge planning and care transitions by providers was a poor match for clean, universal screening and referral, and that satisfaction surveys had captured patient concerns about how they were given information, which was typically on paper and presented to the patient as she leaves the hospital or medical office. as one participant noted, "we flood them the day mom leaves the hospital and we're lucky if the paper ever gets out of the back seat of the car." others noted that patients may not follow up on referrals simply because they are feeling overwhelmed with other areas of life or are feeling emotionally better without further treatment. however, while these explanations may shed light on why referred patients did not follow up on or keep referrals, they do nothing to explain why no referral or follow-up was offered for screens that were above the referral cutoff. two further explanations were salient. 
one explanation centered on the idea that some positive screens were potentially being ignored because staff may have been reluctant to engage or feared working with clients in crisis, described as an "if i don't ask the question, i don't have to deal with it" mindset. all screening tools used numeric scores, so that triggered referrals were not dependent on staff having to decide independently to make a referral, but conveying the difficult news that a client had scored high enough on a depression scale to warrant a follow-up referral may have been daunting to some staff. an alternative explanation suggested that staff were not ignoring positive screens but were not understanding the intricacies and expectations of the screening process. of the community agencies partnering with the network to provide screening, many were also able to provide case management, but staff did not realize that internal referrals to a different part of their agency still needed to be documented. in this case, a number of "missed" referrals could in fact have been provided but never documented in the network data-sharing system. regarding the second research question, concerning how advocacy targets needed to change based on the identification of the problem, participants agreed that the previous plan to reinforce the importance of the screening program to senior executives in current and potential partner agencies (mcmillin 2017) needed to be updated to reflect a much tighter focus on the line staff actually doing the work (or alternatively not doing the work in the ways expected) in the months remaining in the funded program year. one participant noted that the elusive warm handoff, a discharge and referral where the patient was warmly engaged, clearly understood the expected next steps, and was motivated to visit the recommended provider for further treatment, was also challenging for staff who might default to a "just hand them the paper" mindset, especially for those staff who were overwhelmed and understaffed. the network was funding additional case managers to link patients to treatment, but partner agencies were expected to screen using their usual staff, who had been trained but not increased or otherwise compensated to do the screening. additional concerns mentioned the knowledge and preparation of staff to make good referrals, with an example noted of one staff member seemingly unaware of how to make a domestic violence referral even though a locally well-known agency specializing in interpersonal violence treatment and prevention had been working with the network for some time. meeting participants agreed that in the time remaining for the screening and referral pilot, advocacy efforts would have to be diverted away from senior executives and toward line staff if there was to be any chance of meeting enrollment targets and justifying further funding for the screening and referral program. participants also noted that while the operational assumption was that agencies that were network partners this pilot year would remain as network partners for future years of the universal screening and referral program, there was no guarantee of this. partner agencies that struggled to complete the pilot year of the program, with disappointing numbers, might decline to participate the next year, especially if they lost network funding based on their challenged performance in the current program year.
this suggested that additional advocacy at the executive level might still be needed, as executives could lead their agencies out of the network screening system after june 30, but that for the remainder of the program year, the line staff themselves who were performing screening needed to be heavily engaged and lobbied to have any hope of fully completing the pilot program on time. regarding the third research question, concerning specific course corrections identified and implemented to get implementation back on track, a prolonged brainstorming session was held after the disappointing data were shared. this effort produced a list of eight suggested "best practices" to help engage staff performing screening duties to excel in the work: (1) making enrollment targets more visible to staff, perhaps by using visual charts and graphs common in fundraising drives; (2) using "opt-out" choice architecture that would automatically enroll patients who screened above the cutoff score unless the patient objected; (3) sequencing screens with other paperwork and assessments in ways that make sure screens are completed and acted upon in a timely way; (4) offering patients incentives for completing screens; (5) educating staff on reflective practice and compassion fatigue to avoid or reduce feelings of being overwhelmed about screening; (6) using checklists that document work that is done and undone; (7) maintaining intermittent contact and follow-up with patients to check on whether they have accepted and followed up on referrals; and (8) using techniques of prolonged engagement so that by the time staff are screening patients for perinatal mood and anxiety disorders, patients are more likely to be engaged and willing to follow up. further discussion of these best practices noted that there was no available funding to compensate either patients for participating in screening or staff for conducting screening. long-term contact or prolonged engagement also seemed to be difficult to implement rapidly in the remaining months of the program year. low-cost, rapid implementation strategies were seen as most needed, and it was noted that strategies from behavioral economics were the practices most likely to be rapidly implemented at low-cost. visual charts and graphs displaying successful screenings and enrollments while also emphasizing the remaining screenings and enrollments needed to be on schedule were chosen for further training because these tactics would involve virtually no additional cost to partner agencies and could be implemented immediately. likewise, shifting to "opt-out" enrollment procedures was encouraged, where referred patients would be automatically enrolled in case management unless they specifically objected. in addition, the network quickly scheduled a workshop on how to facilitate meetings so that supervisors in partner agencies would be assisted in immediately discussing and implementing the above course corrections and behavioral strategies with their staff. training on using visual incentives emphasized three important components of using this technique. first, it was important to make sure that enrollment goals were always visually displayed in the work area of staff performing screening and enrollment work. this could be something as simple as a hand-drawn sign in the work area noting how many patients had been enrolled compared with what that week's enrollment target was. 
ideally this technique would transition to an infographic connected to an electronic dashboard in real time, where results would be transparently displayed for all to see in an automatic way that did not take additional staff time to maintain. second, the visual incentive needed to be displayed vividly enough to alter or motivate new worker behavior, but not so vividly as to compete with, distract, or delay new worker behavior. in many social work settings, participants agreed that weekly updates are intuitive for most staff. without regular check-ins and updates of the target numbers, it could be easy for workers to lose their sense of urgency about meeting these time-constrained goals. third, training emphasized teaching staff how behavioral biases could reduce their effectiveness. many staff are used to (and often good at) cramming and working just-in-time, but this is not possible when staff cannot control all aspects of the work. screeners cannot control the flow of enrollees; rather, they must be ready to enroll new clients intermittently as soon as they see a screening is positive, so re-learning not to cram or work just-in-time suggested a change in workplace routines for many staff. training on "opt-out" choice architecture for network enrollment procedures emphasized using behavioral economics and behavioral science to help direct practitioners practice better. as noted above, behavioral economics seeks to alter people's behavior in predictable, honest, and ethical ways without forbidding any options or adding too-costly incentives, so that the best or healthiest choice is the easiest choice (thaler and sunstein 2008). training here also emphasized meeting thaler and sunstein's (2008) two standards for good choice architecture: (1) the choice had to be transparent, not hidden, and (2) it had to be cheap and easy to opt out. good examples of such choice architecture were highlighted, such as email and social media group enrollments, where one can unsubscribe and leave such a group with just one click. bad or false examples of choice architecture were also highlighted, such as auto-renewal for magazines or memberships where due dates by which to opt out are often hidden and there is always one more financial charge before one is free of the costly enrollment or subscription. training concluded by advising network participants to use opt-in choice architecture when the services in question are highly likely to be spam, not meaningful, or only relevant to a fraction of those approached. attendees were advised to use opt-out choice architecture when the services in question are highly likely to be meaningful, not spam, and relevant to most of those approached. since clients were only approached for enrollment after scoring high on a depression screen, automatic enrollment in a case management service, in which clients would receive at least one more contact from social services, was highly relevant to the population served in this pilot program and was encouraged, with clients always retaining the right to refuse.
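to make the two endorsed techniques concrete, a minimal illustrative sketch in python follows (the cutoff value, field names and weekly target are hypothetical, not taken from the network's actual screening system):

    EPDS_CUTOFF = 13          # hypothetical clinical cutoff for a positive screen
    WEEKLY_TARGET = 10        # hypothetical enrollment target for the visual tracker

    def referral_status(score, client_declined=False):
        # opt-out choice architecture: a positive screen is enrolled by default,
        # and declining is transparent and easy (a single recorded choice)
        if score < EPDS_CUTOFF:
            return "no referral indicated"
        return "declined referral" if client_declined else "auto-enrolled in case management"

    def weekly_progress(enrollments):
        # simple visual incentive: show enrollments against the weekly target
        done = len(enrollments)
        bar = "#" * done + "-" * max(WEEKLY_TARGET - done, 0)
        return f"[{bar}] {done}/{WEEKLY_TARGET} enrolled this week"

    print(referral_status(score=15))                        # auto-enrolled in case management
    print(referral_status(score=15, client_declined=True))  # declined referral
    print(weekly_progress(["c1", "c2", "c3"]))

the sketch is only meant to show how the default can carry the work that line staff were reluctant to initiate, while the weekly display keeps the remaining gap to the target visible.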
to jump-start changing the behavior of the staff in partner agencies actually doing the screenings, making the referrals, and enrolling patients in the case management program, the network quickly scheduled a facilitation training so that supervisors and all who led staff or chaired meetings could be prepared and empowered to discuss enrollment and teach topics like opt-out enrollment to staff. this training emphasized the importance of creating spaces for staff to express doubt or confusion about what was being asked of them. one technique that resonated with participants was doing check-ins with staff during a group training by asking staff to make "fists to fives," a hand signal on a 0-5 scale on how comfortable they were with the discussion, where holding a fist in the air is discomfort, disagreement, or confusion and waving all five fingers of one hand in the air meant total comfort or agreement with a query or topic. training also emphasized that facilitators and trainers should acknowledge that conflict and disagreement typically comes from really caring, so it was important to "normalize discomfort," call it out when people in the room seem uncomfortable, and reiterate that the partner agency is able and willing to have "the tough conversations" about the nature of the work. mid-course corrections attempted during implementation of a quality improvement system in a maternal and child health network offered several insights to how organizational policy practice and advocacy techniques may rapidly change on the ground. specifically, findings highlighted the importance of checking outcome data early enough to be able to respond to implementation concerns immediately. participants broadly endorsed organizational adoption of behavioral economic techniques to influence rapidly the work behavior of line staff engaged in screening and referral before lobbying senior executives to extend the program to future years. these findings invite further exploration of two areas: (1) the workplace experiences of line staff tasked with mid-course implementation corrections, and (2) the organizational and practice experiences of behavioral economic ("nudge") techniques. this network's approach to universal screening and referral was very clearly meant to be truly neutral or even biased in the client's favor. staff were allowed and even encouraged to use their own individual judgment and discretion to refer clients for case management, even if the client did not score above the clinical cutoff of the screening instrument. mindful of the dangers of rigid, top-down bureaucracy, the network explicitly sought to empower line staff to work in clients' favor, yet still experienced disappointing results. this outcome suggests several possibilities. first, it is possible that, as participants implied, line staff were sometimes demoralized workers or nervous non-clinicians who were not eager to convey difficult news regarding high depression scores to clients who may have already been difficult to serve. as hasenfeld's classic work (hasenfeld 1992) has explicated, the organizational pull toward people-processing in lieu of true people-changing is powerful in many human service organizations. tummers (2016) also recently showcases the tendency of workers to prioritize service delivery to motivated rather than unmotivated clients. 
smith (2017) suggests that regulatory and contractual requirements can ameliorate disparities in who gets prioritized for what kind of human service, but the variability of human service practice makes this problem hard to eliminate altogether. however, it is also possible that line staff did not see referral as clearly in a client's best interest but rather as additional paperwork and paper-pushing within their own workplaces, additional work that line staff were given neither extra time nor compensation to complete. given that ultimately the number of internal referrals that were undercounted or undocumented was seen as an important cause of disappointing project outcomes, staff reluctance to engage in extra bureaucratic sorting tasks is a distinct possibility. the line staff here providing screening may have seen their work as less of a clinical assessment process and more of a tedious, internal bureaucracy geared toward internal compliance and payment rather than getting needy clients to worthwhile treatment. further research on the experience of line staff members performing time-sensitive sorting tasks is needed to understand how, even in environments explicitly trying to be empowering and supportive of worker discretion, that discretion may have negative impacts on desired implementation outcomes. in addition to the experience of line staff in screening clients, the interest of agency supervisors in choosing behavioral economic techniques for staff training and screening processes also deserves further study. grimmelikhuijsen et al. (2017) advocate broadly for further study and understanding of behavioral public administration, which integrates behavioral economic principles and psychology, noting that whether one agrees or disagrees with the "nudge movement" (p. 53) in public administration, it is important to understand its growing influence. ho and sherman (2017) pointedly critique nudging and behavioral economic approaches, noting that they may hold promise for improving implementation and service delivery but do not focus on front-line workers or on the quality and consistency of organizational and bureaucratic services, in which arbitrariness remains a persistent problem. finally, more research is needed on links between organizational policy implementation and state policy. in this case, state policy primarily set report deadlines and funding amounts with little discernible impact on ongoing organizational implementation. this gap also points to challenges in how policymakers can feasibly learn from implementation innovation in the community and how successful innovations can influence the policy process going forward. this article's findings and any inferences drawn from them must be understood in light of several study limitations. this study used a case study method and ethnographic approaches of participant observation, a process which always runs the risk of the personal bias of the researcher intruding into data collection as well as the potential for social desirability bias among those observed. moreover, a case study serves to elaborate a particular phenomenon and issue, which may limit its applicability to other cases or situations.
a critical review of the use of the case study method in high-impact journals in health and social sciences found that case studies published in these journals used clear triangulation and member-checking strategies to strengthen findings and also used well-regarded case study approaches such as stake's and qualitative analytic methods such as sandelowski's (hyett et al. 2014). this study followed these recommended practices. continued research on health and human service program implementation that follows the criteria and standards analyzed by hyett et al. (2014) will contribute to the empirical base of this literature while ameliorating some of these limitations. research suggests that collaboration may be even more important for organizations than for individuals in the implementation of social innovations (berzin et al. 2015). the network studied here adopted behavioral economics as a primary means of focusing organizational collaboration. however, a managerial turn to nudging or behavioral economics must do more than achieve merely administrative compliance. "opt out, not opt in" organizational approaches could positively affect implementation of social programs in two ways. first, they could eliminate unnecessary implementation impediments (such as the difficult conversations about depression referrals resisted by staff in this case) by using tools such as automatic enrollment to push these conversations to more specialized staff who could better advise affected clients. second, such approaches could reduce the potential workplace dissatisfaction of line staff, including any potential discipline they could face for incorrectly following more complicated procedures. thaler and sunstein (2003) explicitly endorse worker welfare as a rationale and site for behavioral economic approaches. they note that every system, simply by being a system, has been planned with an array of choice decisions already made, and given that there is always a starting default, that default should be set to the option expected to produce the best outcomes. this study supports considering behavioral economic approaches for social program implementation as a way to reset maladaptive default settings and provide services in ways that can be more just and more effective for both workers and clients.

references
advancing a conceptual model of evidence-based practice implementation in public service sectors. administration and policy in mental health and mental health services research
policy fields, data systems, and the performance of nonprofit human service organizations
defining our own future: human service leaders on social innovation
administrative coordination in nonprofit human service delivery networks: the role of competition and trust. nonprofit and voluntary sector quarterly
institutional and market pressures on interorganizational collaboration and competition among private human service organizations
defining the implementation problem: organizational management versus cooperation
implementation matters: a review of research on the influence of implementation on program outcomes and the factors affecting implementation
helping the public make health risk decisions
implementation research: a synthesis of the literature
the "world café" in social work research
organizational innovation adoption: a multi-level framework of determinants and opportunities for future research
how to conduct a mini-ethnographic case study: a guide for novice researchers
behavioral public administration: combining insights from public administration and psychology
revisiting old friends: networks, implementation structures and the management of inter-organizational relations
daily affect and daily beliefs
human services as complex organizations
managing street-level arbitrariness: the evidence base for public sector quality improvement
methodology or method? a critical review of qualitative case study reports
successful program development using implementation evaluation
organizational policy advocacy for a quality improvement innovation in a maternal and child health network: lessons learned in early implementation
recognizing new opportunities: reconceptualizing policy advocacy in everyday organizational practice
qualitative description-the poor cousin of health research?
identifying attributes of relationship management in nonprofit policy advocacy
writing usable qualitative health research findings
effectiveness, transportability, and dissemination of interventions: what matters when?
workforce development and the organization of work: the science we need. administration and policy in mental health and mental health services research
clinical supervision in effectiveness and implementation research
the future of nonprofit human services
lessons learned from implementing school-based substance abuse prevention curriculums
financial struggles of a small community-based organization: a teaching case of the capacity paradox
libertarian paternalism is not an oxymoron. university of chicago public law & legal theory working paper 43
nudge: improving decisions about health, wealth, and happiness
lessons learned in translating research evidence on early intervention programs into clinical care. mcn
the relationship between coping and job performance
comprehensive quality programming and accountability: eight essential strategies for implementing successful prevention programs
identifying perinatal depression and anxiety: evidence-based practice in screening, psychosocial assessment and management

the author declares that this work complied with appropriate ethical standards. the author declares that they have no conflict of interest.

key: cord-016196-ub4mgqxb authors: wang, cheng; zhang, qing; gan, jianping title: study on efficient complex network model date: 2012-11-20 journal: proceedings of the 2nd international conference on green communications and networks 2012 (gcn 2012): volume 5 doi: 10.1007/978-3-642-35398-7_20 sha: doc_id: 16196 cord_uid: ub4mgqxb this paper systematically summarizes the relevant research on complex networks in terms of statistical properties, structural models, and dynamical behavior.
moreover, it emphatically introduces applications of the complex network in the economic system. transportation networks and the like are of the same kind [2] . emphasizing the structure of a system and analysing the system from its structure is the research approach of the complex network field. the difference is that the topological structure of these abstracted real networks differs from the networks discussed before and contains numerous nodes; as a result we call them complex networks [3] . in recent years, a large number of articles have been published in world-leading publications such as science, nature, prl, and pnas, which indirectly reflects that complex networks have become a new research hot spot. research on complex networks can be summarized as three closely related aspects: measuring the statistical properties of empirical networks; understanding why networks have these statistical properties by building corresponding network models; and forecasting the behaviour of the network system based on the structure and formation rules of the network. describing the world from the viewpoint of networks started in 1736, when the mathematician euler solved the problem of the seven bridges of königsberg. what is different about complex network research is that the massive number of nodes and their properties must first be viewed from a statistical point of view. differences in these properties reflect different internal structures of the network; moreover, different internal structures bring about differences in systemic function. therefore, the first step of research on complex networks is the description and understanding of their statistical properties, sketched as follows: in network research, the distance between two nodes is generally defined as the number of edges on the shortest path connecting them; the diameter of the network is the maximum distance between any two nodes; and the average path length of the network is the average of the distances over all pairs of nodes, representing the degree of separation of the nodes, namely the "size" of the network. an important discovery in complex network research is that the average path length of most large-scale real networks is much smaller than one would imagine, which is called the "small-world effect". this viewpoint comes from the famous "milgram small-world" experiment, which required participants to send a letter to one of their acquaintances so that the letter would eventually reach a designated recipient, in order to figure out the distribution of path lengths in the network; the result showed that the average number of intermediaries was just six, and the experiment is also the origin of the popular notion of "six degrees of separation". the aggregation extent of the nodes in the network is represented by the convergence factor (clustering coefficient) c, that is, how closely knit the network is. for example, in social networks, a friend of your friend may also be your friend, or two of your friends may be friends with each other. the computation is as follows: assume node i is connected to k_i other nodes through k_i edges; if these k_i nodes were all connected to each other, there would be k_i(k_i − 1)/2 edges among them. if the k_i nodes actually have e_i edges among them, then the ratio of e_i to k_i(k_i − 1)/2 is the convergence factor of node i. the convergence factor of the network is the average of the convergence factors of all nodes in the network.
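as an illustration of the convergence-factor calculation just described, the following sketch (a minimal example in r, assuming an adjacency-matrix representation; the function name and the small example matrix are ours, not from the original text) computes e_i / (k_i(k_i − 1)/2) for every node and then averages:

local_convergence <- function(adj) {
  n <- nrow(adj)
  c_i <- numeric(n)
  for (i in 1:n) {
    nbrs <- which(adj[i, ] == 1)          # the k_i neighbours of node i
    k_i  <- length(nbrs)
    if (k_i < 2) { c_i[i] <- 0; next }    # fewer than two neighbours: no possible triangle
    e_i <- sum(adj[nbrs, nbrs]) / 2       # edges actually present among the neighbours
    c_i[i] <- e_i / (k_i * (k_i - 1) / 2)
  }
  c_i
}
adj <- matrix(0, 5, 5)                     # a small illustrative network of 5 nodes
adj[1, 2] <- adj[2, 3] <- adj[1, 3] <- adj[3, 4] <- adj[4, 5] <- 1
adj <- adj + t(adj)                        # symmetric 0/1 matrix, no self-loops
mean(local_convergence(adj))               # convergence factor of the whole network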
obviously, only in a fully connected network does the convergence factor equal 1; in most other networks the convergence factor is less than 1. however, it turns out that nodes in most large-scale real-world networks tend to flock together: although the convergence factor c is far less than 1, it is far greater than n^(-1). the degree k_i of node i in graph theory is the total number of edges connected to node i, and the average of the degrees k_i over all nodes is called the average degree of the network, denoted ⟨k⟩. the degrees of the nodes in the network are described by the distribution function p(k), which gives the probability that a randomly chosen node has k edges; it also equals the number of nodes with degree k divided by the total number of nodes in the network. the statistical properties described above are the foundation of complex network research; with further research, other important statistical properties of real-world networks are generally discovered, such as the correlations among network resilience, betweenness, degree and the convergence factor. the simplest network model is the regular network, whose characteristic is that every node has the same number of neighbours, such as the 1-d chain, the 2-d lattice, the complete graph and so on. paul erdős and alfréd rényi introduced a completely random network model in the late 1950s: in a graph of n nodes, any two nodes are connected with probability p. its average degree is ⟨k⟩ = p(n − 1) ≈ pn; the average path length is l ∼ ln n / ln⟨k⟩; the convergence factor is c = p; and when n is very large, the node degree distribution is approximately a poisson distribution. the establishment of the random network model was a significant achievement in network research, but it can hardly describe the actual properties of the real world, so many new models have been proposed. as experiments show, most real-world networks exhibit both the small-world property (a small shortest path length) and aggregation (a large convergence factor). the regular network has aggregation, but its average shortest path length is large; the random graph has the opposite property, being small-world but with a small convergence factor. so regular networks and random networks cannot reflect the properties of the real world, which shows that the real world is neither completely regular nor completely random. watts and strogatz found, in 1998, a network which exhibits both the small-world property and high aggregation, which was a great breakthrough in complex network research. they rewired each edge to a new node with probability p, through which they built a network between the regular network and the random network (the ws network for short); it has a small average path length and a large convergence factor, while the regular network and the random network are the special cases p = 0 and p = 1 of the ws network. after the ws model was put forward, many scholars made further changes based on it; the nw small-world model proposed by newman and watts is the most extensively used. the difference between the nw model and the ws model is that the nw model connects a couple of nodes by adding an edge, instead of cutting off an original edge of the regular network. the advantage of the nw model is that it simplifies theoretical analysis, since the ws model may produce orphan nodes, which the nw model does not. in fact, when p is small and n is large, the results of the theoretical analysis of the two models are the same; we now call both small-world models.
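the erdős–rényi predictions quoted above (average degree ≈ pn, path length ∼ ln n / ln⟨k⟩, convergence factor = p) and the ws-type rewiring construction can be checked numerically; the sketch below assumes the igraph package and uses parameter values of our own choosing:

library(igraph)
n <- 1000; p <- 0.01
g_er <- sample_gnp(n, p)                          # random graph: each pair linked with probability p
mean(degree(g_er)); p * (n - 1)                   # observed vs predicted average degree
mean_distance(g_er); log(n) / log(p * n)          # observed vs predicted average path length
transitivity(g_er, type = "global"); p            # observed vs predicted convergence factor
# ws-type small world: a ring lattice (5 neighbours on each side) with edges rewired with probability 0.05
g_ws <- sample_smallworld(dim = 1, size = n, nei = 5, p = 0.05)
mean_distance(g_ws); transitivity(g_ws, type = "global")   # short paths together with high clustering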
although the small-world model can describe the small-world property and the high aggregation of the real world well, theoretical analysis of the small-world model reveals that the node degree distribution still takes an exponential form. as empirical results show, it is more accurate to describe most large-scale real-world networks in the form of a power law, namely p(k) ∼ k^(−γ). compared with an exponential distribution, a power law has no peak: most nodes have few connections, while a few nodes have very many connections, and there is no characteristic scale as there is in the random network, so barabási and others call networks whose degree distributions have this power-law characteristic scale-free networks. in order to explain the formation of scale-free networks, barabási and albert proposed the famous ba model. they argued that the networks proposed before did not consider two important properties of the real world, growth and preferential attachment: the former means that new nodes are constantly coming into the network, and the latter means that, after arriving, new nodes prefer to connect to nodes that already have large degree. not only did they carry out simulation analysis of the generating algorithm of the ba model, they also gave an analytic solution to the model using the mean-field method of statistical physics. the result shows that, after enough time of evolution, the degree distribution of the ba network no longer changes with time and is a power law with exponent steadily equal to 3. the establishment of the ba model is another great breakthrough in complex network research, demonstrating our deeper understanding of the objective network world. after that, many scholars made improvements to the model, such as nonlinear preferential attachment, accelerated growth, local events of edge rewiring, aging, adaptability and competition, and so on. note that most, but not all, real-world networks are scale-free, for some real-world networks' degree distributions are truncated forms of the power law. besides the small-world model and the scale-free network, scholars have also proposed other network models, such as the local-world evolving model, the weighted evolving network model and deterministic network models, to describe the network structure of the real world. the study of network structure is important, but the ultimate purpose is to understand and explain the modus operandi of systems based on these networks, and then to forecast and control the behaviour of the network system. this systemic dynamical property based on the network is generally called dynamical behaviour; it involves many things such as transport on the system, synchronization, phase transitions, web search and network navigation. the research above is strongly theoretical; a kind of network behaviour research with strong applications has increasingly aroused our interest, for example the spread of computer viruses on computer networks, the spread of communicable diseases among crowds and the spread of rumours in society, all of which are propagation behaviours obeying certain rules and spreading on certain networks. traditional network propagation models have always been built on regular networks, and we have to revisit this issue in light of further research on complex networks. here we emphatically introduce the applied research. one of the foremost purposes of research on network propagation behaviour is to understand the transmission mechanism of disease well.
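a quick numerical check of the ba result quoted above (a stationary power-law degree distribution with exponent near 3) can be made as follows; the sketch assumes igraph, and the growth parameters are illustrative only:

library(igraph)
g_ba <- sample_pa(n = 5000, power = 1, m = 3, directed = FALSE)  # growth plus preferential attachment
fit  <- fit_power_law(degree(g_ba), xmin = 10)                   # maximum-likelihood fit of p(k) ~ k^(-gamma)
fit$alpha                                                        # estimated exponent, typically close to 3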
substituting a node for each infected unit, if one unit can pass infection to another, or vice versa, through some route, then we regard the two units as connected; in this way we obtain the topological structure of the propagation network, and the relevant propagation model can then be built to study the propagation behaviour. obviously, the keys to studying network propagation models are the formulation of the propagation rule and the choice of the network topological structure. however, it does not conform to reality to simply regard the disease contact network as a regular, uniformly connected network. moore studied disease propagation behaviour in small-world networks, discovering that the propagation threshold of a disease in the small-world network is much smaller than in the regular network; at the same propagation rate and over the same time, the propagation scope of a disease in the small-world network is significantly greater than in the regular network, that is to say, compared with the regular network, disease spreads more easily in the small-world network. pastor-satorras and others studied propagation behaviour in scale-free networks, and the result is striking: while there is always a positive propagation threshold in both regular and small-world networks, in the scale-free network the propagation threshold proves to be 0; similar results are obtained when analysing other scale-free networks. as many experiments show that real-world networks are both small-world and scale-free, the conclusion described above is quite discouraging. fortunately, whether biological viruses or computer viruses, most have little infectiousness (k = 1) and do little harm. however, once the intensity of a disease or virus reaches a certain degree, we have to pay sufficient attention to it; the measures to control it cannot rely entirely on the improvement of medical conditions, and we have to quarantine nodes and shut down the relevant connections in order to cut off avenues of infection, by which we change the topological structure of the propagation network. in fact, it was precisely in this way that the fight against sars was won in the summer of 2003 in china. studying the transmission mechanism of a disease is not the whole question; our ultimate goal is to master how to control disease propagation efficiently. in practical applications, however, it is hard to determine the number of nodes, namely the number of units which may connect with other nodes during the infectious period. for example, in research on the spread of stds, researchers obtain information about patients and high-risk groups only through questionnaire surveys and oral questioning, and the replies have little reliability. for that reason, quite a lot of immunization strategies have been put forward by scholars based on the above considerations, such as "acquaintance immunization", "natural exposure" and "vaccination". analysing the phenomenon of disease spread is not the only purpose of researching network propagation behaviour; what is more, a large number of things can be analysed through it. for example, we can apply it to research on propagation behaviour in social networks, the basic idea being as follows: first abstract the topological structure of the social network using complex network theory, then analyse the transmission mechanism according to certain propagation rules, and finally analyse how the propagation can be influenced through various means.
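the point about quarantining hub nodes and cutting their connections can be illustrated with a crude spreading simulation; the sketch below (igraph assumed, with parameters and function names of our own) compares immunizing nodes at random with immunizing the most connected nodes on a scale-free contact network:

library(igraph)
spread_size <- function(g, immune, beta = 0.2, steps = 50) {
  state <- rep("S", vcount(g)); state[immune] <- "R"          # immunized/quarantined nodes are removed
  state[sample(which(state == "S"), 1)] <- "I"                # one initial infective
  for (t in 1:steps) {
    for (i in which(state == "I")) {
      nbrs <- as.integer(neighbors(g, i))
      at_risk <- nbrs[state[nbrs] == "S"]
      state[at_risk[runif(length(at_risk)) < beta]] <- "I"    # each contact transmits with probability beta
    }
  }
  sum(state == "I")                                           # final number ever infected
}
g <- sample_pa(2000, power = 1, m = 2, directed = FALSE)      # a synthetic scale-free contact network
n_imm <- 100
spread_size(g, sample(vcount(g), n_imm))                      # random immunization
spread_size(g, order(degree(g), decreasing = TRUE)[1:n_imm])  # targeted (hub) immunization: smaller outbreak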
actually, this kind of work has already started, such as the spread of knowledge, the spread of new product network and bank financial risk; they have both relation and difference, the purpose of the research of the former is to contribute to its spread; the latter is to avoid its spread. systems science. shanghai scientific and technological educational publishing house pearson education statistical mechanics of complex network the structure and function of complex networks key: cord-256713-tlluxd11 authors: welch, david title: is network clustering detectable in transmission trees? date: 2011-06-03 journal: viruses doi: 10.3390/v3060659 sha: doc_id: 256713 cord_uid: tlluxd11 networks are often used to model the contact processes that allow pathogens to spread between hosts but it remains unclear which models best describe these networks. one question is whether clustering in networks, roughly defined as the propensity for triangles to form, affects the dynamics of disease spread. we perform a simulation study to see if there is a signal in epidemic transmission trees of clustering. we simulate susceptible-exposed-infectious-removed (seir) epidemics (with no re-infection) over networks with fixed degree sequences but different levels of clustering and compare trees from networks with the same degree sequence and different clustering levels. we find that the variation of such trees simulated on networks with different levels of clustering is barely greater than those simulated on networks with the same level of clustering, suggesting that clustering can not be detected in transmission data when re-infection does not occur. to understand the dynamics of infectious diseases it is crucial to understand the structure and interactions within the host population. conversely, it is possible to learn something about host population structure by observing the pattern of pathogen spread within it. in either case, it is necessary to have a good model of the host population structure and interactions within it. networks, where nodes of the network represent hosts and edges between nodes represent contacts across which pathogens may be transmitted, are now regularly used to model host interactions [1] [2] [3] . while many models have been proposed to describe the structure of these contact networks for different populations and different modes of transmission, it is not yet understood how different features of networks affect the spread of pathogens. one promising development in this field is the use of statistical techniques which aim to model a contact network based on data relating to the passage of a pathogen through a population. such data includes infection times [4] [5] [6] and genetic sequences that are collected from an epidemic present in the population of interest [7] [8] [9] . these data have previously been shown to be useful for reconstructing transmission histories (the distinction between a contact network and a transmission history is that a contact network includes all edges between hosts across which disease may spread, whereas the transmission history is just the subset of edges across which transmission actually occurred). infection times can be used to crudely reconstruct transmission histories by examining which individuals were infectious at the time that any particular individual was infected [10] . genetic sequences from viruses are informative about who infected whom by comparing the similarity between sequences. 
due to the random accumulation of mutations in the sequences, we expect sequences from an infector/infectee pair to be much closer to each other than sequences from a randomly selected pair in the population (see [11] for a review of modern approaches to analysing viral genetic data). the work of [4] [5] [6] seeks to extend the use of this data to reconstruct a model for the whole contact network rather than just the transmission history. in theory, these statistical methods could settle arguments about which features of the network are important in the transmission of the disease and which are simply artifacts of the physical system. in this article, we focus on clustering in networks and ask whether or not networks which differ only in their level of clustering could be distinguished if all we observed was transmission data from an epidemic outbreak. the answer to this question will determine whether these new statistical techniques can be extended to estimate the level of clustering in a network. throughout, we consider a population with n individuals that interact through some contact process. this population and its interactions are fully described by an undirected random network, denoted y, on n nodes. a simple example of a network is shown in figure 1 with illustrations of some of the terms we use in this article. y can be represented by the symmetric binary matrix [y_ij] where y_ij = y_ji = 1 if an edge is present between nodes i and j, otherwise y_ij = 0. we stipulate that there are no loops in the network, so y_ii = 0 for all i. the degree of the ith node, denoted d_i, is the number of edges connected to i, so d_i = Σ_j y_ij. clustering is one of the central features of observed social networks [12, 13] . intuitively, clustering is the propensity for triangles or other small cycles to form, so that, for example, a friend of my friend is also likely to be my friend. where there is a positive clustering effect, the existence of edges (i, j) and (i, k) increases the propensity for the edge (j, k) to exist, while a negative clustering effect implies that (j, k) is less likely to exist given the presence of (i, j) and (i, k). when there is no clustering effect, the presence or absence of (i, j) and (i, k) has no bearing on that of (j, k). thus clustering is one of the most basic of the true network effects: when it is present, the relationship between two nodes depends not only on properties of the nodes themselves but also on the presence or absence of other relationships in the network. the effect of clustering on the dynamics of stochastic epidemics that run over networks remains largely unknown, though it has been studied in a few special cases. the difficulty with studying this effect in isolation is in trying to construct a network model where clustering can change but other properties of the network are held constant. in the simulations we study here, we focus on holding the degree sequence of a network constant (that is, each node maintains the same number of contacts) while varying the level of clustering. intuition suggests that clustering will have some effect on epidemic dynamics since, in a graph with no cycles, if an infection is introduced to a population at node i and there is a path leading to j and then k, k can only become infected if j does first. however, where cycles are present, there may be multiple paths leading from i to k that do not include j, so giving a different probability that k becomes infected and a different expected time to infection for k. figure 1.
an example of a network on 7 nodes. the nodes are the red dots, labelled 1 to 7 and represent individuals in the population. the edges are shown as black lines connecting the nodes and represent possible routes of transmission. the degree of each node is number of edges adjacent to it, so that node 5 has degree 3 and node 7 has degree 1. the degree sequence of the network is the count of nodes with a given degree and can be represented by the vector (0, 2, 0, 3, 1, 1) showing that there are 0 nodes of degree 0, 2 of degree 1, 0 of degree 2 and so on. a cycle in the network is a path starting at a node and following distinct edges to end up back at the same node. for example, the path from node 6 to node 1 to node 3 and back to node 6 is a cycle but there is no cycle that includes node 4. clustering is a measure of propensity of cycles of length 3 (triangles) to form. here, the edges (2,1) and (2,6) form a triangle with the edge (1,6), so work to increase clustering in the network. however, the edges (2,1) and (2,5) do not comprise part of a triangle as (1,5) does not exist, so work to decrease clustering. previous work on the effect of clustering on epidemic dynamics has produced a variety of results which are largely specific to particular types of networks. newman [14] and britton et al. [15] show that for a class of networks known as random intersection graphs in which individuals belong to one or more overlapping groups and groups form fully connected cliques, an increase in clustering reduces the epidemic threshold, that is, major outbreaks may occur at lower levels of transmissibility in highly clustered networks. newman [14] , using heuristic methods and simulations, suggests that for sufficiently high levels of transmissibility the expected size of an outbreak is smaller in a highly clustered network than it would be in a similar network with lower clustering. these articles show that graphs with different levels of clustering do, at least in some cases, have different outbreak probabilities and final size distributions for epidemic outbreaks. kiss and green [16] provide a succinct rebuttal to the suggestion that the effects found by [14] and [15] are solely due to clustering. they show that, while the mean degree of the network is preserved in the random intersection graph, the degree distribution varies greatly (in particular, there are many zero-degree nodes) and variance of this distribution increases with clustering. an increase in the variance of the degree distribution has previously been shown to lower the epidemic threshold. they demonstrate that a rewiring of random intersection graphs that preserves the degree sequence but decreases clustering produces networks with similarly lowered epidemic thresholds and even smaller mean outbreak sizes. our experiments, reported below, are similar in spirit to those of [16] but look at networks with different degree distributions and study in detail how epidemic data from networks with varying levels of clustering might vary. ball et al. [17] show, using analytical techniques, that clustering induced by household structure in a population (where individuals have many contacts with individuals in the same household and fewer global contacts with those outside of the household) has an effect on probability of an outbreak and the expected size of any outbreak. 
the probability of an outbreak, in some special cases, is shown to be monotonically decreasing with clustering coefficient and the expected outbreak size also decreases with clustering. there is no suggestion that these results will apply to clustered networks outside of this specific type of network or that they apply when degree distributions are held constant. eames [18] also studies networks with two types of contacts: regular contacts (between people who live or work together, for example) and random contacts (sharing a train ride, for example). using simulations of a stochastic epidemic model and deterministic approximations, it is shown that both outbreak final size and probability of an outbreak are reduced with increased clustering, particularly when regular contacts dominate. as the number of random contacts increases, the effect of clustering reduces to almost zero. strong effects on the expected outbreak size in networks with no random contacts are observed for values of the clustering coefficient above about 0.4, however, no indication of the magnitude of the variance of these effects is given. keeling [19] reports similar results, introducing clustering to a network using a spatial technique-nodes live in a two-dimensional space and two nodes are connected by an edge with a probability inversely proportional to their distance. the clustering comes about by randomly choosing positions in space to which nodes are attracted before connections are made. the results suggest that changes in clustering at lower levels has little effect on the probability of an outbreak, but as the clustering coefficient reaches about 0.45, the chance of an outbreak reduces significantly. as in [14] and [15] , while the mean degree of network nodes is held constant here, nothing is said about the degree distribution as clustering varies. serrano and boguñá [20] look specifically at infinite power-law networks and shows that the probability of an outbreak increases as clustering increases but the expected size of an outbreak decreases. some more recent papers seek to distinguish the effects of clustering from confounding factors such as assortativity and degree sequence. miller [21] develops analytic approximations to study the interplay of various effects such as clustering, heterogeneity in host infectiousness and susceptibility and the weighting of contacts on the spread of disease over a network. the impact of clustering on the probability and size of an outbreak is found to be small on "reasonable" networks so long as the average degree of the network is not too low. the rate at which the epidemic spreads, measured by the reproduction number, r 0 , is found to reduce with increased clustering in such networks. in networks with low mean degree, r 0 may be reduced to point of affecting the probability and size of an outbreak. miller [22] points out that studies of the effects of clustering should take into account assortativity in the network, that is, the correlations in node degree between connected nodes. assortativity has been shown to affect epidemic dynamics and changing the level of clustering in a network can change the level of assortativity. to distinguish between the effects of assortativity and clustering, a method of producing networks with arbitrary degree distributions and arbitrary levels of clustering with or without correlated degrees is presented and studied using percolation methods. 
the effect of increasing clustering in these models is to reduce the probability of outbreaks and reduce the expected size of an epidemic. badham and stocker [23] use simulated networks and epidemics to study the relationship between assortativity and clustering. their results suggest that increased clustering diminished the final size of the epidemic, while the effect of clustering on probability of outbreak was not very clear. like [23] , moslonka-lefebvre et al. [24] use simulations to try to distinguish the effects of clustering and assortativity but look at directed graphs. here, they find that clustering has little effect on epidemic behaviour. melnik et al. [25] propose that the theory developed for epidemics on unclustered (tree-like) networks applies with a high degree of accuracy to networks with clustering so long as the network has a small-world property [12] . that is, if the mean length of the shortest path between vertices of the clustered network is sufficiently small, quantities such as the probability of an outbreak on the network can be estimated using known results that require only the degree distribution and degree correlations. the theory is tested using simulations on various empirical networks from a wide range of domains and synthetic networks simulated from theoretical models. taken together, these studies show that clustering can have significant effects on crucial properties of epidemics on networks such as the probability, size and speed of an outbreak. these results primarily relate to the final outcome and mean behaviour of epidemics. however, if we can obtain a transmission tree for an outbreak then we have information from the start to the finish of a particular epidemic including times of infection and who infected whom. since epidemics are stochastic processes, data from a particular epidemic may differ considerably from the predicted mean. whether or not such data contains information about clustering in the underlying network is the question we seek to address here. we simulate epidemics over networks with fixed degree distributions and varying levels of clustering and inspect various summary statistics of the resulting epidemic data, comparing the summaries for epidemics run over networks with the same degree distribution but different levels of clustering. the precise details of the simulations are described in section 2. the results of the simulations, presented in section 3, show that there is likely little to no signal of clustering in a contact network to be found in a single realisation of an epidemic process over that network. we conclude that it is unlikely that clustering parameters can be inferred solely from epidemiological data that relates to the transmission tree and suggest that further work in parameter estimation for contact networks would be best focused on other properties of contact networks such as degree distribution or broader notions of population structure. we simulate multiple networks from two network models: a bernoulli model [26] and a power-law model [27] . under the bernoulli model (also called the erdős-rényi or binomial model), an edge between nodes i and j is present with some fixed probability 0 ≤ p ≤ 1 and absent with probability 1 − p, independently of all other edges. due to their simplicity, bernoulli networks are well-studied and commonly used in disease modeling but are not generally thought to be accurate models of social systems. 
a bernoulli network is trivial to construct by first sampling the total number of edges in the graph, |y| ∼ binomial(n(n − 1)/2, p), where n is the number of nodes in the network, and then sampling |y| edges uniformly at random without replacement. we set n = 500 and p = 7/n = 0.014 in the simulations reported below. a power-law network is defined as having a power-law degree distribution, that is, for nodes i = 1, . . . , n, p(d_i = k) ∝ k^(−α) for some α > 0. power-law networks are commonly used to model social interactions and various estimates of α in the range 1.5-2.5 have been claimed for observed social networks. in the model used here, we set α = 1.8. we simulate power-law networks using a reed-molloy type algorithm [28] . that is, the degree of each node, d_i, i = 1, . . . , n, is sampled from the appropriate distribution. node i is then assigned d_i "edge stubs" and pairs of stubs are sampled uniformly without replacement to be joined and become edges. when all stubs have been paired, loops are removed and multiple edges between the same nodes are collapsed to single edges. this last step of removing loops and multiple edges causes the resulting graph to be only an approximation of a power-law graph, but the approximation is good for even moderately large n. we set n = 600 and consider only the largest connected component of the network in the simulations reported below. the size of the networks considered here is smaller than some considered in simulation studies though on a par with others (see, for example, [25] who looks at a wide range of network sizes). we choose these network sizes partly for convenience and partly because the current computational methods for statistical fitting of epidemic data to network models would struggle with networks much larger than a few hundred nodes [6] , so our interest is in networks around this size. from each sampled network, y, we generate two further networks, y_hi and y_lo, that preserve the degrees of all nodes in y but have, respectively, high and low levels of clustering. we achieve this using a monte carlo algorithm implemented in the ergm package [29] in r [30] that randomly rewires the input network while preserving the degree distribution. a similar algorithm is implemented in bansal et al. [31] . for details of the ergm model and implementation of this algorithm, we refer the reader to the package manual [32] and note that the two commands used to simulate our networks are
y_hi = simulate(y ~ gwesp(0.2, fixed = TRUE), theta0 = 5, constraints = ~degreedist, burnin = 5e+5)
and
y_lo = simulate(y ~ gwesp(0.2, fixed = TRUE), theta0 = -5, constraints = ~degreedist, burnin = 5e+5)
we measure clustering in the resulting networks using the clustering coefficient [12] , defined as follows. let n_i = {j | y_ij = 1} be the neighbourhood of vertex i and d_i = |n_i| be the degree of i, and let e_i denote the number of edges present between the neighbours of i. for d_i > 1, the local clustering coefficient is c_i = e_i / (d_i(d_i − 1)/2), which is the ratio of extant edges between neighbours of i to possible edges. for d_i ∈ {0, 1}, let c_i = 0. the (global) clustering coefficient is the mean of the local coefficients, c = (1/n) Σ_i c_i. the choice of c_i = 0 for d_i ∈ {0, 1} is somewhat arbitrary, though other possible choices, such as c_i = 1 or excluding those statistics from the mean, give similar qualitative results in our experiments. over each simulated network, we simulate a stochastic susceptible-exposed-infectious-removed (seir) epidemic. all nodes are initially susceptible to the infection. the outbreak starts when a single node is chosen uniformly at random and exposed to the disease.
after a gamma-distributed waiting period with mean k_e θ_e and variance k_e θ_e^2, the node becomes infectious. the infection may spread across the edges of the network, from infectious nodes to susceptible nodes, according to a poisson process with rate β. infected nodes recover after an infectious period with a gamma-distributed waiting time with mean k_i θ_i and variance k_i θ_i^2. once a node is recovered, it plays no further part in the spread of the infection. the process stops when there are no longer any exposed or infectious nodes. for each pair, y_hi and y_lo, we start the infection from the same node. we condition on the outbreak infecting at least 20 nodes. the parameter values are set at β = 0.1, k_e = k_i = 1 and θ_e = θ_i = 3 in the simulations reported below. a transmission tree encodes all information about the epidemic outbreak it describes. as such, it is a very complicated object. to compare sets of transmission trees and decide whether there are some systematic differences between them, we rely on various summary statistics derived from the trees and compare the distribution of the summaries over the ensembles in question. the summaries we use can be divided into two groups: those relating solely to the number of infected through time and those relating to the topology of the tree. the first group of summaries can all be derived from the epidemic curves, that is, the number infected as a function of time. from this, we derive scalar summaries: the total number of individuals infected, the length of the epidemic (measured from the time of the first infection to the last recovery), the maximum of the epidemic curve and the time of that maximum. we label each individual in the population (equivalently, each node in the contact network) with labels 1, . . . , n. a transmission tree, a graph distinct from the contact network, has a time component and can be defined as follows; an example of a transmission tree and the notation is given in figure 2. there are three types of nodes in a transmission tree (not to be confused with nodes in the contact network): the root node corresponding to the initial infection, transmission or internal nodes corresponding to transmission events, and leaf or external nodes corresponding to recovery events. leaf nodes are defined by the time and label pair (t_i, u_i), where t_i ≥ 0 is the time of the recovery event and u_i is the label of the individual that recovered. the internal nodes are associated with the triple (t_i, u_i, v_i) giving the time of the event, t_i, the label u_i of the exposed individual, and v_i, the transmitter or "parent" of the infection. the root node is like an internal node but the infection parent is given as 0, so it is denoted (t_0, u_0, 0). the branches of the tree are the times between infection, transmission and recovery events for a particular vertex. for example, if the individual labelled u is infected at event (t_1, u, v_1), is involved in transmission events (t_k, v_k, u), k = 2, . . . , m − 1, and recovers at (t_m, u), where t_i < t_j for i < j and {v_1, . . . , v_(m−1)} are other individuals in the population, then there are m − 1 branches of the transmission tree at u defined by the intervals (t_i, t_(i+1)], for i = 1, . . . , m − 1.
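to make the epidemic process concrete, here is a minimal event-driven seir sketch in r over an igraph network; it follows the description above (gamma latent and infectious periods, rate-β transmission along edges) but is our own illustration, with the stated parameter values as defaults:

library(igraph)
simulate_seir <- function(g, beta = 0.1, k_e = 1, th_e = 3, k_i = 1, th_i = 3) {
  n <- vcount(g)
  t_exp <- rep(Inf, n); infector <- rep(NA, n)
  seed <- sample(n, 1); t_exp[seed] <- 0; infector[seed] <- 0
  queue <- seed
  while (length(queue) > 0) {
    u <- queue[which.min(t_exp[queue])]               # next exposed node to become infectious
    queue <- setdiff(queue, u)
    t_inf <- t_exp[u] + rgamma(1, shape = k_e, scale = th_e)   # end of latent period
    t_rec <- t_inf + rgamma(1, shape = k_i, scale = th_i)      # end of infectious period
    for (v in as.integer(neighbors(g, u))) {
      cand <- t_inf + rexp(1, rate = beta)            # first contact along the edge after u turns infectious
      if (cand < t_rec && cand < t_exp[v]) {          # counts only while u is infectious and v not yet exposed earlier
        if (is.infinite(t_exp[v])) queue <- c(queue, v)
        t_exp[v] <- cand; infector[v] <- u
      }
    }
  }
  keep <- is.finite(t_exp)
  data.frame(node = which(keep), time = t_exp[keep], infector = infector[keep])
}
tree <- simulate_seir(sample_gnp(500, 0.014))         # one outbreak on a bernoulli network
nrow(tree)                                            # total number of individuals ever infected

the (node, time, infector) record returned by this sketch plays the role of the transmission tree in the summaries that follow.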
we summarise the transmission tree using the following statistics: the mean branch length between internal nodes (corresponding to the mean time between secondary infections for each individual); the mean branch length of those branches adjacent to a leaf node (which corresponds to the mean time from the last secondary infection to removal for each individual); the number of secondary infections caused by each infected individual (that is, for each infected individual v we count the number of internal nodes that have the form (t i , u i , v), for some i); and, the distribution of infective descendants for each individual, v, which is defined recursively as the sum of secondary infections caused by v and the secondary infections caused by the secondary infections of v and so on. an equivalent definition is to say that number of infective descendants of v is the number of leaves that have a node of the form (t, u i , v) as an ancestor. finally, we consider the number of cherries in the tree [33] which is the number of pairs of leaves that are adjacent to a common internal node. this simple statistic is chosen as it is easy to compute and contains information about the topology or shape of the tree. to compare the number of cherries in outbreaks of different size, we look at the ratio of extant cherries to the maximum possible number of cherries for the given outbreak. the experimental pipeline can thus be summarised as: 1. repeat for i = 1, . . . , 500: (a) sample a graph y i according to given degree distribution. (b) simulate two further graphs y hi i and y lo i with high clustering and low clustering, respectively, using a monte carlo sampler that rewires y i to alter the clustering level while preserving the degree of each node. we report results here for seir epidemics run over bernoulli and power-law networks. a number of smaller trials that we do not report were run: with different values chosen for the network and epidemic parameters; on networks with the same degree distributions as a random intersection graph; and, using an sir epidemic rather than an seir. the results for those smaller trials were qualitatively similar to the results reported here. the distributions of the measured clustering coefficients is shown in figure 3 and show that the simulated networks with high and low clustering for a given degree distribution are easily distinguished from one another. the bernoulli networks with low clustering contain no triangles, so the clustering coefficient for each of these networks is zero, while for highly-clustered bernoulli networks, clustering coefficients are in the range (0.28,0.33). for the power-law networks, the low clustered networks have clustering in the range (0.00,0.09) while the highly clustered networks have clustering in the range (0.24,0.38). figures 4 and 5 show comparisons of summary statistics for networks with differing levels of clustering and bernoulli degree distributions. the summaries show some differences between the outbreaks on the differently clustered networks. in particular, the outbreaks in the highly-clustered networks spread more slowly, on average, leading to marginally longer epidemics with fewer individuals infected at the peak of the outbreak, that occurs slightly later, than we see in outbreaks on the networks with low clustering. these mean effects are in line with the predictions of [22] . 
the variances of the measured statistics, however, are sufficiently large due to stochastic effects in the model that the ranges of the distributions overlap almost completely in most cases. statistics derived from the transmission tree appear to add little information, with only the number of cherries differing in the mean. figures 6 and 7 show the corresponding distributions for networks with power-law degree distributions. again, differences in the means between the two sets of statistics are apparent with the mean length of epidemic, total number infected and number infected at peak all lower in the epidemics on networks with high-clustering. the largest difference is found in the total number infected, where in the low-clustered networks, the range of the statistic is (231, 445) while it is just (211, 361) in the high-clustered networks. the primary cause here is due to the change in size of the largest connected component of the network. if we adjust for this by looking instead at the proportion of the giant component infected, the distributions again overlap almost completely with the range for the proportion infected in the low-clustered networks being (0.39, 0.74) and (0.42, 0.74) for the high-clustered networks. the results presented above suggest that the behaviour of an epidemic on a random network with a given degree sequence is relatively unaffected by the level of clustering in the network. some effect is seen, but it is small relative to the random variation we see between epidemics on similarly clustered networks. the results also suggest that the complete transmission tree from an epidemic provides little information about clustering that is not present in the epidemic curve. these results do not imply that clustering has little effect, rather they suggest as noted in [16] , the apparently strong effect of clustering observed by some is more likely to due to a change in the degree distribution-an effect we have nullified by holding the degree sequence constant. these broader effects are probably best analysed on a grosser level such as the household or subgroup level rather than at the individual level at which clustering is measured. our simulation method, in which the degree sequence for each network is held constant while clustering levels are adjusted, places significant restrictions on the space of possible graphs and therefore clustering coefficients. the levels of clustering achieved in the simulations reported here (for example, having a clustering coefficient in the low-clustered bernoulli case of 0 versus a mean of 0.30 for the high-clustered case) are not so high as those considered in the some of the simulations and theoretical work described in section 1, and this may partly account for the limited effect on epidemic outcomes that we find here. there is little known about the levels of clustering found in real contact networks [31] (though one recent detailed study [34] find values for clustering in a social contact network in the region 0.15-0.5) and no evidence to suggest that very extreme values of clustering are achieved for a given degree sequence. it is plausible, however, that the degree sequence of a social network of interest could be found-for example, via ego-centric or full-network sampling [34] [35] [36] -and therefore reasonable to explore the achievable levels of clustering conditional on the degree sequence. 
in doing so, we separate the effects on epidemic dynamics of change in the degree sequence of the contact network from those of clustering. from a statistical point of view, these results indicate that even with full data from a particular epidemic outbreak, such as complete knowledge of the transmission tree, it is unlikely that the level of clustering in the underlying contact network could be accurately inferred independently of the degree distribution. this is primarily due to the large stochastic variation found from one epidemic to the next that masks the relatively modest effects of clustering on an outbreak. with this much stochastic noise, we suggest that it would require data from many outbreaks over the same network (that is, pathogens with a similar mode of transmission spreading in the same population) to infer the clustering level of that network with any accuracy. the results also suggest that attempting to estimate a clustering parameter without either estimating or fixing the degree sequence, as in goudie [37] , may see the estimated clustering parameter acting chiefly a proxy for the degree sequence. it cannot be ruled out that a statistical method, which takes into account the complete data rather than the summaries we use here, or which takes data from parts of the parameter space that we have not touched on here, could find some signal of clustering from such data. in practise, however, it would be highly unusual to have access to anything approaching complete data. a more realistic data set might include times of onset and recovery from disease symptoms for some individuals in the population and sequences taken from viral genetic material. the noise that characterises such data sets already makes it difficult to accurately reconstruct the transmission tree; this extra uncertainty would likely make any inference of a clustering parameter, in the absence of other information, very difficult. i thank david hunter, marcel salathé, mary poss and an anonymous referee for useful comments and references that improved this paper. this work is supported by nih grant r01-gm083603-01. a survey of statistical network models. foundations and trends in machine learning network epidemiology: a handbook for survey design and data collection the structure and function of complex networks bayesian inference for stochastic epidemics in populations with random social structure. scand bayesian inference for contact networks given epidemic data. 
scand a network-based analysis of the 1861 hagelloch measles data episodic sexual transmission of hiv revealed by molecular phylodynamics integrating genetic and epidemiological data to determine transmission pathways of foot-and-mouth disease virus statistical inference to advance network models in epidemiology different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures evolutionary analysis of the dynamics of viral infectious disease collective dynamics of small world networks why social networks are different from other types of networks properties of highly clustered networks epidemics on random graphs with tunable clustering comment on "properties of highly clustered networks analysis of a stochastic sir epidemic on a random network incorporating household structure modelling disease spread through random and regular contacts in clustered populations the implications of network structure for epidemic dynamics percolation and epidemic thresholds in clustered networks spread of infectious disease through clustered populations percolation and epidemics in random clustered networks the impact of network clustering and assortativity on epidemic behaviour. theor disease spread in small-size directed networks: epidemic threshold, correlation between links to and from nodes, and clustering the unreasonable effectiveness of tree-based theory for networks with clustering on random graphs statistical mechanics of complex networks a critical point for random graphs with a given degree sequence a package to fit, simulate and diagnose exponential-family models for networks, version 2.2-2 r development core team. r: a language and environment for statistical computing. r foundation for statistical computing exploring biological network structure with clustered random networks a package to fit, simulate and diagnose exponential-family models for networks distributions of cherries for two models of trees a high-resolution human contact network for infectious disease transmission using data on social contacts to estimate age-specific transmission parameters for respiratory-spread infectious agents social contacts and mixing patterns relevant to the spread of infectious diseases what does a tree tell us about a network? this article is an open access article distributed under the terms and conditions of the creative commons attribution license key: cord-015967-kqfyasmu authors: tagore, somnath title: epidemic models: their spread, analysis and invasions in scale-free networks date: 2015-03-20 journal: propagation phenomena in real world networks doi: 10.1007/978-3-319-15916-4_1 sha: doc_id: 15967 cord_uid: kqfyasmu the mission of this chapter is to introduce the concept of epidemic outbursts in network structures, especially in case of scale-free networks. the invasion phenomena of epidemics have been of tremendous interest among the scientific community over many years, due to its large scale implementation in real world networks. this chapter seeks to make readers understand the critical issues involved in epidemics such as propagation, spread and their combat which can be further used to design synthetic and robust network architectures. the primary concern in this chapter focuses on the concept of susceptible-infectious-recovered (sir) and susceptible-infectious-susceptible (sis) models with their implementation in scale-free networks, followed by developing strategies for identifying the damage caused in the network. 
the relevance of this chapter can be understood when methods discussed in this chapter could be related to contemporary networks for improving their performance in terms of robustness. the patterns by which epidemics spread through groups are determined by the properties of the pathogen carrying it, length of its infectious period, its severity as well as by network structures within the population. thus, accurately modeling the underlying network is crucial to understand the spread as well as prevention of an epidemic. moreover, implementing immunization strategies helps control and terminate theses epidemics. for instance, random networks, small worlds display lesser variation in terms of neighbourhood sizes, whereas spatial networks have poisson-like degree distributions. moreover, as highly connected individuals are of more importance considering disease transmission, incorporating them into the current network is of outmost importance [4] . this is essential in case of capturing the complexities of disease spread. architecturally, scale-free networks are heterogenous in nature and can be dynamically constructed by adding new individuals to the current network structure one at a time. this strategy is similar to naturally forming links, especially in case of social networks. moreover, the newly connected nodes or individuals link to the already existent ones (with larger connections) in a manner that is preferential in nature. this connectivity can be understood by a power-law plot with the number of contacts per individual, a property which is regularly observed in case of several other networks like that of power grids, world-wide-web, to name a few [14] . epidemiologists have worked hard on understanding the heterogeneity of scalefree networks for populations for a long time. highly connected individuals as well as hub participants have played essential roles in the spread and maintenance of infections and diseases. figure 1 .1 illustrates the architecture of a system consisting of a population of individuals. it has several essential components, namely, nodes, links, newly connected nodes, hubs and sub-groups respectively. here, nodes correspond to individuals and their relations are shown as links. similarly, newly connected nodes correspond to those which are recently added to the network, such as initiation of new relations between already existing and unknown individuals [24] . hubs are fig. 1.1 a synthetic scale-free network and its characteristics those nodes which are highly connected, such as individuals who are very popular among others and have many relations and/or friends. lastly, sub-groups correspond to certain sections of the population which have individuals with closely associated relationships, such as group of nodes which are highly dense in nature, or having high clustering coefficient. furthermore, it is important in having large number of contacts as the individuals are at greater risk of infection and, once infected, can transmit it to others. for instance, hub individuals of such high-risk individuals help in maintaining sexually transmitted diseases (stds) in different populations where majority belong to long-term monogamous relationships, whereas in case of sars epidemic, a significant proportion of all infections are due to high risk connected individuals. 
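the role of hubs described above is easy to see numerically; the following sketch (igraph assumed, parameters ours) grows a synthetic scale-free network by adding individuals one at a time with preferential linking and reports how much of the total connectivity is concentrated in its most connected individuals:

library(igraph)
g <- sample_pa(1000, power = 1, m = 2, directed = FALSE)   # new nodes attach preferentially to high-degree nodes
deg <- degree(g)
hubs <- order(deg, decreasing = TRUE)[1:10]                # the ten most connected individuals
deg[hubs]                                                  # hub degrees
sum(deg[hubs]) / sum(deg)                                  # share of all contacts involving the hubs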
furthermore, the preferential attachment model proposed by barabási and albert [4] defined the existence of individuals of having large connectivity does not require random vaccination for preventing epidemics. moreover, if there is an upper limit on the connectivity of individuals, random immunization can be performed to control infection. likewise, the dynamics of infectious diseases has been extensively studied in case of scale-free as well as small-world and random networks. in small-world networks, most of the nodes may not be direct neighbors, but can be reached from all other nodes via less number of hops, that are number of nodes between start and terminating nodes. also, in these networks distance, dist, between two random nodes increases proportionally to the logarithm of the number of nodes, tot, in the network [15] , i.e., dist ∝ log tot (1.1) watts and strogatz [24] identified a class of small-world networks and categorized them as random graphs. these were classified on the basis of two independent features, namely, average shortest path length and clustering coefficient. as per erdős-rényi model, random graphs have a smaller average shortest path length and small clustering coefficient. watts and strogatz on the other hand demonstrated that various real-world networks have a smaller average shortest path length along with high clustering coefficient greater than expected randomly. it has been observed that it is difficult to block and/or terminate an epidemic in scale-free networks with slow tails. it has especially been seen in case the network correlations among infections and individuals are absent. another reason for this effect is the presence of hubs, where infections could be sustained and reduced by target-specific selections [17] . it has been well known that real-world networks ranging from social to computers are scale-free in nature, whose degree distribution follows an asymptotic power-law. these are characterized by degree distribution following a power law, for the number of connections, conn for individuals and η is an exponent. barabási and albert [4] analyzed the topology of a portion of the world-wide-web and identified 'hubs'. the terminals had larger number of connections than others and the whole network followed a power-law distribution. they also found that these networks have heavy-tailed degree distributions and thus termed them as 'scale-free'. likewise, models for epidemic spread in static heavy-tailed networks have illustrated that with a degree distribution having moments resulted in lesser prevalence and/or termination for smaller rates of infection [14] . moreover, beyond a particular threshold, this prevalence turns to non-zero. similarly, it has been seen that for networks following power-law, does not exist and the prevalence is non-zero for any infection rates. due to this reason, epidemics are difficult to handle and terminate in static networks having powerlaw degree distributions. likewise, in various instances, networks are not static but dynamic (i.e., they evolve in time) via some rewiring processes, in which edges are detached and reattached according to some dynamic rule. steady states of rewiring networks have been studied in the past. more often, it has been observed that depending on the average connectivity and rewiring rates, networks reach a scale-free steady state, with an exponent, η , represented using dynamical rates [17] . 
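the non-zero prevalence at any infection rate discussed above is usually traced to the divergence of the second moment of the degree distribution; under the standard heterogeneous mean-field approximation the epidemic threshold behaves like ⟨k⟩/⟨k²⟩. the sketch below (igraph assumed; this estimate is the textbook approximation, not a formula from this chapter) compares that quantity for a homogeneous and a scale-free network:

library(igraph)
threshold_estimate <- function(g) { k <- degree(g); mean(k) / mean(k^2) }   # mean-field threshold <k>/<k^2>
threshold_estimate(sample_gnp(10000, 6 / 10000))                            # homogeneous random network
threshold_estimate(sample_pa(10000, power = 1, m = 3, directed = FALSE))    # scale-free network: much smaller
# the scale-free estimate keeps shrinking as the network grows, consistent with the
# vanishing-threshold behaviour described above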
the study of epidemics has always been of interest in areas where biological applications coincide with social issues. epidemics like influenza, measles and stds can pass through large groups of individuals or populations, persist over long timescales at low levels and experience sudden changes of increasing and decreasing prevalence; in some cases, a single infection outbreak may have significant effects on a complete population group [1] . epidemic spreading can also be studied on complex networks, with vertices representing individuals and links representing interactions among individuals; the spread of disease over a network of individuals is then analogous to the spread of computer viruses over the world-wide-web. the underlying network in epidemic models is usually considered static, while the states of individuals vary between infected and non-infected according to certain probabilistic rules. the evolution of an infected group of individuals over time can be studied by focusing on the average density of infected individuals in the steady state, and the spread and growth of epidemics can also be monitored by studying the architecture of the network of individuals and its statistical properties [2] . one of the essential properties of epidemic spread is its branching pattern, by which healthy individuals are infected over a period of time. this branching pattern of epidemic progression can be classified on the basis of infection initiation, spread and further spread (fig. 1.3) [5] . infection initiation: if an infected individual comes into contact with a group of individuals, the infection is transmitted to each of them with a probability p, independently of one another; if the same individual meets k others while infected, these k individuals form the potentially infected set, and through this random disease transmission some of those directly connected to the initially infected individual become infected. if an infection in a branching process reaches a set of individuals and fails to infect any healthy individual, the infection terminates and there is no further progression. thus there are two possibilities for an infection in a branching process model: either it reaches a site, infects no one further and dies out, or it continues to infect healthy individuals through contact processes. the quantity used to identify whether an infection persists or fades out is the basic reproductive number [6] . this basic reproductive number, τ, is the expected number of newly infected individuals caused by a single already infected individual. in the case where every individual meets k new people and infects each with probability p, the basic reproductive number is τ = p · k. it is an essential quantity, as it helps to identify whether or not an infection can spread through a population of healthy individuals. the concept of τ was first proposed by alfred lotka and applied in the area of epidemiology by macdonald [13] . for simple population models, τ can be identified if information on the death rate, d, and the birth rate, b, is available. τ can also be used to determine whether an infection will terminate (τ < 1) or become an epidemic (τ > 1), but it cannot be used to compare different infections at the same time on the basis of multiple parameters.
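the role of τ = p · k can be illustrated with a tiny branching-process simulation in which each infected individual meets k others and infects each with probability p; values of τ below and above one give fading and growing outbreaks respectively. the python code and its parameter values are illustrative assumptions.

```python
import random

def branching_outbreak(p, k, seed=2, max_generations=15):
    """simple branching model: each infected individual meets k others and
    infects each independently with probability p; returns new infections
    per generation, starting from a single infected individual."""
    rng = random.Random(seed)
    generations = [1]
    while generations[-1] > 0 and len(generations) <= max_generations:
        contacts = generations[-1] * k
        generations.append(sum(1 for _ in range(contacts) if rng.random() < p))
    return generations

# tau = p * k decides whether the infection fades out (tau < 1) or spreads (tau > 1)
for p, k in ((0.05, 10), (0.15, 10)):
    print("tau =", p * k, "->", branching_outbreak(p, k))
```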
several methods, such as analysis of eigenvalues, the jacobian matrix, birth rates, equilibrium states and population statistics, can be used to analyze and handle τ [18] . a number of standard branching models exist for analyzing the progress of an infection in a healthy population or network. the first, the reed-frost model, considers a homogeneous closed set consisting of a total number of individuals, tot. let num designate the number of individuals susceptible to infection at time t = 0 and m_num the number of individuals infected at any time t [19] (the corresponding expression, eq. 1.7, refers to the case of a smaller population). it is assumed that an individual x is infected at time t and that any individual y comes into contact with x with a probability a/num, where a > 0; if y is susceptible to infection it becomes infected at time t + 1, and x is removed from the population (fig. 1.4a). in this figure, x or v_1 (*) represents the infection start site, y (v_3) and v_2 are individuals susceptible to infection, num = 0, tot = 11 and m_num = 1. the second, the 3-clique model, constructs a 3-clique sub-network randomly over an assigned set of tot individuals: each individual/vertex pair (v_i, v_j) is included with probability p_1, along with vertex triples. here g_1 and g_2 are two independent graphs, where g_1 is a bernoulli graph with edge probability p_1 and g_2 contains all possible triangles, each existing independently with probability p_2 (fig. 1.4b). in this figure, three 3-clique sub-networks are shown with tot = 9 and g = g_1 ∪ g_2 ∪ g_3 [21] . the third, the household model, assumes that for a given set of tot individuals or vertices, g_1 is a graph consisting of tot/b disjoint b-cliques, where b ≪ tot, and g_2 is a bernoulli graph with edge probability p_2. the network g is formed as the superposition of the graphs g_1 and g_2, i.e., g = g_1 ∪ g_2. g_1 fragments the population into mutually exclusive groups, whereas g_2 describes the relations among individuals in the population; g_1 alone does not allow any infection spread, as there are no connections between the groups, but when the relationship structure g_2 is added the groups are linked together and the infection can spread along relationship connections (fig. 1.4c). in this figure, tot = 10, the individuals (v_1 to v_10) are linked on the basis of randomly assigned p_2, and b = 4. interconnected network modules are illustrated in fig. 1.5b-d [23] . it is essential to identify the conditions that result in an epidemic spreading in one network while only minimal, isolated infections are present on other network components. depending on the parameters of the individual sub-networks and their internal connectivities, connecting them to one another may have only a marginal effect on the spread of the epidemic, so identifying these conditions is essential for analyzing the spread of the epidemic process. in this case two different regimes of interconnected network modules can be distinguished, namely strongly and weakly coupled. in the strongly coupled regime, all modules are simultaneously either infection free or part of an epidemic, whereas in the weakly coupled regime a new mixed phase exists, in which the infection is epidemic in only one module and not in the others [25] .
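the reed-frost idea can be sketched as a chain-binomial simulation in which a susceptible individual escapes infection in a generation only by avoiding every currently infected individual. the python sketch below is a generic reed-frost-type illustration under the homogeneous closed-group assumption; its notation and parameter values are assumptions of this example and do not reproduce eq. 1.7 exactly.

```python
import random

def reed_frost(tot, initial_infected, p, seed=3):
    """chain-binomial outbreak in a closed, homogeneously mixing group:
    a susceptible escapes infection in a generation only by avoiding
    transmission from every currently infected individual."""
    rng = random.Random(seed)
    susceptible = tot - initial_infected
    infected = initial_infected
    history = [(susceptible, infected)]
    while infected > 0:
        escape = (1.0 - p) ** infected               # prob. of avoiding all infecteds
        new_cases = sum(1 for _ in range(susceptible) if rng.random() > escape)
        susceptible -= new_cases
        infected = new_cases                         # previous generation is removed
        history.append((susceptible, infected))
    return history

print(reed_frost(tot=100, initial_infected=1, p=0.02))
```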
generally, epidemic models consider contact networks to be static, with all links existing throughout the course of the infection. a property of many infections, moreover, is that they are contagious and spread at a rate faster than the rate at which the underlying contacts change. in cases like hiv, however, which spreads through a population over longer time scales, the course of the infection depends heavily on the properties of the contact network: certain individuals may have few contacts at any single point in time, and the identities of those contacts can shift significantly as the infection progresses [25] . thus, for modeling the contact network of such infections, transient contacts are considered which do not last through the whole epidemic but only for a particular amount of time; in such cases it is assumed that the contact links are undirected. not only does the timing of individual contacts affect who has the potential to spread an infection, but the overall timing pattern also influences the severity of the epidemic. similarly, individuals may be involved in concurrent partnerships, with two or more active partnerships that overlap in time; such concurrency causes the infection to circulate vigorously through the network [22] . in the last decade, a considerable amount of work has been done on characterizing, analyzing and understanding the topological properties of networks. it has been established that scale-free behavior is one of the most fundamental concepts for understanding the organization of various real-world networks. this scale-free property has a profound effect on all aspects of dynamic processes on the network, including percolation. for a wide range of scale-free networks no epidemic threshold exists, and infections with a low spreading rate can prevail over the entire population [10] . furthermore, properties of networks such as topological fractality correlate with many aspects of network structure and function. some recent developments have shown that the correlation between the degree and the betweenness centrality of individuals is extremely weak in fractal network models in comparison with non-fractal models [20] . likewise, fractal scale-free networks are disassortative, making such scale-free networks more robust against targeted perturbations of hub nodes. one can also relate fractality to infection dynamics in the case of specifically designed deterministic networks, which allow functional, structural and topological properties to be computed exactly. determination of the topological characteristics of such complex networks has shown that they are scale-free and highly clustered, but do not display small-world features. also, by mapping a standard susceptible-infected-recovered (sir) model to a percolation problem, one finds that a finite epidemic threshold exists; in certain cases the transmission rate needs to exceed a critical value for the infection to spread and prevail, which indicates that fractal networks are robust to infections [11] . meanwhile, scale-free networks exhibit various essential characteristics such as a power-law degree distribution, a large clustering coefficient and the large-world phenomenon, to name a few [16] . network analysis can be used to describe the evolution and spread of information in populations, along with understanding their internal dynamics and architecture.
specifically, importance should be given to the nature of connections and to whether a relationship between individuals x and y implies a relationship between y and x as well. this information can be further utilized for identifying transitivity-based measures of cohesion (fig. 1.6). research on networks also provides quantitative tools for describing and characterizing networks. the degree of a vertex is the number of connections (links) it has; for instance, degree(v_4) = 3 and degree(v_2) = 4 for the undirected graph in fig. 1.6a. the shortest path is the minimum number of links that need to be traversed to travel between two vertices (see fig. 1.6). the diameter of a network is the maximum distance between any two vertices, i.e., the longest of the shortest walks [15] . the eccentricity of a vertex v_i is the greatest geodesic distance between v_i and any other vertex, where the distance between two vertices is the number of edges in a shortest path connecting them; the radius of a network is the minimum eccentricity over all vertices. for instance, in fig. 1.6b the radius of the network is 2. the betweenness centrality g(v_i) is the number of shortest paths from all vertices to all others that pass through vertex v_i (fig. 1.6). similarly, the closeness centrality c(v_i) of a vertex v_i describes the total distance of v_i to all other vertices in the network, i.e., the sum of the shortest paths from v_i to all other vertices (fig. 1.6b). lastly, the stress centrality s(v_i) is the simple accumulation of the number of shortest paths between all vertex pairs that pass through v_i, and is sometimes used interchangeably with betweenness centrality [14] . the 'adjacency matrix' a_{v_i v_j}, describing the connections within a population, is also widely used, and various network quantities can be obtained from it; for instance, the average number of contacts per individual can be read off from it, and powers of the adjacency matrix can be used to calculate measures of transitivity [14] . one of the key prerequisites of network analysis is the initial data collection: for a complete mixing-network analysis of individuals residing in a population, information on every relationship is essential. collecting such data for an entire population is very difficult and raises complicated network evaluation issues, because individuals have many contacts and recall problems are quite probable. moreover, the evaluation of contacts requires information which may not always be readily available. in epidemiological networks, connections are included if they describe relationships capable of permitting the transfer of infection, but in most cases a clear definition of such relationships is absent; various types of relationship carry different risks, and judgments need to be made in order to understand likely transmission routes. one can also consider weighted networks, in which links are not merely present or absent but are given scores or weights according to their strength [9] .
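the measures defined above can be computed directly on any contact graph. the sketch below assumes the python networkx library and uses its built-in karate-club graph purely as a stand-in contact network; it is an illustration of the definitions, not an analysis of the networks discussed in this chapter.

```python
import networkx as nx

G = nx.karate_club_graph()                 # stand-in contact network

print("diameter:", nx.diameter(G), " radius:", nx.radius(G))
print("eccentricity of node 0:", nx.eccentricity(G, 0))

degree = dict(G.degree())
betweenness = nx.betweenness_centrality(G)   # shortest paths through a node
closeness = nx.closeness_centrality(G)
# rank individuals by each measure; the orderings need not agree
for name, score in (("degree", degree),
                    ("betweenness", betweenness),
                    ("closeness", closeness)):
    top = sorted(score, key=score.get, reverse=True)[:5]
    print(name, "top-5 vertices:", top)
```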
furthermore, different infections are passed by different routes, and a mixing network is infection specific: a network used for hiv transmission is different from one used to examine influenza. similarly, for airborne infections like influenza and measles, different networks need to be considered, because differing levels of interaction are required to constitute a contact. these problems with network definition and measurement imply that any mixing network obtained will depend on the assumptions and protocols of the data collection process. three main standard techniques can be employed to gather such information, namely infection searching, complete contact searching and diary-based studies [9] . after an epidemic has spread, major emphasis is laid on determining the source and spread of infection; each infected individual is linked to the individual from whom the infection was acquired and to those to whom it was transmitted. as all connections represent actual transmission events, infection-searching methods do not suffer from problems with the link definition, but interactions that were not responsible for transmission are not captured; the networks observed are therefore tree-like, without loops, cliques or complete sub-graphs [15] . infection searching is a preliminary method for infectious diseases with low prevalence. such searches can also be simulated using mathematical techniques based on differential equations, control theory and so forth, assuming homogeneous mixing of the population, or in a manner in which infected individuals are identified and cured at a rate proportional to the number of neighbours they have, analogous to the infection process itself. this, however, does not allow different infection-searching budgets to be compared, and thus a discrete-event simulation needs to be undertaken. moreover, a number of studies have shown that analyses based on realistic models of disease transmission in contact networks yield significantly better projections of infection spread than projections created using compartmental models [8] . furthermore, depending on the number of contacts of any infected individual, their susceptible neighbours are traced and removed; different infection-searching techniques then yield different numbers of newly infected individuals as the disease spreads. contact searching identifies potential transmission contacts of an initially infected individual by revealing a new set of individuals who are prone to infection and can be the subject of further searching effort. nevertheless, it suffers from network definition issues, is time consuming and depends on complete information about individuals and their relationships. it has been used as a control strategy for stds; its main objective is to identify asymptomatically infected individuals, who are then either treated or quarantined. complete contact searching deals with identifying the susceptible and/or infected contacts of already infected individuals, conducting simulations and/or testing them for the degree of infection spread, treating them, and searching their neighbours for immunization. stds, for instance, have been found to be difficult targets for immunization: they have particularly long asymptomatic periods, during which the virus can replicate and the infection can be transmitted to healthy, closely related neighbours, followed by severe effects that ultimately lead to the death of the affected individual.
likewise, the recognition of these infections as global epidemics has led to the development of treatments that allow them to be managed by suppressing the replication of the infection for as long as possible; complete contact searching thus acts as an essential strategy even when the infection seems incurable [7] . diary-based studies have individuals record contacts as they occur and allow a larger number of individuals to be sampled in detail, shifting from the population-level approach of the other tracing methods to the individual-level scale. this approach, however, suffers from several disadvantages: data collection is at the discretion of the subjects, and it is difficult for researchers to link this information into a comprehensive network, as individuals identify contacts that are not uniquely recorded [3] . diary-based studies also require the individuals to be part of some coherent group residing in small communities, and it is quite probable that such a study results in a large number of disconnected sub-groups, each representing some locally connected set of individuals. diary-based studies can nevertheless be beneficial for identifying infected and susceptible individuals as well as the degree of infectivity, and they provide a comprehensive network for diseases that spread by point-to-point contact, which can be used to investigate the patterns of infection spread. robustness is an essential connectivity property of power-law graphs: they are robust under random attack but vulnerable under targeted attack. recent studies simulating random and targeted attacks have shown that power-law graphs are very robust under random errors but break down when a small fraction of the high-degree vertices or links is removed. some studies have also shown that if vertices are deleted at random then, as long as any positive proportion remains, the graph induced on the remaining vertices has a component of the order of the total number of vertices [15] . it can often be observed that a network of individuals is subject to sudden changes in the internal and/or external environment due to perturbation events. for this reason, a balance needs to be maintained against perturbations while remaining adaptable in the presence of change, a property known as robustness. studies of the topological and functional properties of such networks have achieved some progress, but our understanding of their robustness is still limited. furthermore, the more important a path is, the higher the chance that a backup path exists; even so, removing a link or an individual from a sub-network may block the information flow within that sub-network. the robustness of a model can also be assessed by altering the various parameters and components associated with forming a particular link, and the robustness of a network can be studied with respect to 'resilience', the analysis of the sensitivities of internal constituents under external perturbations that may be random or targeted in nature [18] . basic disease models describe the number of individuals in a population that are susceptible to, infected with and/or recovered from a particular infection. for this purpose, various differential-equation-based models have been used to simulate the course of events during infection spread.
in this scenario, various details of the infection's progression are neglected, along with differences in response between individuals. models of infection can be categorized as sir and susceptible-infected-susceptible (sis) [9] . the sir model considers individuals to have long-lasting immunity and divides the population into those susceptible to the disease (s), infected (i) and recovered (r); the total number of individuals t considered in the population is t = s + i + r. the transition rate from s to i is κ and the recovery rate from i to r is ρ, so the sir model can be represented as ds/dt = −κ s i, di/dt = κ s i − ρ i, dr/dt = ρ i. likewise, the reproductivity θ of an infection can be identified as the average number of secondary instances that a typical single infected instance will cause in a population with no immunity. it determines whether an infection spreads through a population: if θ < 1 the infection terminates in the long run, if θ > 1 the infection spreads, and the larger the value of θ, the more difficult it is to control the epidemic [12] . furthermore, the proportion of the population that needs to be immunized can be calculated as 1 − 1/θ, and the balance point known as endemic stability can be identified; depending on these quantities, immunization strategies can be initiated [6] . although the contact network in a general sir model can be arbitrarily complex, the infection dynamics can still be studied and modeled in a simple fashion. contagion probabilities are set to a uniform value p, and contagiousness has a kind of 'on-off' property: an individual is equally contagious during each of the t_i steps while it has the infection. one can extend this to the idea that contagion is more likely between certain pairs of individuals or vertices by assigning a separate probability p_{v_i,v_j} to each pair of vertices v_i and v_j for which v_i is linked to v_j in a directed contact network. other extensions of the contact model involve separating the i state into a sequence of early, middle and late periods of the infection; this could be used, for instance, to model an infection with a highly contagious incubation period followed by a less contagious period while symptoms are being expressed [16] . in most cases, sir epidemics are thought of as dynamic processes in which the network state evolves step by step over time, capturing the temporal dynamics of the infection as it spreads through a population. the sir model has been found to be suitable for infections that confer lifelong immunity, like measles. in this case a quantity termed the force of infection exists: it is a function of the number of infectious individuals and contains information about the interactions between individuals that lead to the transmission of infection. one can also take a static view of the epidemic by considering the sir model with t_i = 1: at the point in an sir epidemic when a vertex v_i has just become infectious, it has exactly one chance to infect v_j (since t_i = 1), with probability p. one can visualize the outcome of this probabilistic process by assuming that for each edge in the contact network a probability signifying the relationship is identified. the sis model can be represented analogously as ds/dt = −κ s i + ρ i, di/dt = κ s i − ρ i; the removed state is absent in this case. after a vertex has passed through the infectious state, it reverts to the susceptible state and is ready to be infected again; because of this alternation between the s and i states, the model is referred to as the sis model.
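before turning to the step-by-step network mechanics of the sis model, the compartmental equations above can be illustrated numerically. the sketch below is a minimal forward-euler integration of the sir equations in python; the parameter values are assumptions, and the ratio κ/ρ plays the role of θ.

```python
def sir(total, i0, kappa, rho, dt=0.1, steps=2000):
    """forward-euler integration of ds/dt = -kappa*s*i, di/dt = kappa*s*i - rho*i,
    dr/dt = rho*i, with s, i, r expressed as fractions of the population."""
    s, i, r = (total - i0) / total, i0 / total, 0.0
    peak = i
    for _ in range(steps):
        new_inf = kappa * s * i * dt
        new_rec = rho * i * dt
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        peak = max(peak, i)
    return round(s, 3), round(i, 3), round(r, 3), round(peak, 3)

# theta = kappa / rho: the epidemic takes off only if theta > 1
for kappa in (0.2, 0.5):
    print("theta =", kappa / 0.25,
          "final (s, i, r) and peak prevalence:", sir(10000, 10, kappa, 0.25))
```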
the mechanics of the sis model can be described as follows [2] :
1. at the initial stage, some vertices are in the i state and all others are in the s state.
2. each vertex v_i that enters the i state remains infected for a certain number of steps t_i.
3. during each of these t_i steps, v_i has a probability p of passing the infection to each of its susceptible directly linked neighbours.
4. after t_i steps, v_i is no longer infected and returns to the s state.
the sis model is predominantly used for simulating and understanding the progress of stds in which repeat infections occur, like gonorrhoea. certain assumptions regarding random mixing between individuals within each pair of sub-networks are usually made; in this scenario the number of neighbours of each individual is considerably smaller than the total population size, and such models generally avoid full random-mixing assumptions by assigning each individual a specific set of contacts that they can infect. an sis epidemic can run for a long time, as the infection can cycle through the vertices multiple times. if at any time during an sis epidemic all vertices are simultaneously free of the infection, the epidemic terminates forever, because no infected individuals remain to pass the infection to others. if the network is finite, a stage will eventually arise when all attempts at further infection of healthy individuals simultaneously fail for t_i steps in a row. for contact networks whose structure is mathematically tractable, there exists a particular critical value of the contagion probability p at which an sis epidemic undergoes a rapid shift from one that dies out quickly to one that persists for a long time; this critical value depends on the structure of the problem set [1] . the patterns by which epidemics spread through vertex groups are determined by the properties of the pathogen, the length of its infectious period, its severity and the network structure. the paths along which an infection spreads are given by the state of the population, i.e., by the existing direct contacts between individuals or vertices. the functioning of a network system depends on the nature of the interactions between its individuals, essentially because of the combined effect of infection-causing individuals and the topology of the network. to analyze the complexity of epidemics, it is important to understand the underlying principles of their distribution over the history of their existence. in recent years it has become clear that the study of disease dynamics in social networks is relevant to the spread of viruses and the nature of diseases [9] . moreover, the pathogen and the network are closely intertwined: even within the same group of individuals, the contact networks for two different infections have different structures, depending on the respective modes of transmission. for a highly contagious infection involving airborne transmission, the contact network includes a huge number of links, including any pair of individuals that are in contact with one another, whereas for an infection requiring close contact the network is much sparser, with fewer pairs of individuals connected by links [7] .
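a minimal simulation of the four sis rules listed above can be run directly on a contact graph. the python sketch below assumes networkx, uses a watts-strogatz graph purely as an illustrative contact network, and treats all parameter values as assumptions.

```python
import random
import networkx as nx

def sis_network(G, p, t_i, p_init=0.01, steps=200, seed=5):
    """discrete-time sis dynamics following the four rules above: infected
    nodes stay infectious for t_i steps, infect each susceptible neighbour
    with probability p per step, then return to the susceptible state."""
    rng = random.Random(seed)
    remaining = {v: t_i for v in G if rng.random() < p_init}   # node -> steps left
    prevalence = []
    for _ in range(steps):
        newly = set()
        for v in remaining:
            for u in G.neighbors(v):
                if u not in remaining and rng.random() < p:
                    newly.add(u)
        remaining = {v: t - 1 for v, t in remaining.items() if t > 1}
        remaining.update({u: t_i for u in newly})
        prevalence.append(len(remaining) / G.number_of_nodes())
        if not remaining:                 # all vertices susceptible: epidemic over
            break
    return prevalence

G = nx.watts_strogatz_graph(2000, 8, 0.05, seed=5)
print("last prevalence values:",
      [round(x, 3) for x in sis_network(G, p=0.05, t_i=3)[-5:]])
```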
immunization can be viewed as a site percolation problem in which each immunized individual is a site removed from the network; its aim is to shift the percolation threshold in such a way that the number of infected individuals is minimized. the combined model of sir dynamics and immunization can be regarded as a site-bond percolation model, and immunization is considered successful if the infected part of the network stays below a predefined percolation threshold. immunizing randomly selected individuals requires targeting a large fraction, frac, of the entire population; some infections, for instance, require 80-100 % immunization coverage. meanwhile, target-based immunization of the hubs requires global information about the network in question, which is very difficult to obtain and renders the approach impractical in many cases [6] . likewise, social networks possess a broad distribution of the number of links, conn, connecting individuals, and analysis illustrates that a large fraction frac of the individuals needs to be immunized before the integrity of the network is compromised. this is especially true for scale-free networks with p(conn) ≈ conn^(−η), 2 < η < 3, which remain connected even after removal of most of their individuals or vertices; in this scenario a random immunization strategy requires that most of the individuals be immunized before an epidemic is terminated [8] . for various infections it may be difficult to reach the critical level of immunization needed to terminate the infection. each individual that is immunized gains immunity against the infection but also provides protection to other healthy individuals within the population; based on the sir model, achieving even half of the critical immunization level reduces the level of infection in the population by half. a crucial property of immunization is that such strategies are not perfect, and being immunized does not always confer immunity. in this case the critical threshold applies to the portion of the total population that needs to be immunized: if the immunization fails to generate immunity in a portion por of those immunized, then to achieve the required level of immunity one needs to immunize a correspondingly larger portion of the population (here, im denotes the strength of the immunity conferred). thus, if por is large, it is difficult to eliminate the infection using this strategy, or only partial immunity is provided. imperfect immunization may also act in various ways: it may reduce the susceptibility of an individual to a particular infection, reduce subsequent transmission if the individual does become infected, or increase recovery. such immunization strategies require the immunized individuals who nevertheless become infected to be treated as a separate infected group, after which the critical immunization threshold s_i needs to be established from cil, the number of secondary infected individuals generated by an initial infectious individual. s_i needs to be less than one, otherwise it is not possible to eliminate the infection. one should also note that an immunization works equally efficiently whether it reduces transmission or susceptibility or increases the recovery rate. moreover, when the immunization strategy fails to generate any protection in a proportion por of those immunized, the remaining 1 − por are fully protected; in this scenario it may not be possible to eliminate the infection using random immunization, and targeted immunization provides better protection than random immunization [13] .
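the contrast between random immunization and targeted immunization of hubs can be illustrated with a small simulation. the python sketch below assumes networkx, uses a barabási-albert graph as a stand-in scale-free network, collapses the per-step transmission process into a single per-pair probability, and treats the coverage and parameter values as assumptions.

```python
import random
import networkx as nx

def outbreak_size(G, immune, p_contact, rng):
    """single sir-style outbreak from a random susceptible seed; immunized
    nodes are removed from the susceptible pool before the outbreak."""
    susceptible = set(G) - set(immune)
    patient_zero = rng.choice(sorted(susceptible))
    susceptible.discard(patient_zero)
    infected, recovered = {patient_zero}, set()
    while infected:
        newly = {u for v in infected for u in G.neighbors(v)
                 if u in susceptible and rng.random() < p_contact}
        susceptible -= newly
        recovered |= infected
        infected = newly
    return len(recovered)

rng = random.Random(6)
G = nx.barabasi_albert_graph(3000, 3, seed=6)          # stand-in scale-free network
coverage = int(0.05 * G.number_of_nodes())             # limited vaccine supply (5%)
random_targets = rng.sample(list(G), coverage)
hub_targets = sorted(G, key=G.degree, reverse=True)[:coverage]
for name, targets in (("random", random_targets), ("targeted hubs", hub_targets)):
    sizes = [outbreak_size(G, targets, p_contact=0.05, rng=rng) for _ in range(50)]
    print(name, "mean outbreak size:", round(sum(sizes) / len(sizes), 1))
```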
in homogeneous networks the average degree ⟨conn⟩ fluctuates little, and one can assume conn ≈ ⟨conn⟩, i.e., the number of links of each individual is approximately equal to the average degree. however, networks can also be heterogeneous. in a homogeneous network such as a random graph, p(conn) decays exponentially or faster, whereas for heterogeneous networks it decays as a power law for large conn. the effect of heterogeneity on epidemic behaviour has been studied in detail for scale-free networks over many years. these studies are mainly concerned with the stationary limit and the existence of an endemic phase; an essential result of the analysis is the expression for the basic reproductive number, which in this case is τ ∝ ⟨conn²⟩/⟨conn⟩. here τ is proportional to the second moment of the degree, which diverges with increasing network size [15] . it has been noticed that the degree of interconnection between individuals in all forms of networks is quite unprecedented: whereas interconnection increases the spread of information in social networks, it also contributes, in another exhaustively studied area, to the spread of infection through the healthy network. this rapid spreading is due to the ease with which the infection can pass through the network. moreover, the nature of the initial sickness and the time of infection are unavailable most of the time; the only available information relates to the evolution of the sick-reporting process. thus, given complete knowledge of the network topology, the objective is to determine whether the infection is an epidemic or whether individuals have become infected via an independent infection mechanism that is external to the network and not propagated through the connected links. consider a computer network undergoing cascading failures due to worm propagation as well as random failures due to misconfiguration, independent of infected nodes: there are two possible causes of the sickness, namely random damage and infectious spread. in the case of random sickness, the infection spreads randomly and uniformly over the network and the network plays no role in spreading it; in the case of infectious spread, the infection is caused by a contagion that spreads through the network, with individual nodes being infected by direct neighbours with a certain probability [6] . in random damage, each individual becomes infected with an independent probability ψ_1, and at time t each infected individual reports damage with an independent probability ψ_2; thus, on average, a fraction ψ = ψ_1 ψ_2 of the network reports being infected. it is already known that social networks possess a broad distribution of the number of links, k, originating from an individual, and computer networks, both physical and logical, are also known to possess wide, scale-free distributions. studies of percolation on broad-scale networks show that a large fraction, fc, of the individuals needs to be immunized before the integrity of the network is compromised. this is particularly true for scale-free networks, where the percolation threshold tends to 1 and the network remains connected, and thus able to sustain contagion, even after removal of most of its individuals [9] . when the hub individuals are targeted first, removal of just a fraction of them results in the breakdown of the network, which has led to the suggestion of targeted immunization of hubs; to implement this approach, the number of connections of each individual needs to be known. during infection spread, at time 0 a randomly selected individual in the network becomes infected. when a healthy individual becomes infected, a time is set for each outgoing link to an adjacent individual that is not yet infected, with expiration times exponentially distributed with unit average.
upon expiration of a link's time, the corresponding individual becomes infected and in turn begins infecting its neighbours [7] . in general, for an epidemic to occur in a susceptible population the basic reproductive rate must be greater than 1. in many circumstances, however, not all contacts will be susceptible to infection: some contacts remain immune, due to prior infection that has conferred life-long immunity or due to previous immunization. therefore not all individuals become infected and the average number of secondary infections decreases. the epidemic threshold in this case is the number of susceptible individuals within a population that is required for an epidemic to occur, and herd immunity is the proportion of the population immune to a particular infection. if this proportion is achieved through immunization, then each case leads to fewer than one new case on average and the infection cannot remain established within the population [6] . one of the simplest immunization procedures consists of the random introduction of immune individuals into the population in order to achieve a uniform immunization density. in this case, for a fixed spreading rate ξ, the relevant control parameter is the density of immune individuals present in the network, the immunity imm. at the mean-field level, the presence of uniform immunity reduces ξ by a factor 1 − imm, i.e., the probability of encountering and infecting a susceptible, non-immune individual becomes ξ(1 − imm). for homogeneous networks one observes that, for a constant ξ, the stationary prevalence is zero for imm > imm_c and grows with (imm_c − imm) for imm ≤ imm_c, where imm_c is the critical immunization value above which the density of infected individuals in the stationary state is null; it depends on ξ as imm_c = 1 − ξ_c/ξ, with ξ_c the epidemic threshold of the network without immunization. thus, for a uniform immunization level larger than imm_c, the network is completely protected and no large epidemic outbreaks are possible. on the contrary, uniform immunization strategies on scale-free heterogeneous networks are totally ineffective: the presence of uniform immunization only locally depresses the infection's prevalence for any value of ξ, and it is difficult to identify any critical fraction of immunized individuals that ensures the eradication of the infection [2] . cascading, or epidemic, processes are those in which the actions, infections or failures of certain individuals increase the susceptibility of others; this results in the successive spread of infections from a small set of initially infected individuals to a larger set. initially developed as a way to study human disease propagation, cascades are useful models in a wide range of applications. the vast majority of work on cascading processes has focused on understanding how the graph structure of the network affects the spread of cascades; one can also focus on several critical issues for understanding cascading features in networks, for which studying the architecture of the network is crucial [5] . the standard independent cascade epidemic model assumes that the network is a directed graph g = (v, e); for every directed edge between v_i and v_j, we say v_i is a parent and v_j is a child of the corresponding other vertex. a parent may infect a child along an edge, but the reverse cannot happen. let v denote the set of parents of each vertex v_i; for convenience, v_i ∈ v is included. the epidemic proceeds in discrete time, with all vertices initially in the susceptible state. at time 0, each vertex independently becomes active with probability p_init.
this set of initially active vertices is called the 'seeds'. in each time step, the active vertices probabilistically infect their susceptible children: if vertex v_i is active at time t, it infects each susceptible child v_j with probability p_{v_i v_j}, independently. correspondingly, a vertex v_j that is susceptible at time t becomes active in the next time step, i.e., at time t + 1, if any one of its parents infects it. finally, a vertex remains active for only one time slot, after which it becomes inactive, does not spread the infection further and cannot be infected again [5] . this is thus a kind of sir epidemic in which some vertices remain forever susceptible because the epidemic never reaches them, while others transition from susceptible to active (for one time step) to inactive. in this chapter we discussed some critical issues regarding epidemics and their outbursts in static as well as dynamic network structures. we mainly focused on the sir and sis models, on key strategies for assessing the damage caused in networks, and on various modeling techniques for studying cascading failures. epidemics pass through populations and can persist over long time periods, so efficient modeling of the underlying network plays a crucial role in understanding the spread and prevention of an epidemic. social, biological and communication systems can be described as complex networks whose degree distribution follows a power law, p(conn) ≈ conn^(−η), where conn is the number of connections of an individual; such networks are scale-free (sf) networks. we also discussed issues of epidemic spreading in sf networks characterized by complex topologies, with basic epidemic models describing the proportion of individuals susceptible to, infected with and recovered from a particular disease. likewise, we explained the significance of the basic reproduction rate of an infection, which can be identified as the average number of secondary instances that a typical single infected instance will cause in a population with no immunity. we also explained how determining the complete nature of a network requires knowledge of every individual in a population and their relationships, and how the problems of network definition and measurement depend on the assumptions of the data collection process. furthermore, we illustrated the importance of invasion resistance methods, in which temporary immunity generates oscillations in localized parts of the network, with certain patches following large numbers of infections in concentrated areas. finally, we explained the significance of the two types of damage: random damage, which spreads randomly and uniformly over the network so that the network plays no role in spreading it, and infectious spread, in which the damage spreads through the network, with one node infecting others with some probability.
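as a concrete sketch of the independent cascade model summarized above, the python snippet below (an assumption of this edit, using networkx and a uniform edge probability in place of the per-edge probabilities p_{v_i v_j}) runs one cascade from a random set of seeds.

```python
import random
import networkx as nx

def independent_cascade(G, p_edge, p_init, seed=7):
    """discrete-time independent cascade on a directed graph: seeds are active
    at t = 0, an active parent infects each susceptible child once with
    probability p_edge, and active vertices become inactive after one step."""
    rng = random.Random(seed)
    susceptible = set(G)
    active = {v for v in G if rng.random() < p_init}
    susceptible -= active
    inactive = set()
    while active:
        newly = {child for parent in active for child in G.successors(parent)
                 if child in susceptible and rng.random() < p_edge}
        susceptible -= newly
        inactive |= active
        active = newly
    return len(inactive), len(susceptible)

G = nx.gnp_random_graph(2000, 0.002, seed=7, directed=True)
ever_active, never_reached = independent_cascade(G, p_edge=0.3, p_init=0.005)
print("ever active:", ever_active, " never reached:", never_reached)
```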
references:
1. infectious diseases of humans: dynamics and control
2. the mathematical theory of infectious diseases and its applications
3. a forest-fire model and some thoughts on turbulence
4. emergence of scaling in random networks
5. mathematical models used in the study of infectious diseases
6. spread of epidemic disease on networks
7. networks and epidemic models
8. network-based analysis of stochastic sir epidemic models with random and proportionate mixing
9. elements of mathematical ecology
10. intelligent information and database systems
11. propagation phenomenon in complex networks: theory and practice
12. relation between birth rates and death rates
13. the analysis of malaria epidemics
14. graph theory and networks in biology
15. mathematical biology
16. spread of epidemic disease on networks
17. the use of mathematical models in the epidemiology study of infectious diseases and in the design of mass vaccination programmes
18. forest-fire as a model for the dynamics of disease epidemics
19. on the critical behaviour of simple epidemics
20. sensitivity estimates for nonlinear mathematical models
21. ensemble modeling of metabolic networks
22. on analytical approaches to epidemics on networks
23. computational modeling in systems biology
24. collective dynamics of 'small-world' networks
25. unifying wildfire models from ecology and statistical physics

key: cord-000196-lkoyrv3s authors: salathé, marcel; jones, james h. title: dynamics and control of diseases in networks with community structure date: 2010-04-08 journal: plos comput biol doi: 10.1371/journal.pcbi.1000736 sha: doc_id: 196 cord_uid: lkoyrv3s

the dynamics of infectious diseases spread via direct person-to-person transmission (such as influenza, smallpox, hiv/aids, etc.) depends on the underlying host contact network. human contact networks exhibit strong community structure. understanding how such community structure affects epidemics may provide insights for preventing the spread of disease between communities by changing the structure of the contact network through pharmaceutical or non-pharmaceutical interventions. we use empirical and simulated networks to investigate the spread of disease in networks with community structure. we find that community structure has a major impact on disease dynamics, and we show that in networks with strong community structure, immunization interventions targeted at individuals bridging communities are more effective than those simply targeting highly connected individuals. because the structure of relevant contact networks is generally not known, and vaccine supply is often limited, there is great need for efficient vaccination algorithms that do not require full knowledge of the network. we developed an algorithm that acts only on locally available network information and is able to quickly identify targets for successful immunization intervention. the algorithm generally outperforms existing algorithms when vaccine supply is limited, particularly in networks with strong community structure. understanding the spread of infectious diseases and designing optimal control strategies is a major goal of public health. social networks show marked patterns of community structure, and our results, based on empirical and simulated data, demonstrate that community structure strongly affects disease dynamics. these results have implications for the design of control strategies. mitigating or preventing the spread of infectious diseases is the ultimate goal of infectious disease epidemiology, and understanding the dynamics of epidemics is an important tool to achieve this goal.
a rich body of research [1, 2, 3] has provided major insights into the processes that drive epidemics, and has been instrumental in developing strategies for control and eradication. the structure of contact networks is crucial in explaining epidemiological patterns seen in the spread of directly transmissible diseases such as hiv/aids [1, 4, 5] , sars [6, 7] , influenza [8, 9, 10, 11] etc. for example, the basic reproductive number r0, a quantity central to developing intervention measures or immunization programs, depends crucially on the variance of the distribution of contacts [1, 12, 13] , known as the network degree distribution. contact networks with fat-tailed degree distributions, for example, where a few individuals have an extraordinarily large number of contacts, result in a higher r0 than one would expect from contact networks with a uniform degree distribution, and the existence of highly connected individuals makes them an ideal target for control measures [7, 14] . while degree distributions have been studied extensively to understand their effect on epidemic dynamics, the community structure of networks has generally been ignored. despite the demonstration that social networks show significant community structure [15, 16, 17, 18] , and that social processes such as homophily and transitivity result in highly clustered and modular networks [19] , the effect of such microstructures on epidemic dynamics has only recently started to be investigated. most initial work has focused on the effect of small cycles, predominantly in the context of clustering coefficients (i.e. the fraction of closed triplets in a contact network) [20, 21, 22, 23, 24] . in this article, we aim to understand how community structure affects epidemic dynamics and control of infectious disease. community structure exists when connections between members of a group of nodes are denser than connections between members of different groups of nodes [15] . the terminology is relatively new in network analysis, and recent algorithm development has greatly expanded our ability to detect sub-structuring in networks. while there has been a recent explosion in interest and methodological development, the concept is an old one in the study of social networks, where it is typically referred to as 'cohesive subgroups': groups of vertices in a graph that share connections with each other at a higher rate than with vertices outside the group [18] . empirical data on social structure suggests that community structuring is extensive in epidemiological contacts [25, 26, 27] relevant for infectious diseases transmitted by the respiratory or close-contact route (e.g. influenza-like illnesses), and in social groups more generally [16, 17, 28, 29, 30] . similarly, the results of epidemic models of directly transmitted infections such as influenza are most consistent with the existence of such structure [8, 9, 11, 31, 32, 33] . using both simulated and empirical social networks, we show how community structure affects the spread of diseases in networks, and specifically that these effects cannot be accounted for by the degree distribution alone. the main goal of this study is to demonstrate how community structure affects epidemic dynamics, and what strategies are best applied to control epidemics in networks with community structure. we generate networks computationally with community structure by creating small subnetworks of locally dense communities, which are then randomly connected to one another.
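a minimal way to build such networks computationally is sketched below: locally dense erdős-rényi blocks are generated and then joined by a small number of random between-community links, and the resulting modularity q is computed. this is an illustrative construction assuming the python networkx library; it is not the exact generator used in the study, and all parameter values are assumptions.

```python
import random
import networkx as nx
from networkx.algorithms import community

def modular_network(n_comms, comm_size, p_in, inter_links, seed=8):
    """build locally dense communities (erdos-renyi blocks) and add a small
    number of random links between communities."""
    rng = random.Random(seed)
    G = nx.Graph()
    blocks = []
    for c in range(n_comms):
        block = nx.erdos_renyi_graph(comm_size, p_in, seed=seed + c)
        mapping = {v: c * comm_size + v for v in block}
        G.update(nx.relabel_nodes(block, mapping))
        blocks.append(set(mapping.values()))
    for _ in range(inter_links):                     # sparse between-community links
        a, b = rng.sample(range(n_comms), 2)
        G.add_edge(rng.choice(sorted(blocks[a])), rng.choice(sorted(blocks[b])))
    return G, blocks

G, blocks = modular_network(n_comms=20, comm_size=50, p_in=0.2, inter_links=60)
print("nodes:", G.number_of_nodes(), "edges:", G.number_of_edges())
print("modularity q of the planted partition:", round(community.modularity(G, blocks), 2))
```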
a particular feature of such networks is that the variance of their degree distribution is relatively low, and thus the spread of a disease is only marginally affected by it [34] . running standard susceptible-infected-resistant (sir) epidemic simulations (see methods) on these networks, we find that the average epidemic size, epidemic duration and the peak prevalence of the epidemic are strongly affected by a change in community structure connectivity that is independent of the overall degree distribution of the full network (figure 1). note that the value range of q shown in figure 1 is in agreement with the value range of q found in the empirical networks used further below, and that lower values of q do not affect the results qualitatively (see suppl. mat. figure s1). epidemics in populations with community structure show a distinct dynamical pattern depending on the extent of community structure. in networks with strong community structure, an infected individual is more likely to infect members of the same community than members outside of the community. thus, in a network with strong community structure, local outbreaks may die out before spreading to other communities, or they may spread through various communities in an almost serial fashion, and large epidemics in populations with strong community structure may therefore last for a long time. correspondingly, the incidence rate can be very low, and the number of generations of infection transmission can be very high, compared to the explosive epidemics in populations with less community structure (figures 2a and 2b). on average, epidemics in networks with strong community structure exhibit greater variance in final size (figures 2c and 2d), a greater number of small, local outbreaks that do not develop into a full epidemic, and a higher variance in the duration of an epidemic. (author summary: understanding the spread of infectious diseases in populations is key to controlling them. computational simulations of epidemics provide a valuable tool for the study of the dynamics of epidemics. in such simulations, populations are represented by networks, where hosts and their interactions among each other are represented by nodes and edges. in the past few years, it has become clear that many human social networks have a very remarkable property: they all exhibit strong community structure. a network with strong community structure consists of smaller sub-networks (the communities) that have many connections within them, but only few between them. here we use both data from social networking websites and computer generated networks to study the effect of community structure on epidemic spread. we find that community structure not only affects the dynamics of epidemics in networks, but that it also has implications for how networks can be protected from large-scale epidemics.) in order to halt or mitigate an epidemic, targeted immunization interventions or social distancing interventions aim to change the structure of the network of susceptible individuals in such a way as to make it harder for a pathogen to spread [35] . in practice, the number of people to be removed from the susceptible class is often constrained for a number of reasons (e.g., due to limited vaccine supply or ethical concerns of social distancing measures). from a network perspective, targeted immunization methods translate into identifying which nodes should be removed from a network, a problem that has attracted considerable attention (see for example [36] and references therein). targeting highly connected individuals for immunization has been shown to be an effective strategy for epidemic control [7, 14] . however, in networks with strong community structure, this strategy may not be the most effective: some individuals connect to multiple communities (so-called community bridges [37] ) and may thus be more important in spreading the disease than individuals with fewer inter-community connections, but this importance is not necessarily reflected in the degree. identification of community bridges can be achieved using
in the past few years, it has become clear that many human social networks have a very remarkable property: they all exhibit strong community structure. a network with strong community structure consists of smaller sub-networks (the communities) that have many connections within them, but only few between them. here we use both data from social networking websites and computer generated networks to study the effect of community structure on epidemic spread. we find that community structure not only affects the dynamics of epidemics in networks, but that it also has implications for how networks can be protected from large-scale epidemics. the betweenness centrality measure [38] , defined as the fraction of shortest paths a node falls on. while degree and betweenness centrality are often strongly positively correlated, the correlation between degree and betweenness centrality becomes weaker as community structure becomes stronger ( figure 3 ). thus, in networks with community structure, focusing on the degree alone carries the risk of missing some of the community bridges that are not highly connected. indeed, at a low vaccination coverage, an immunization strategy based on betweenness centrality results in fewer infected cases than an immunization strategy based on degree as the magnitude of community structure increases ( figure 4a ). this observation is critical because the potential vaccination coverage for an emerging infection will typically be very low. a third measure, random walk centrality, identifies target nodes by a random walk, counting how often a node is traversed by a random walk between two other nodes [39] . the random walk centrality measure considers not only the shortest paths between pairs of nodes, but all paths between pairs of nodes, while still giving shorter paths more weight. while infections are most likely to spread along the shortest paths between any two nodes, the cumulative contribution of other paths can still be important [40] : immunization strategies based on random walk centrality result in the lowest number of infected cases at low vaccination coverage (figure 4b and 4c ). to test the efficiency of targeted immunization strategies on real networks, we used interaction data of individuals at five different universities in the us taken from a social network website [41] , and obtained the contact network relevant for directly transmissible diseases (see methods). we find again that the overall most successful targeted immunization strategy is the one that identifies the targets based on random walk centrality. limited immunization based on random walk centrality significantly outperforms immunization based on degree especially when vaccination coverage is low (figure 5a ). in practice, identifying immunization targets may be impossible using such algorithms, because the structure of the contact network relevant for the spread of a directly transmissible disease is generally not known. thus, algorithms that are agnostic about the full network structure are necessary to identify target individuals. the only algorithm we are aware of that is completely agnostic about the network structure network structure identifies target nodes by picking a random contact of a randomly chosen individual [42] . once such an acquaintance has been picked n times, it is immunized. the acquaintance method has been shown to be able to identify some of the highly connected individuals, and thus approximates an immunization strategy that targets highly connected individuals. 
we propose an alternative algorithm (the so-called community bridge finder (cbf) algorithm, described in detail in the methods) that aims to identify community bridges connecting two groups of clustered nodes. briefly, starting from a random node, the algorithm follows a random path on the contact network until it arrives at a node that does not connect back to more than one of the previously visited nodes on the random walk. the basic goal of the cbf algorithm is to find nodes that connect to multiple communities; it does so based on the notion that the first node that does not connect back to previously visited nodes of the current random walk is likely to be part of a different community. on all empirical and computationally generated networks tested, this algorithm performed mostly better, often equally well, and rarely worse than the alternative algorithm. it is important to note a crucial difference between algorithms such as cbf (henceforth called stochastic algorithms) and algorithms such as those that calculate, for example, the betweenness centrality of nodes (henceforth called deterministic algorithms). a deterministic algorithm always needs complete information about each node (i.e. either the number or the identity of all connected nodes for each node in the network). a comparison between algorithms is therefore of limited use if they are not of the same type, as they have to work with different inputs. clearly, a deterministic algorithm with information on the full network structure as input should outperform a stochastic algorithm that is agnostic about the full network structure. thus, we will restrict our comparison of cbf to the acquaintance method, since this is the only stochastic algorithm we are aware of that takes as input the same limited amount of local information. in the computationally generated networks, cbf outperformed the acquaintance method in large areas of the parameter space (figure 4d). it may seem unintuitive at first that the acquaintance method outperforms cbf at very high values of modularity, but one should keep in mind that epidemic sizes are very small in those extremely modular networks (see figure 1a) because local outbreaks only rarely jump the community borders. if outbreaks are mostly restricted to single communities, then cbf is not the optimal strategy because immunizing community bridges is useless; the acquaintance method may at least find some well connected nodes in each community and will thus perform slightly better in this extreme parameter space. in empirical networks, cbf did particularly well on the network with the strongest community structure (oklahoma), especially in comparison to the similarly effective acquaintance method with n = 2 (figure 5c).
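a simplified sketch of the cbf idea is given below, assuming python with networkx: it follows a random walk and returns the first node (after a short burn-in) that connects back to at most one of the previously visited nodes. the published algorithm includes an additional re-check step and other details not reproduced here, and the test graph and parameters are illustrative assumptions.

```python
import random
import networkx as nx

def community_bridge_finder(G, rng, min_steps=3, max_steps=None):
    """simplified cbf sketch: follow a random walk and return the first node
    (after a short burn-in) that connects back to at most one of the
    previously visited nodes, taking it as a likely community bridge."""
    max_steps = max_steps or 10 * G.number_of_nodes()
    current = rng.choice(list(G))
    visited = [current]
    for step in range(max_steps):
        nxt = rng.choice(list(G.neighbors(current)))
        if nxt not in visited:
            back_links = sum(1 for v in visited if G.has_edge(nxt, v))
            visited.append(nxt)
            if step >= min_steps and back_links <= 1:
                return nxt
        current = nxt
    return current                                   # fallback if no bridge found

rng = random.Random(9)
G = nx.planted_partition_graph(10, 40, 0.3, 0.002, seed=9)   # 10 dense communities
targets = {community_bridge_finder(G, rng) for _ in range(30)}
print("candidate community bridges:", sorted(targets))
```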
as immunization strategies should be deployed as fast as possible, the speed at which a certain fraction of the network can be immunized is an additional important aspect. we measured the speed of the algorithm as the number of nodes that the algorithm had to visit in order to achieve a certain vaccination coverage, and find that the cbf algorithm is faster than the similarly effective acquaintance method with n = 2 at vaccination coverages below 30% (see figure 6).

figure 4. assessing the efficacy of targeted immunization strategies based on deterministic and stochastic algorithms in the computationally generated networks. color code denotes the difference in the average final size s_m of disease outbreaks in networks that were immunized before the outbreak using method m. the top panel (a) shows the difference between the degree method and the betweenness centrality method, i.e. s_degree − s_betweenness. a positive difference (colored red to light gray) indicates that the betweenness centrality method resulted in smaller final sizes than the degree method. a negative difference (colored blue to black) indicates that the betweenness centrality method resulted in bigger final sizes than the degree method. if the difference is not bigger than 0.1% of the total population size, then no color is shown (white). panel (a) shows that the betweenness centrality method is more effective than the degree based method in networks with strong community structure (q is high). (b) and (c): like (a), but showing s_degree − s_randomwalk (in (b)) and s_betweenness − s_randomwalk (in (c)). panels (b) and (c) show that the random walk method is the most effective method overall. panel (d) shows that the community bridge finder (cbf) method generally outperforms the acquaintance method (with n = 1) except when community structure is very strong (see main text). final epidemic sizes were obtained by running 2000 sir simulations per network, vaccination coverage and immunization method. doi:10.1371/journal.pcbi.1000736.g004

a great number of infectious diseases of humans spread directly from one person to another person, and early work on the spread of such diseases has been based on the assumption that every infected individual is equally likely to transmit the disease to any susceptible individual in a population. one of the most important consequences of incorporating network structure into epidemic models was the demonstration that heterogeneity in the number of contacts (degree) can strongly affect how r0 is calculated [12, 13, 34]. thus, the same disease can exhibit markedly different epidemic patterns simply due to differences in the degree distribution. our results extend this finding and show that even in networks with the same degree distribution, fundamentally different epidemic dynamics are expected to be observed due to different levels of community structure. this finding is important for various reasons: first, community structure has been shown to be a crucial feature of social networks [15, 16, 17, 19], and its effect on disease spread is thus relevant to infectious disease dynamics. furthermore, it corroborates earlier suggestions that community structure affects the spread of disease, and is the first to clearly isolate this effect from effects due to variance in the degree distribution [43]. second, and consequently, data on the degree distribution of contact networks will not be sufficient to predict epidemic dynamics. third, the design of control strategies benefits from taking community structure into account. an important caveat to mention is that community structure in the sense used throughout this paper (i.e. measured as modularity q) does not take into account explicitly the extent to which communities overlap. such overlap is likely to play an important role in infectious disease dynamics, because people are members of multiple, potentially overlapping communities (households, schools, workplaces etc.). a strong overlap would likely be reflected in lower overall values for q; however, the exact effect of community overlap on infectious disease dynamics remains to be investigated.
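the strength of community structure is summarised throughout by the modularity q of a partition into communities; as a rough, hedged illustration of how such a value is obtained for a given network (not the study's own pipeline), assuming python with networkx and a toy graph with planted communities:

import networkx as nx
from networkx.algorithms import community

# toy network with planted communities (illustrative parameters only)
G = nx.planted_partition_graph(4, 25, p_in=0.2, p_out=0.01, seed=1)

# detect a partition, then score how modular that partition is (q)
parts = community.greedy_modularity_communities(G)
q = community.modularity(G, parts)
print(f"{len(parts)} communities detected, modularity q = {q:.2f}")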
identifying important nodes to affect diffusion on networks is a key question in network theory that pertains to a wide range of fields and is not limited to infectious disease dynamics only. there are however two major issues associated with this problem: (i) the structure of networks is often not known, and (ii) many networks are too large for measures such as centrality to be computed efficiently. stochastic algorithms like the proposed cbf algorithm or the acquaintance method address both problems at once. to what extent targeted immunization strategies can be implemented in an infectious disease/public health setting based on practical and ethical considerations remains an open question. this is true not only for the strategy based on the cbf algorithm, but for most strategies that are based on network properties. as mentioned above, the contact networks relevant for the spread of infectious diseases are generally not known. stochastic algorithms such as the cbf or the acquaintance method are at least in principle applicable when data on network structure is lacking. community structure in host networks is not limited to human networks: animal populations are often divided into subpopulations, connected by limited migration only [44, 45]. targeted immunization of individuals connecting subpopulations has been shown to be an effective low-coverage immunization strategy for the conservation of endangered species [46]. under the assumption of homogeneous mixing, the elimination of a disease requires an immunization coverage of at least 1 − 1/r0 [1], but such coverage is often difficult or even impossible to achieve due to limited vaccine supply, logistical challenges or ethical concerns. in the case of wild animals, high vaccination coverage is also problematic as vaccination interventions can be associated with substantial risks. little is known about the contact network structure in humans, let alone in wildlife, and progress should therefore be made on the development of immunization strategies that can deal with the absence of such data. stochastic algorithms such as the acquaintance method and the cbf method are first important steps in addressing the problem, but the large difference in efficacy between stochastic and deterministic algorithms demonstrates that there is still a long way to go. to investigate the spread of an infectious disease on a contact network, we use the following methodology: individuals in a population are represented as nodes in a network, and the edges between the nodes represent the contacts along which an infection can spread. contact networks are abstracted by undirected, unweighted graphs (i.e. all contacts are reciprocal, and all contacts transmit an infection with the same probability). edges always link between two distinct nodes (i.e. no self-loops), and there must be maximally one edge between any single pair of nodes (i.e. no parallel edges). each node can be in one of three possible states: (s)usceptible, (i)nfected, or (r)esistant/immune (as in standard sir models). initially, all nodes are susceptible. simulations with immunization strategies implement those strategies before the first infection occurs. targeted nodes are chosen according to a given immunization algorithm (see below) until a desired immunization coverage of the population is achieved, and then their state is set to resistant. after this initial set-up, a random susceptible node is chosen as patient zero, and its state is set to infected.
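a minimal sketch of this simulation set-up follows, assuming python with networkx and the per-day update rule spelled out in the next paragraph (infection probability 1 − exp(−b·i), recovery probability c); the update order within a day is an assumption, and this is an illustration rather than the study's own code:

import math
import random

def run_sir(G, immunized, b, c=0.2, rng=random):
    """one sir outbreak (sketch): immunize targets first, seed one random
    susceptible patient zero, then iterate daily infection/recovery steps."""
    state = {v: "S" for v in G}
    for v in immunized:
        state[v] = "R"                       # vaccinated before the outbreak
    patient_zero = rng.choice([v for v in G if state[v] == "S"])
    state[patient_zero] = "I"
    while any(s == "I" for s in state.values()):
        infected_today = [v for v in G if state[v] == "I"]
        # infection: a susceptible node with i infected neighbours becomes
        # infected with probability 1 - exp(-b * i)
        for v in G:
            if state[v] == "S":
                i = sum(1 for u in G.neighbors(v) if state[u] == "I")
                if i > 0 and rng.random() < 1 - math.exp(-b * i):
                    state[v] = "I_new"       # infectious from the next day
        # recovery: nodes infected at the start of the day recover w.p. c
        for v in infected_today:
            if rng.random() < c:
                state[v] = "R"
        for v in G:
            if state[v] == "I_new":
                state[v] = "I"
    # final outbreak size: resistant nodes minus those vaccinated beforehand
    return sum(1 for s in state.values() if s == "R") - len(immunized)

combined with the target-selection sketches above, averaging the returned final sizes over many runs mirrors the kind of comparison summarised in figures 4 and 5.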
then, during a number of time steps, the initial infection can spread through the network, and the simulation is halted once there are no further infected nodes. at each time step (the unit of time we use is one day, i.e. a time step is one day), a susceptible node can get infected with probability 1 − exp(−b·i), where b is the transmission rate from an infected to a susceptible node, and i is the number of infected neighboring nodes. at each time step, infected nodes recover at rate c, i.e. the probability of recovery of an infected node per time step is c (unless noted otherwise, we use c = 0.2). if recovery occurs, the state of the recovered node is toggled from infected to resistant. unless mentioned otherwise, the transmission rate b is chosen such that r0 = (b/c)·d ≈ 3, where d is the mean network degree, i.e. the average number of contacts per node. for the networks used here, this approximation is in line with the result from static network theory [47], r0 ≈ t(⟨k²⟩/⟨k⟩ − 1), where ⟨k⟩ and ⟨k²⟩ are the mean degree and mean square degree, respectively, and where t is the average probability of disease transmission from a node to a neighboring node.

figure 5. assessing the efficacy of targeted immunization strategies in empirical networks based on deterministic and stochastic algorithms. the bars show the difference in the average final size s_m of disease outbreaks (n cases) in networks that were immunized before the outbreak using method m. the left panels show the difference between the degree method and the random walk centrality method, i.e. s_degree − s_randomwalk. if the difference is positive (red bars), then the random walk centrality method resulted in smaller final sizes than the degree method. a negative value (black bars) means that the opposite is true. shaded bars show non-significant differences (assessed at the 5% level using the mann-whitney test). the middle and right panels are generated using the same methodology, but measuring the difference between the acquaintance method (with n = 1 in the middle column and n = 2 in the right column, see methods) and the community bridge finder (cbf) method, i.e. s_acquaintance1 − s_cbf and s_acquaintance2 − s_cbf. again, positive red bars mean that the cbf method results in smaller final sizes, i.e. prevents more cases, than the acquaintance methods, whereas negative black bars mean the opposite. final epidemic sizes were obtained by running 2000 sir simulations per network, vaccination coverage and immunization method. doi:10.1371/journal.pcbi.1000736.g005

we compute them through the traces of ⟨a⟩ (tr[⟨a⟩] = α + σ and tr[⟨a⟩²] = α² + σ²) to obtain the expression of ρ[⟨a⟩] for eq. (13). the epidemic threshold then follows, yielding the same result as ref. [35], provided here that the transmission rate λ is multiplied by δ to make it a probability, as in ref. [35]. finally, we verify that for the trivial example of static networks, with an adjacency matrix constant in time, eq. (13) reduces immediately to the result of refs. [17, 18]. we now validate our analytical prediction against numerical simulations on two synthetic models. the first is the activity-driven model with activation rate a_i = a, m = 1, and average inter-activation time τ = 1/a = 1, fixed as the time unit of the simulations. the transmission parameter is the probability upon contact λδ, and the model is implemented in continuous time.
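a minimal sketch of such an activity-driven contact sequence with uniform activity a and m = 1, written in python purely for illustration; the representation of contacts as instantaneous activation events (with transmission then applied per contact with probability λδ, as stated above) is an assumption:

import random

def activity_driven_contacts(n, a=1.0, t_max=100.0, rng=random):
    """generate a continuous-time contact list for the activity-driven model
    with uniform activity a and m = 1 (sketch).

    each node activates at exponentially distributed intervals with mean 1/a
    and links to one uniformly chosen other node; contacts are returned as
    (time, i, j) activation events sorted in time."""
    events = []
    for i in range(n):
        t = rng.expovariate(a)             # first activation of node i
        while t < t_max:
            j = rng.randrange(n - 1)       # pick a partner uniformly, j != i
            if j >= i:
                j += 1
            events.append((t, i, j))
            t += rng.expovariate(a)        # next activation of node i
    events.sort()
    return events

# example: 200 nodes with mean inter-activation time 1/a = 1 (the time unit)
contacts = activity_driven_contacts(200, a=1.0, t_max=50.0)
print(len(contacts), "contact events; first:", contacts[0])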
the second model is based on a bursty inter-activation time distribution p(Δt) ∼ (ε + Δt)^(−β) [31], with β = 2.5 and ε tuned to obtain the same average inter-activation time as before, τ = 1. we simulate a sis spreading process on the two networks with four different recovery rates, μ ∈ {10⁻³, 10⁻², 10⁻¹, 1}, i.e., ranging from a value whose associated recovery time scale (1/μ) is 3 orders of magnitude larger than the time scale τ of the networks (slow disease), to a value for which it is equal to τ (fast disease). we compute the average simulated endemic prevalence for specific values of λ, μ using the quasistationary method [61] and compare the threshold computed with eq. (13) with the simulated critical transition from extinction to endemic state. as expected, we find eq. (13) to hold for the activity-driven model at all time scales of the epidemic process (fig. 2), as the network lacks temporal correlations. the agreement with the transition observed in the bursty model, however, is recovered only for slow diseases, as at those time scales the network is found in the annealed regime. when network and disease time scales become comparable, the weakly commuting approximation of eq. (13) no longer holds, as burstiness results in dynamical correlations in the network evolution [31]. our theory offers a novel mathematical framework that rigorously connects discrete-time and continuous-time critical behaviors of spreading processes on temporal networks. it uncovers a coherent transition from an adjacency tensor to a tensor field resulting from a limit performed on the structural representation of the network and contagion process. we derive an analytic expression of the infection propagator in the general case that assumes a closed-form solution in the introduced class of weakly commuting networks. this allows us to provide a rigorous mathematical interpretation of annealed networks, encompassing the different definitions historically introduced in the literature. this work also provides the basis for important theoretical extensions, assessing, for example, the impact of bursty activation patterns or of the adaptive dynamics in response to the circulating epidemic. finally, our approach offers a tool for applicative studies on the estimation of the vulnerability of temporal networks to contagion processes in many real-world scenarios, for which the discrete-time assumption would be inadequate. we thank luca ferreri and mason porter for fruitful discussions. this work is partially sponsored by the ec-health contract no. 278433 (predemics) and the anr contract no. anr-12-monu-0018 (harmsflu) to v. c., and the ec-anihwa contract no. anr-13-anwa-0007-03 (liveepi) to e. v., c. p., and v. c.
* present address: department d'enginyeria informàtica i matemàtiques
[1] modeling infectious diseases in humans and animals
[2] generalization of epidemic theory: an application to the transmission of ideas
[3] epidemics and rumours
[4] epidemic spreading in scale-free networks
[5] a simple model of global cascades on random networks
[6] modelling dynamical processes in complex socio-technical systems
[7] contact interactions on a lattice
[8] on the critical behavior of the general epidemic process and dynamical percolation
[9] cascade dynamics of complex propagation
[10] propagation and immunization of infection on general networks with both homogeneous and heterogeneous components
[11] dynamics of rumor spreading in complex networks
[12] kinetics of social contagion
[13] critical behaviors in contagion dynamics
[14] epidemic processes in complex networks
[15] resilience of the internet to random breakdowns
[16] spread of epidemic disease on networks
[17] epidemic spreading in real networks: an eigenvalue viewpoint
[18] discrete time markov chain approach to contact-based disease spreading in complex networks
[19] modern temporal network theory: a colloquium
[20] impact of non-poissonian activity patterns on spreading processes
[21] disease dynamics over very different time-scales: foot-and-mouth disease and scrapie on the network of livestock movements in the uk
[22] epidemic thresholds in dynamic contact networks
[23] how disease models in static networks can fail to approximate disease in dynamic networks
[24] representing the uk's cattle herd as static and dynamic networks
[25] impact of human activity patterns on the dynamics of information diffusion
[26] small but slow world: how network topology and burstiness slow down spreading
[27] dynamical strength of social ties in information spreading
[28] high-resolution measurements of face-to-face contact patterns in a primary school
[29] dynamical patterns of cattle trade movements
[30] multiscale analysis of spreading in a large communication network
[31] bursts of vertex activation and epidemics in evolving networks
[32] interplay of network dynamics and heterogeneity of ties on spreading dynamics
[33] predicting and controlling infectious disease epidemics using temporal networks, f1000prime rep
[34] the dynamic nature of contact networks in infectious disease epidemiology
[35] activity driven modeling of time varying networks
[36] temporal percolation in activity-driven networks
[37] contrasting effects of strong ties on sir and sis processes in temporal networks
[38] monogamous networks and the spread of sexually transmitted diseases
[39] epidemic dynamics on an adaptive network
[40] effect of social group dynamics on contagion
[41] epidemic threshold and control in a dynamic network
[42] virus propagation on time-varying networks: theory and immunization algorithms
[43] analytical computation of the epidemic threshold on temporal networks
[44] infection propagator approach to compute epidemic thresholds on temporal networks: impact of immunity and of limited temporal resolution
[45] machine learning: ecml
[46] effects of time window size and placement on the structure of an aggregated communication network
[47] epidemiologically optimal static networks from temporal network data
[48] limitations of discrete-time approaches to continuous-time contagion dynamics
[49] mathematical formulation of multilayer networks
[50] langevin approach for the dynamics of the contact process on annealed scale-free networks
[51] thresholds for epidemic spreading in networks
[52] controlling contagion processes in activity driven networks
[53] beyond the locally treelike approximation for percolation on real networks
[54] a course of modern analysis
[55] some results in floquet theory, with application to periodic epidemic models
[56] the magnus expansion and some of its applications
[57] the radiation theories of tomonaga, schwinger, and feynman
[58] optimal disorder for segregation in annealed small worlds
[59] diffusion in scale-free networks with annealed disorder
[60] recurrent outbreaks of measles, chickenpox and mumps: i. seasonal variation in contact rates
[61] epidemic thresholds of the susceptible-infected-susceptible model on networks: a comparison of numerical and theoretical results

key: cord-290033-oaqqh21e authors: georgalakis, james title: a disconnected policy network: the uk's response to the sierra leone ebola epidemic date: 2020-02-13 journal: soc sci med doi: 10.1016/j.socscimed.2020.112851 sha: doc_id: 290033 cord_uid: oaqqh21e

this paper investigates whether the inclusion of social scientists in the uk policy network that responded to the ebola crisis in sierra leone (2013–16) was a transformational moment in the use of interdisciplinary research. in contrast to the existing literature, which relies heavily on qualitative accounts of the epidemic and ethnography, this study tests the dynamics of the connections between critical actors with quantitative network analysis. this novel approach explores how individuals are embedded in social relationships and how this may affect the production and use of evidence. the meso-level analysis, conducted between march and june 2019, is based on the traces of individuals' engagement found in secondary sources. source material includes policy and strategy documents, committee papers, meeting minutes and personal correspondence. social network analysis software, ucinet, was used to analyse the data and netdraw for the visualisation of the network. far from being one cohesive community of experts and government officials, the network of 134 people was weakly held together by a handful of super-connectors. social scientists' poor connections to the government-embedded biomedical community may explain why they were most successful when they framed their expertise in terms of widely accepted concepts. the whole network was geographically and racially almost entirely isolated from those affected by or directly responding to the crisis in west africa. nonetheless, the case was made for interdisciplinarity and the value of social science in emergency preparedness and response. the challenge now is moving from the rhetoric to action on complex infectious disease outbreaks in ways that value all perspectives equally.

global health governance is increasingly focused on epidemic and pandemic health emergencies that require an interdisciplinary approach to accessing scientific knowledge to guide preparedness and crisis response. of acute concern is zoonotic disease, which can spread from animals to humans and easily cross borders. the "grave situation" of the chinese coronavirus (covid-19) outbreak seems to have justified these fears and is currently the focus of an international mobilisation of scientific and state resources (wood, 2020). covid-19 started in wuhan, the capital of china's hubei province, and has been declared a public health emergency of international concern (pheic) by the world health organisation (who). the interactions currently taking place, nationally and internationally, between evidence, policy and politics are complex and relate to theories around the role of the researcher as broker or advocate and the form and function of research policy networks (pielke, 2007), (ward et al., 2011) and (georgalakis and rose, 2019).
in this paper i seek to explore these areas further through the lens of the uk's response to ebola in west africa. this policy context has been selected in relation to the division of the affected countries between key donors. the british government assumed responsibility for sierra leone and sought guidance from health officials, academics, humanitarian agencies and clinicians. the ebola epidemic that struck west africa in 2013 has been described as a "transformative moment for global health" (kennedy and nisbett, 2015, p.2), particularly in relation to the creation of a transdisciplinary response that was meant to take into account cultural practices and the needs of communities. the mobilisation of anthropological perspectives towards enhancing the humanitarian intervention was celebrated as an example of research impact by the uk's economic and social research council (esrc) and department for international development (dfid) (esrc, 2016). an eminent group of social scientists called for future global emergency health interventions to learn from this critical moment of interdisciplinary cooperation and mutual understanding (s. a. abramowitz et al., 2015). however, there has been much criticism of this narrative, ranging from the serious doubts of some anthropologists themselves about their impact (martineau et al., 2017), to denouncements of largely european and north american anthropologists' legitimacy and the utility of their advice (benton, 2017). there are two questions i hope to address through a critical commentary on the events that unfolded and with social network analysis of the uk based research and policy network that emerged: i) how transformational was the uk policy response to ebola in relation to changes in evidence use patterns and behaviours? ii) how does the form and function of the uk policy network relate to epistemic community theory? the first question will explore the degree to which social scientists, and specifically anthropologists and medical anthropologists, were incorporated into the uk policy network. the second question seeks to locate the dynamics of this network in the literature on network theory and the role of epistemic communities in influencing policy during emergencies. the paper does not attempt to evidence the impact of anthropology in the field or take sides in hotly debated issues such as support for home care. instead, it looks at how individuals are embedded in social relationships and how this may affect the production and use of evidence (victor et al., 2017). the emerging field of network analysis around the generation and uptake of evidence in policy recommends this critical realist constructivist methodology. it utilises interactive theories of evidence use, the study of whole networks and the analysis of the connections between individuals in policy and research communities (nightingale and cromby, 2002; oliver and faul, 2018).
although ebola related academic networks have been mapped, this methodological approach has never previously been applied to the policy networks that coalesced around the international response. hagel et al. show how research on the ebola virus rapidly increased during the crisis in west africa and identified a network of institutions affiliated through co-authorship. unfortunately, their data tell us very little about the type of research being published and how it was connected into policy processes (hagel et al., 2017) . in contrast, this paper seeks to inform the ongoing movements promoting interdisciplinarity as key to addressing global health challenges. zoonotic disease has been the subject of particular concerns around the, "connections and disconnections between social, political and ecological worlds" (bardosh, 2016, p. 232) . with the outbreak of covid-19 in china at the end of 2019, its rapid spread overseas and predictions of more frequent and more deadly pandemics and epidemics in the future, the importance of breaking down barriers between policy actors, humanitarians, social scientists, doctors and medical scientists can only increase with time. before we look at detailed accounts of events relating to the uk policy network, first we must consider what the key policy issues were relating to an anthropological response versus a purely clinical one. anthropological literature exists, from previous outbreaks, documenting the cultural practices that affected the spread of ebola (hewlett and hewlett, 2007) . the main concerns relate to how local practices may accelerate the spread of the virus and the need to address these in order to lower infection rates. ebola is highly contagious, particularly from contamination by bodily fluids. in west africa, many local customs exist around burial practices that clinicians believe heighten the risk to communities. common characteristics of these are, the washing of bodies by family members, passing clothing belonging to the deceased to family and the touching of the body (richards, 2016) . another concern, as the crisis unfolded, was people attempting to provide home care to victims of the virus. the clinical response was to create isolation units or ebola treatment units (etus) in which to assess and treat suspected cases (west & von saint andré-von arnim, 2014) . community based care centres were championed by the uk government but their deployment came late and opinion was divided around their effectiveness. clinicians regarded etus as an essential part of the response and wanted to educate people to discourage them from engaging in what they regarded as deeply unsafe practices, including home care (walsh and johnson, 2018) and (msf, 2015) . anthropologists with expertise in the region focused instead on engaging communities more constructively, managing stigma and understanding local behaviours and customs (fairhead, 2014) , (richards, 2014b) and (berghs, 2014) . anthropologist, paul richards, argues that agencies' and clinicians' lack of understanding of local customs worsened the crisis (richards, 2016) and that far from being ignorant and needing rescuing from themselves, communities had coping strategies of their own. his studies from sierra leone and liberia relate how some villages isolated themselves, created their own burial teams and successfully protected those who came in contact with suspected cases with makeshift protective garments (richards, 2014a) . 
anthropologists working in west africa during the epidemic prioritised studies of social mobilisation and community engagement and worked with communities directly on ebola transmission. sharon abramowitz, in her review of the anthropological response across guinea, liberia and sierra leone, provides examples from the field work of chiekh niang (fleck, 2015) , sylvain faye, juliene anoko, almudena mari saez, fernanda falero, patricia omidian, several medicine sans frontiers (msf) anthropologists and others (s. abramowitz, 2017) . however, abramowitz argues that learning generated by these ethnographic studies was largely ignored by the mainstream response. however, not everyone has welcomed the intervention of the international anthropological community. some critics have argued that social scientists in mostly european and north american universities were poorly suited to providing sound advice given their lack of familiarity with field-based operations. adia benton suggests that predominantly white northern anthropologists have an "inflated sense of importance" that led them to exaggerate the relevance of their research. this in turn helped reinforce concepts of "superior northern knowledge" (benton, 2017, p. 520 ). this racial optic seems to contradict the portrayal of plucky anthropologists being the victims of knowledge hierarchies that favour other knowledges over their own. our focus here, on the mobilisation of knowledge from an international community of experts, recommends that we consider how this can be understood in relation to group dynamics as well as individual relationships. particularly relevant is peter haas' theory of epistemic communities. haas helped define epistemic communities and how they differ from other policy communities, such as interest groups and advocacy coalitions (haas, 1992) . they share common principles and analytical and normative beliefs (causal beliefs). they have an authoritative claim to policy relevant expertise in a particular domain and haas claims that policy actors will routinely place them above other interest groups in their level of expertise. he believes that epistemic communities and smaller more temporary collaborations within them, can influence policy. he observes that in times of crisis and acute uncertainty, policy actors often turn to them for advice. the emergence of an epistemic community focused on the uk policy response was framed by the division of the affected countries between key donors along historic colonial lines. namely, the uk was to lead in sierra leone, the united states in liberia and the french in guinea. this seems to have focused social scientists in the uk on engaging effectively with a government and wider scientific community who seemed to want to draw on their expertise. this was a relatively close-knit community of scholars who already worked together, co-published and cited each other's work and in many cases worked in the same academic institutions. crucially, their ranks were swelled by a small number of epidemiologists and medical anthropologists who shared their concerns. from the time msf first warned the international community of an unprecedented outbreak of ebola in guinea at the end of march 2014, it was six months before an identifiable and organised movement of social scientists emerged (msf, 2015) . things began to happen quickly when the who announced in early september of that year that conventional biomedical responses to the outbreak were failing (who, 2014a) . 
this acted like a siren call to social scientists incensed by the reported treatment of local communities and the way in which a narrative had emerged blaming local customs and ignorance for the rapid spread of the virus. british anthropologist, james fairhead, hastily organised a special panel on ebola at the african studies association (asa) annual conference, that was taking place at the university of sussex (uos) on the september 10, 2014. amongst the panellists were: anthropologist melissa leach, director of the institute of development studies (ids); audrey gazepo, university of ghana, medical anthropologist melissa parker from the london school of hygiene and tropical medicine (lshtm); anthropologist and public health specialist, anne kelly from kings college london and stefan elbe, uos. informally, after the conference, this group discussed the idea of an online repository or platform for the supply of regionally relevant social science (f. martineau et al., 2017) . this would later become the ebola response anthropology platform (erap). in the days and weeks that followed it was the personal and professional connections of these individuals that shaped the network engaging with the uk's intervention. just two days after the emergency panel at the asa, jeremy farrar, director of the wellcome trust, convened a meeting of around 30 public health specialists and researchers, including leach, on the uk's response to the epidemic. discussions took place on the funding and organisation of the anthropological response. the government was already drawing on the expertise and capacity of public health england (phe), the ministry of defence (mod) and the department of health (doh), to drive its response but social scientists had no seat at the table. the government's chief medical officer (cmo) sally davies called a meeting of the ebola scientific assessment and response group (esarg), on the 19th september, focused on issues which included community transmission of ebola. leach's inclusion as the sole anthropologist was largely thanks to farrar and chris whitty, dfid's chief scientific advisor (m leach, 2014). there was already broad acceptance of the need for the response to focus on community engagement and the who had been issuing guidance on how to engage and what kind of messaging to use for those living in the worst affected areas (who, 2014c) . in their account of these events three of the central actors from the uk's anthropological community describe how momentum gathered quickly and that: "it felt as if we were pushing at an open door" (f. martineau et al., 2017, 481) . by the following month, the uk's coalition government was embracing its role as the leading bilateral donor in sierra leone and wanted to raise awareness and funds from other governments and foundations. a high level conference: defeating ebola in sierra leone, had been quickly organised, in partnership with the sierra leone government, at which an international call for assistance was issued (dfid, 2014) . it was shortly after this that the cmo, at the behest of the government's cabinet office briefing room (cobra), formed the scientific advisory group for emergencies on ebola (sage). by its first meeting on the october 16, 2014, british troops were on the ground along with volunteers from the uk national health service (nhs) (stc, 2016). leach was pulled into this group along with most of the members of esarg that had met the previous month. 
it was decided in this initial meeting to set up a social science sub-group including whitty, leach and the entire steering group of the newly established erap (sage, 2014a). this included not just british-based anthropologists but also paul richards and esther mokuwa from njala university, sierra leone. from this point anthropologists appeared plugged into the government's architecture for guiding their response. there were several modes for the interaction between social scientists and policy actors that focused on the uk led response. firstly, there were the formal meetings of committees or other bodies that were set up to directly advise the uk government in london. secondly, there were the multitude of ad-hoc interactions, conversations, meetings and briefings, some of which were supported with written reports. then, there was the distribution of briefings, reports and previously published works by erap which included use of the pre-existing health, education advice and resource team (heart) platform, which already provided bespoke services to dfid in the form of a helpdesk (heart, 2019). erap was up and running by the 14th october and during the crisis the platform published around 70 open access reports which were accessed by over 16,000 users (erap, 2016). there were also a series of webinars and workshops and an online course (lshtm, 2015) . according to ids and lshtm's application to the esrc's celebrating impact awards (m. leach et al., 2016) , the policy actors that participated in these interactions included: uk government officials in dfid's london head quarters and its sierra leone country office, in the mod and the government's office for science (go-science). closest of all to the prime minister and the cabinet office was sage. they also communicated with international non-governmental organisations (ingos) like help aged international and christian aid who requested briefings or meetings. erap members advised the who via three core committees, as well as the united nations mission for ebola emergency response (unmeer) and the united nations food and agricultural organisation (unfao). by the end of the crisis members of erap had given written and oral evidence to three separate uk parliamentary inquiries. these interactions were not entirely limited to policy audiences. erap members also contributed to the design of training sessions and a handbook on psychosocial impact of ebola delivered to all the clinical volunteers from the nhs prior to their deployment from december 2014 onwards (redruk, 2014). the way in which anthropologists engaged in policy and practice seemed to reflect an underlying assumption that they would work remotely to the response and engage primarily with the uk government, multilaterals and ingos. a strength of this approach, apart from the obvious personal safety and logistical implications, was that anthropologists enjoyed a proximity to key actors in london. face to face meetings could be held and committees joined in person (f. martineau et al., 2017) . a good example of a close working relationship that required a personal interaction were the links built with two policy analysts working in the mod. not even dfid staff had made this connection and it was thanks to a member of the erap steering committee that one of these officials was able to join the sage social science subcommittee and provide a valuable connection back into the ministry (martineau et al., 2017) . 
with proximity to the uk government in london came distance from the policy professionals and humanitarians in sierra leone. just 3% of erap's initial funding was focused on field work. although this later went up, and comparative analysis on resistance in guinea and sierra leone and between ebola and lassa fever was undertaken (wilkinson and fairhead, 2017), as well as a review of the disaster emergency committee (dec) crisis appeal response (oosterhoff, 2015). there was also an evaluation of the community care centres and additional funding from dfid supported village-level fieldwork by erap researchers from njala university, leading to advice to social mobilisation teams. nonetheless, the network's priority was on giving advice to donors and multilaterals, albeit at a great distance from the action. this type of intervention has not escaped accusations of "armchair anthropology" (delpla and fassin, 2012, in s. abramowitz, 2017, p. 430). rather than relying solely on this qualitative account, drawn largely from those directly involved in these events, social network analysis (sna) produces empirical data for exploring the connections between individuals and within groups (crossley and edwards, 2016). it is a quantitative approach rooted in graph theory and covers a range of methods which are frequently combined with qualitative methods (s. p. borgatti et al., 2018). in this case, the network comprises nodes, who are the individuals identified as being directly involved in some of the key events just described. a second set of nodes are the events or interactions themselves. content analysis of secondary sources linked to these events provides an unobtrusive method for identifying in some detail the actors who will have left traces of their involvement. sna allows us to establish these actors' ties to common nodes (they were part of the same committee or event, or contributed to the same reports). furthermore, we can assign non-network related attributes to each of our nodes, such as gender, location, role and organisation affiliation type. not only does this approach provide a quantitative assessment of who was involved and through which channels, but the mathematical foundations of sna allow for whole network analysis of cohesion across certain groups. you may calculate levels of homophily (the tendency of individuals to associate with similar others) between genders, disciplines and organisational type and identify sub-networks and the super-connectors that bridge them (s. p. borgatti et al., 2018). the descriptive and statistical examination of graphs provides a means with which to test a hypothesis and associated network theory that is concerned with the entirety of the social relations in a network and how these affect the behaviour of individuals within them (stovel and shaw, 2012) and (ward et al., 2011). the quantitative analysis of secondary sources was conducted between march and june 2019, utilising content analysis of artefacts which included reports, committee papers, public statements, policy documentation and correspondence. sna software, ucinet, was used to analyse nodes and ties and netdraw for the visualisation of the network (s. p. borgatti et al., 2002) and (s. p. borgatti et al., 2018). the source material is limited to artefacts relating to the uk government's response to the ebola outbreak in sierra leone, with focal events selected on the basis of: ii) the apparent prominence or influence of these groups on the uk's response to the crisis, and iii) the remit of these groups to focus on the social response, as opposed to the purely clinical one.
tracing the events and policy moments which reveal how individual social scientists engaged with the ebola crisis from mid-2014 requires one to look well beyond academic literature. whilst some of this material is openly available, a degree of insider knowledge is required to identify who the key actors were and the modes of their engagement. this is partly a reflection of a sociological approach to policy research that treats networks, only partially visible in the public domain, as a social phenomenon (gaventa, 2006) . the calculation of network homogeneity (how interconnected the network is), the identification of cliques or sub-networks and the centrality of particular nodes, can be mathematically stable measures of the function of the network. however, the reliability of this study mainly resides on its validity. the assignment of attributes is in some cases fairly subjective. whereas gender and location are verifiable, the choice of whether an individual is an international policy actor or a national policy actor must be inferred from their official role during the crisis period. sometimes this can be based on the identity of their home institution. given dfid's central focus on overseas development assistance, its officials have been classified as internationals, rather than nationals. in some cases, individuals may be qualified clinicians or epidemiologists but their role in the crisis may have been primarily policy related and not medical or scientific. therefore, they are classified as policy actors not scientists. other demographic attributes could have been identified such as race and age which would have enabled more options for data analysis. a key factor here is the use of a two mode matrix that identifies connections via people participating in the same events or forums, rather than direct social relationships such as friendship. therefore, measurement validity is largely determined by whether connections of this type can be used to determine how knowledge and expertise flow between individuals. to mitigate the risk that this measurement fails to capture knowledge exchange toward policy processes, particular care was taken with the sampling of focal events used to generate the network. the majority of errors in sna relate to the omission of nodes or ties. fig. 1 sets out the advantages and disadvantages of each of the selected events and the data artefacts used to identify associated individuals. i am aware that some critics might take exception to my choice of network. it is sometimes suggested that by focusing on northern dominated networks or the actions of bilaterals and multilaterals, you simply reinforce coloniality and a racist framing of development and aid (richardson et al., 2016) and (richardson, 2019) . however, there is a valid, even essential, purpose here. only by seeking to understand the politics of knowledge and the social and political dynamics of global health and humanitarian networks can we challenge injustice and historically reinforced narratives that favour some perspectives over others. the secondary sources identify 134 unique individuals, all but five of whom can be identified by name. four types of attribute are assigned to these nodes: gender, location (global north or south), organisation type and organisational role. attributes have been identified through an internet search of institutional websites, linkedin and related online resources. role and organisation type are recorded for the period of the crisis. 
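the two-mode person-by-event structure described above can be represented directly as a bipartite graph; the following is a minimal sketch in python with networkx (the study itself used ucinet and netdraw), with hypothetical, made-up people, attributes and events purely for illustration:

import networkx as nx

# hypothetical people, attributes and focal events, purely for illustration
people = {
    "researcher_a": {"gender": "f", "location": "north", "role": "social scientist"},
    "official_b": {"gender": "m", "location": "north", "role": "policy actor"},
    "clinician_c": {"gender": "f", "location": "south", "role": "scientist other"},
}
events = ["sage_subgroup", "asa_panel", "erap_report"]

# two-mode (person x event) graph: a tie records a trace of participation
B = nx.Graph()
for person, attrs in people.items():
    B.add_node(person, bipartite="person", **attrs)
for event in events:
    B.add_node(event, bipartite="event")
B.add_edges_from([
    ("researcher_a", "sage_subgroup"),
    ("researcher_a", "asa_panel"),
    ("official_b", "sage_subgroup"),
    ("clinician_c", "erap_report"),
])

keeping attributes on the person nodes is what later allows homophily and role-based clustering to be measured alongside the network ties themselves.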
the total number of nodes given at the bottom of fig. 2 is slightly lower due to the anonymity of five individuals whose gender and role could not be established. looking at this distribution of attributes across the whole network, one can make the following observations in relation to how prominently different characteristics are represented: i. females slightly outnumber males in the social science category, but there are twice as many male 'scientists other' than female. they are a combination of clinicians, virologists, epidemiologists and other biomedical expertise. ii. there are just nine southern based nodes out of a total of 134, and none of these are policy makers or practitioners. this is racially and geographically a northern network with just a sliver of west african perspectives. these included yvonne aki-sawyerr, veteran ebola campaigner and current mayor of freetown, four academics from njala university and development professionals working in the sierra leone offices of agencies such as the unfao. iii. although 'scientists other' only just outnumber social scientists, this is heavily skewed by one of the eight interaction nodes (the lessons for development conference), which was primarily a learning event and not part of the advisory processes around the response. many individuals who participated in this event are not active in any of the other seven interactions. if we remove these non-active nodes from the network, we get just 23 social scientists compared to 32 'scientist other'. the remaining core policy network of 77 individuals appears to be weighted towards the biomedical sciences. netdraw's standardised graph layout algorithm has been used in fig. 3 to optimise distances between nodes, which helps to visualise cohesive sub-groups or sub-networks and produces a less cluttered picture (s. p. borgatti et al., 2018). however, it should be noted that graph layout algorithms provide aesthetic benefits at the expense of attribute based or values based accuracy. the exact length of ties between nodes and their position do not correspond exactly to the quantitative data. we can drag one of these nodes into another position to make it stand out more clearly without changing its mathematical or sociological properties (s. p. borgatti et al., 2018). we can see in this graph layout the clustering of the eight interactive nodes or focal events and observe some patterns in the attributes of the nodes closest to them. the right-hand side is heavily populated with social scientists. as mentioned above, this is influenced by the lessons for development event. as you move to the left side, fewer social scientists are represented and they are outnumbered by other disciplines. the state owned or driven interactions, such as sage and parliamentary committees, appear on this left side, and the anthropological epistemic community driven or owned interactions, such as erap reports and lessons for development, appear on the right side. the apparent connectors or bridges are in the centre. these bridges can be conceptualised as both focal events, including the erap steering committee, the sage social science sub-committee and the asa ebola panel, or as the key individual nodes connected to these. we know that many informal interactions between researchers, officials and humanitarians are not captured here. we are only seeing a partial picture of the network, traces of which remain preserved in documents pertaining to the eight nodal events sampled.
nonetheless, so far the quantitative data seem to correspond closely with the qualitative accounts of the crisis. also of interest is the visual representation of organisation affiliation. all bar one of the 39 social scientists (in the whole network, fig. 3) are affiliated to a research organisation, whereas one third of the members of other scientific disciplines are attached to government institutions, donors or multilaterals. these are the public health officials and virologists working in the doh, phe and elsewhere. they appear predominantly on the left side, with much stronger proximity to government led initiatives. however, it is also clear that whilst social scientists are a small minority in the government led events, the right side of the graph includes a significant number of practitioners, policy actors and clinicians. it is this part of the network that most closely resembles an inter-epistemic community. for the centrally located bridging nodes we can see a small number of social scientists and policy actors embedded in government. as accounts of the crisis have suggested, these individuals appear to have been the super-connectors. a final point of clarification is that this is not a map showing the actual knowledge flow between actors during the crisis. each of the spider shaped sub-networks represents co-occurrence of individuals on committees, panels and other groups. we can infer from this some likelihood of knowledge exchange but we cannot measure this. one exception to these co-occurrence types of tie between nodes is the erap reports (bottom right), which reveal a cluster of nodes who contributed to reports along with those who requested them. even though this represents a knowledge flow of sorts, we can still only record the interaction and make assumptions about the actual flow of knowledge. a variation of degree centrality, eigenvector centrality, counts the number of nodes adjacent to a given node and weighs each adjacent node by its centrality. the eigenvector equation, used by netdraw, calculates each node's centrality proportionally to the sum of the centralities of the nodes it is adjacent to (s. p. borgatti et al., 2018). netdraw increases the size of nodes in relation to their popularity or eigenvector value. the better connected nodes are to others who are also well connected, the larger the nodes appear (s. p. borgatti et al., 2018). in order to focus on the key influencers or knowledge brokers in the network, we entirely remove nodes solely connected to the lessons for development conference. as mentioned earlier, this event is a poor proxy for research-policy interactions and unduly over-represents social scientists who were otherwise unconnected to advisory or knowledge exchange activities. this reduces the number of individuals in the network from 134 to 77. we also utilise ucinet's transform function to convert the two-mode incidence matrix into a one mode adjacency matrix (s. p. borgatti et al., 2018). ties between nodes are now determined by connections through co-occurrence. we no longer need to see the events and committees themselves but can visualise the whole network as a social network of connected individuals. we can now observe and mathematically calculate how inter-connected or homogeneous this research-policy network really is. we see in fig. 4 a more exaggerated separation of social science and other sciences on the right and left of the graph than in fig. 3. we can also see three distinct sub-networks emerging, bridged by six key nodes with high centrality values.
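the projection and centrality steps described above have straightforward equivalents outside ucinet. eigenvector centrality solves x_i ∝ Σ_j a_ij x_j, so a node scores highly when its neighbours also score highly. the sketch below assumes python with networkx and a person-by-event bipartite graph B of the kind sketched earlier; it is an illustration, not the study's own procedure:

import networkx as nx
from networkx.algorithms import bipartite

def core_network_summary(B, top_n=5):
    """project a person-by-event graph to a person-by-person network,
    then rank people by eigenvector centrality and measure role homophily."""
    persons = [n for n, d in B.nodes(data=True) if d.get("bipartite") == "person"]
    # ucinet-style two-mode to one-mode conversion: ties weighted by co-occurrence
    G = bipartite.weighted_projected_graph(B, persons)
    # eigenvector centrality: a node scores highly if its neighbours score highly
    centrality = nx.eigenvector_centrality(G, weight="weight", max_iter=1000)
    super_connectors = sorted(centrality, key=centrality.get, reverse=True)[:top_n]
    # attribute assortativity as a simple homophily measure on the 'role' attribute
    homophily = nx.attribute_assortativity_coefficient(G, "role")
    return super_connectors, homophily

called on a bipartite graph like the one sketched earlier, this returns the most central individuals (candidate super-connectors) and a homophily coefficient for the role attribute of the kind discussed below.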
the highly interconnected sub-network on the right is shaped in part by erap and the production of briefings and their supply to a small number of policy actors. we can see here the visualisation of slightly higher centrality scores than for the government scientific advisors on the left. by treating this as a relational network we observe that interactions like the establishment of a sage sub-group for social scientists increased the homophily of the right side of the network and reduced its interconnectivity with the whole network. although, one must be cautious about assigning too much significance to the position of individual nodes in a whole network analysis, the central location of the two social scientists and a dfid official closely correspond to the accounts of the crisis. this heterogeneous brokerage demonstrates the tendency of certain types of actors to be the sole link between dissimilar nodes (hamilton et al., 2020) . likewise, some boundary nodes or outliers, such as one of the mod's advisors at the bottom of the network, are directly mentioned in the qualitative accounts. just four individuals in this whole network are based in africa, suggesting almost complete isolation from humanitarians operating on the ground and from african scholarship. both the qualitative accounts of the role of anthropologists in the crisis and the whole network analysis presented here largely, correspond with haas' definition of epistemic communities. the international community of anthropologists and medical anthropologists that mobilised in autumn 2014 do indeed share common principles and analytical and normative beliefs. debates around issues, such as the level to which communities could reduce transmission rates themselves, did not prevent this group from providing a coherent response to the key policy dilemmas. this community did indeed emerge or coalesce around the demand placed on their expertise by policy makers concerned with the community engagement dimensions of the response. in the area of burial practices, there does appear to be some indication of the knowledge of social scientists being incorporated into the response. various interactions between anthropologists, dfid and the who did provide the opportunity to raise the socio-political-economic significance of funerals. for example, it was explained that the funerals of high status individuals would be much more problematic in terms of the numbers of people exposed (f. martineau et al., 2017) . anthropologists contributed to the writing of the who's guidelines for safe and dignified burials (who, 2014b). however, their advice was only partially incorporated into these guidelines and the wider policies of the who at the time. the suggestion for a radical decentralised approach to formal burial response that would require the creation of community-based burial teams was ignored until much later in the crisis and never fully implemented. as loblova and dunlop suggest in their critique of epistemic community theory, the extent to which anthropology could influence policy was bounded by the beliefs and understanding of policy communities themselves (löblová, 2018) and (dunlop, 2017) . olga loblova argues that there is a selection bias in the tendency to look at case studies where there has been a shift in policy along the lines of the experts' knowledge. likewise, claire dunlop suggests that haas' framework may exaggerate their influence on policy. 
she separates the power of experts to control the production of knowledge and engage with key policy actors from policy objectives themselves. she refers to adult education literature and its implications for what decision makers learn from epistemic communities, or to put it another way, the cognitive influence of research evidence (dunlop, 2009) . she argues that the more control that knowledge exchange processes place with the "learners" in terms of framing, content and the intended policy outcomes, the less influential epistemic communities will be (dunlop, 2017) . hence, in contested areas such as home care, it was the more embedded and credible clinical epistemic community that prevailed. from october 2014, anthropologists were arguing that given limited access to etus, which were struggling at that time, home care was an inevitability and so should be supported. where they saw the provision of home care kits as an ethical necessity, many clinicians, humanitarians and global health professionals regarded home care as deeply unethical with the potential to lead to a two tier system of support (f. martineau et al., 2017) and (whitty et al., 2014) . in sierra leone, irish diplomat sinead walsh was baffled by what she saw as the blocking of the distribution of home care kits. an official from the us centres for disease control and protection (cdc) was quoted in an article in the new york times as saying that home care was: "admitting defeat" (nossiter, 2014) in (walsh and johnson, 2018) . home care was never prioritised in sierra leone whereas in liberia hundreds of thousands of kits were distributed (walsh and johnson, 2018) . in this area, clinicians, humanitarians and policy actors seemed to maintain a policy position directly opposed to anthropological based advice. network theory provides further evidence around why this may have been the case. in his study of uk think tanks, jordan tchilingirian suggests that policy think tanks operate on the periphery of more established networks and enjoy fluctuating levels of support and interest in their ideas. ideas and knowledge do not simply flow within the network, given that dominant paradigms and political, social and cultural norms privilege better established knowledge communities (tchilingirian, 2018) . this is reminiscent of meyer's work on the boundaries that exist between "amateurs" and "policy professionals" (meyer, 2008) . moira faul's research on global education policy networks proposes that far from being "flat," networks can augment existing power relations and knowledge hierarchies (faul, 2016) . this is worth considering when one observes how erap's supply of research knowledge and the sage sub-committee for anthropologists only increased the homophily of the social science sub-community, leaving it weakly connected to the core policy network (fig. 4.) . the positive influence of anthropological advice on the uk's response was cited by witnesses to the subsequent parliamentary committee inquiries in 2016. however, there is some indication of different groups or networks favouring different narratives. the international development select committee (idc) was very clear in its final report that social science had been a force for good in the response and recommended that dfid grow its internal anthropological capacity (idc, 2016a, b) . 
this contrasts with the report of the science and technology committee (stc), which, despite including evidence from at least one anthropologist, does not make a direct reference to anthropology in its report (stc, 2016). this is perhaps a case of public health officials reasserting their established authority in their core domain of infectious disease outbreaks. this sector has been described as the uk's "biomedical bubble", which benefits from much higher public support and funding than the social sciences (jones and wilsdon, 2018). the mere presence of anthropologists in an evidence session of the stc is a very rare event, in contrast to the idc, which regularly reaches out to social scientists. not everyone agrees that the threat of under-investing in social science was the primary issue. the stc's report highlights the view that there was a lack of front line clinicians represented on committees advising the uk government, particularly from aid organisations (stc, 2016). regardless of assessments of how successfully anthropological knowledge influenced policy and practice during the epidemic, there has been a subsequent elevation of social science in global health preparedness and humanitarian response programmes. writing on behalf of the wellcome trust in 2018, joão rangel de almeida says: "epidemics are a social phenomenon as much as a biological one, so understanding people's behaviours and fears, their cultural norms and values, and their political and economic realities is essential too." (rangel de almeida, 2018). the social science in humanitarian action platform, which involves many of the same researchers who were part of the sierra leone response, has subsequently been supported by unicef, usaid and the joint initiative on epidemic preparedness (jiep) with funding from dfid and wellcome. its network of social science advisers has been producing briefings to assist with the ebola response in the democratic republic of congo (drc) (farrar, 2019) and has mobilised in response to the covid-19 respiratory illness epidemic. network theory provides a useful framework with which to explore the politics of knowledge in global health, with its emphasis on individuals' social context. by analysing data pertaining to researchers' and policy professionals' participation in policy networks, one can test assumptions around interdisciplinarity and identify powerful knowledge gatekeepers. detailed qualitative accounts of policy processes need not be available, as they were in this case, to employ this methodology. assuming the researcher has access to meeting minutes and other records of who attended which events or who was a member of which committees and groups, similar analysis of network homophily and centrality will be possible (a minimal sketch of such an analysis is given below). the greatest potential for learning, with significant policy and research implications, comes from mixed methods approaches. by combining qualitative research to populate your network with a further round of data gathering to understand it better, you can reveal the social and political dynamics truly driving evidence use and decision making (oliver and faul, 2018). although this study lacked this scope, it has still successfully identified the shape of the research-policy network that emerged around the uk-led response to ebola and the clustering of actors within it. the network was a diverse group of scientists, practitioners and policy professionals. however, it favoured the views of government scientists with their emphasis on epidemiology and the medical response.
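the kind of affiliation-network analysis described above can be sketched in a few lines of python with networkx: committee or meeting attendance records define a two-mode graph, its one-mode projection links people who co-attended an event, and centrality together with the krackhardt-stern e-i index then quantify brokerage and group homophily. the attendance records, disciplines and names below are hypothetical placeholders, not the sage/erap data analysed here.

```python
# minimal sketch: homophily and centrality from committee-attendance records.
# the data below are hypothetical; a real analysis would load meeting minutes.
import networkx as nx
from networkx.algorithms import bipartite

attendance = {  # person -> committees/meetings attended (placeholder names)
    "epidemiologist_a": {"sage_1", "sage_2"},
    "modeller_b": {"sage_1", "sage_2", "sage_3"},
    "anthropologist_c": {"sage_social_science", "erap_briefing"},
    "anthropologist_d": {"sage_social_science", "erap_briefing", "sage_3"},
    "dfid_official_e": {"erap_briefing", "sage_2"},
}
discipline = {  # node attribute used to test homophily
    "epidemiologist_a": "biomedical", "modeller_b": "biomedical",
    "anthropologist_c": "social_science", "anthropologist_d": "social_science",
    "dfid_official_e": "policy",
}

# two-mode (person x event) graph, then one-mode projection onto people:
# two people are linked if they co-attended at least one event
B = nx.Graph()
for person, events in attendance.items():
    for event in events:
        B.add_edge(person, event)
people = set(attendance)
G = bipartite.weighted_projected_graph(B, people)

# centrality: who brokers between otherwise poorly connected parts of the network
betweenness = nx.betweenness_centrality(G)

def ei_index(graph, attr, group):
    """krackhardt-stern e-i index for one group: (external - internal) / all ties."""
    internal = external = 0
    for u, v in graph.edges():
        if attr[u] == group or attr[v] == group:
            if attr[u] == attr[v]:
                internal += 1
            else:
                external += 1
    ties = internal + external
    return (external - internal) / ties if ties else float("nan")

print(sorted(betweenness.items(), key=lambda kv: -kv[1]))
print("e-i index (social_science):", ei_index(G, discipline, "social_science"))
```

values of the e-i index close to -1 indicate a highly homophilous sub-group, the pattern reported above for the social science cluster.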
it was also almost entirely lacking in west african members. nonetheless, it was largely thanks to a strong political demand for anthropological knowledge, in response to perceived community violence and distrust, that social scientists got a seat at the table. this was brokered by a small group of individuals from both government and research organisations, who had prior relationships to build on. the emergent inter-epistemic community was only partially connected into the policy network and we should reject the description of the whole network as trans-disciplinary. social scientists were most successful in engaging when they framed their expertise in terms of already widely accepted concepts, such as the need for better communications with communities. they were least successful when their evidence countered strongly held beliefs in areas such as home care. their high level of homophily as a group, or sub network, only deepened the ability of decision makers to ignore them when it suited them to do so. the epistemic community's interactivity with uk policy did not significantly alter policy design or implementation and it did not challenge fundamentally eurocentric development knowledge hierarchies. it was transformative only in as much as it helped the epistemic community itself learn how to operate in this environment. the real achievement has been on influencing longer term evidence use behaviours. they made the case for interdisciplinarity and the value of social science in emergency preparedness and response. the challenge now is moving from the rhetoric to action on complex infectious disease outbreaks. as demonstrated by ebola in drc and covid-19, every global health emergency we face will have its own unique social and political dimensions. we must remain cognisant of the learning arising from the international response to sierra leone's tragic ebola epidemic. it suggests that despite the increasing demand for interdisciplinarity, social science evidence is frequently contested and policy networks have a strong tendency to leave control over its production and use in the hands of others. credit authorship contribution statement james georgalakis: conceptualization, methodology, software, formal analysis, investigation, data curation, writing -original draft, visualization. epidemics (especially ebola) social science intelligence in the global ebola response one health : science, politics and zoonotic disease in africa ebola at a distance: a pathographic account of anthropology's relevance stigma and ebola: an anthropological approach to understanding and addressing stigma operationally in the ebola response ucinet 6 for windows. analytic technologies cases, mechanisms and the real: the theory and methodology of mixed-method social network analysis une histoire morale du temps present policy transfer as learning: capturing variation in what decisionmakers learn from epistemic communities the irony of epistemic learning: epistemic communities, policy learning and the case of europe's hormones saga ebola response anthropology platform erap milestone achievements up until the global community must unite to intensify ebola response in the drc networks and power: why networks are hierarchical not flat and what can be done about it the human factor. 
world health organization finding the spaces for change: a power analysis introduction: identifying the qualities of research-policy partnerships in international development-a new analytical framework introduction: epistemic communities and international policy coordination analysing published global ebola virus disease research using social network analysis evaluating heterogeneous brokerage: new conceptual and methodological approaches and their application to multi-level environmental governance networks health. education advice and resource team ebola, culture and politics: the anthropology of an emerging disease responses to the ebola crisis ebola: responses to a public health emergency. house of commons the biomedical bubble: why uk research and innovation needs a greater diversity of priorities the ebola epidemic: a transformative moment for global health ebola: engaging long-term social science research to transform epidemic response when epistemic communities fail: exploring the mechanism of policy influence online course: ebola in context: understanding transmission, response and control epistemologies of ebola: reflections on the experience of the ebola response anthropology platform on the boundaries and partial connections between amateurs and professionals pushed to the limit and beyond social constructionism as ontology: exposition and example a hospital from hell networks and network analysis in evidence, policy and practice ebola crisis appeal response review social science research: a much-needed tool for epidemic control. wellcome. redruk, 2014. pre-departure ebola response training burial/other cultural practices and risk of evd transmission in the mano river region burial/other cultural practices and risk of evd transmission in the mano river region ebola: how a people's science helped end an epidemic on the coloniality of global public health biosocial approaches to the 2013-2016 ebola pandemic. health hum. rights 18, 115. sage scientific advisory group for emergencies -ebola summary minute of 2nd meeting scientific advisory group for emergencies -ebola summary minute of 3rd meeting science in emergencies: uk lessons from ebola. house of commons. stovel producing knowledge, producing credibility: british think-tank researchers and the construction of policy reports the oxford handbook of political networks getting to zero: a doctor and a diplomat on the ebola frontline network analysis and political science clinical presentation and management of severe ebola virus disease infectious disease: tough choices to reduce ebola transmission key messages for social mobilization and community engagement in intense transmission areas comparison of social resistance to ebola response in sierra leone and guinea suggests explanations lie in political configurations not culture coronavirus: china president warns spread of disease 'accelerating', as canada confirms first case. the independent acknowledgements i thank dr jordan tchilingirian (university of western australia) for discussions and support on ucinet. i thank professor melissa leach and dr annie wilkinson (institute of development studies) for access to archival data. key: cord-019055-k5wcibdk authors: pacheco, jorge m.; van segbroeck, sven; santos, francisco c. 
title: disease spreading in time-evolving networked communities date: 2017-10-05 journal: temporal network epidemiology doi: 10.1007/978-981-10-5287-3_13 sha: doc_id: 19055 cord_uid: k5wcibdk human communities are organized in complex webs of contacts that may be represented by a graph or network. in this graph, vertices identify individuals and edges establish the existence of some type of relations between them. in real communities, the possible edges may be active or not for variable periods of time. these so-called temporal networks typically result from an endogenous social dynamics, usually coupled to the process under study taking place in the community. for instance, disease spreading may be affected by local information that makes individuals aware of the health status of their social contacts, allowing them to reconsider maintaining or not their social contacts. here we investigate the impact of such a dynamical network structure on disease dynamics, where infection occurs along the edges of the network. to this end, we define an endogenous network dynamics coupled with disease spreading. we show that the effective infectiousness of a disease taking place along the edges of this temporal network depends on the population size, the number of infected individuals in the population and the capacity of healthy individuals to sever contacts with the infected, ultimately dictated by availability of information regarding each individual’s health status. importantly, we also show how dynamical networks strongly decrease the average time required to eradicate a disease. understanding disease spreading and evolution involves overcoming a multitude of complex, multi-scale challenges of mathematical and biological nature [1, 2] . traditionally, the contact process between an infected individual and the susceptible ones was assumed to affect equally any susceptible in a population (mean-field approximation, well-mixed population approximation) or, alternatively, all those susceptible living in the physical neighborhood of the infected individual (spatial transmission). during recent years, however, it has become clear that disease spreading [2] [3] [4] [5] transcends geography: the contact process is no longer restricted to the immediate geographical neighbors, but exhibits the stereotypical small-world phenomenon [6] [7] [8] [9] , as testified by recent global pandemics (together with the impressive amount of research that has been carried out to investigate them) or, equally revealing, the dynamics associated with the spreading of computer viruses [5, [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] . recent advances in the science of networks [3, 4, 19, 24, 25] also provided compelling evidence of the role that the networks of contacts between individuals or computers play in the dynamics of infectious diseases [4, 7] . in the majority of cases in which complex networks of disease spreading have been considered [9] , they were taken to be a single, static entity. however, contact networks are intrinsically temporal entities and, in general, one expects the contact process to proceed along the lines of several networks simultaneously [11, 13-16, 18, 23, 24, 26-36] . 
in fact, modern societies have developed rapid means of information dissemination, both at local and at centralized levels, which one naturally expects to alter individuals' response to vaccination policies, their behavior with respect to other individuals and their perception of likelihood and risk of infection [37] . in some cases one may even witness the adoption of centralized measures, such as travel restrictions [38, 39] or the imposition of quarantine spanning parts of the population [40] , which may induce abrupt dynamical features onto the structure of the contact networks. in other cases, social media can play a determinant role in defining the contact network, providing crucial information on the dynamical patterns of disease spreading [41] . furthermore, the knowledge an individual has (based on local and/or social media information) about the health status of acquaintances, partners, relatives, etc., combined with individual preventive strategies [42] [43] [44] [45] [46] [47] [48] [49] [50] (such as condoms, vaccination, the use of face masks or prophylactic drugs, avoidance of visiting specific web-pages, staying away from public places, etc.), also leads to changes in the structure and shape of the contact networks that naturally acquire a temporal dimension that one should not overlook. naturally, the temporal dimension and multitude of contact networks involved in the process of disease spreading render this problem intractable from an analytic standpoint. recently, sophisticated computational platforms have been developed to deal with disease prevention and forecast [5, 10, 11, 18, 27, [29] [30] [31] [32] [33] [34] [35] [36] [51] [52] [53] [54] [55] . the computational complexity of these models reflects the intrinsic complexity of the problem at stake, and their success relies on careful calibration and validation procedures requiring biological and socio-geographic knowledge of the process at stake. our goal here, instead, will be to answer the following question: what is the impact of a temporal contact network structure in the overall dynamics of disease progression? does one expect that it will lead to a rigid shift of the critical parameters driving disease evolution, as one witnesses whenever one includes spatial transmission patterns? or even to an evanescence of their values whenever one models the contact network as a (static and infinite) scale-free network, such that the variance of the network degree distribution becomes arbitrarily large? or will the temporal nature of the contact network lead to new dynamical features? and, if so, which features will emerge from the inclusion of this temporal dimension? to answer this question computationally constitutes, in general, a formidable challenge. we shall attempt to address the problem analytically, and to this end some simplifications will be required. however, the simplifications we shall introduce become plausible taking into consideration recent results (i) in the evolutionary dynamics of social dilemmas of cooperation, (ii) in the dynamics of peer-influence, and even (iii) in the investigation of how individual behavior determines and is determined by the global, population wide behavior. all these recent studies point out to the fact that the impact of temporal networks in the population dynamics stems mostly from the temporal part itself, and not so much from the detailed shape and structure of the network [56] [57] [58] [59] [60] [61] [62] [63] . 
indeed, we now know that (i) different models of adaptive network dynamics lead to similar qualitative features regarding their impact in what concerns the evolution of cooperation [56] [57] [58] [59] [60] [61] [62] [63] , (ii) the degree of peer-influence is robust to the structural patterns associated with the underlying social networks [62] , and (iii) the impact of temporal networks in connecting individual to collective behavior in the evolution of cooperation is very robust and related to a problem of n-body coordination [61, 63] . altogether, these features justify that we model the temporal nature of the contact network in terms of a simple, adaptive network, the dynamics of which can be approximately described in terms a coupled system of odes. this "adaptive-linking" dynamics, as it was coined [28, [57] [58] [59] , leads to network snapshot structures that do not replicate what one observes in real-life, in the same sense that the small-world model of watts and strogatz does not lead to the heterogeneous and diverse patterns observed in data snapshots of social networks. notwithstanding, the active-linking dynamics allows us to include, analytically, the temporal dimension into the problem of disease dynamics. the results [28] , as we elaborate in sects. 3 and 4, prove rewarding, showing that the temporal dimension of a contact network leads to a shift of the critical parameters (defined below) which is no longer rigid but, instead, becomes dependent on the frequency of infected individuals in the population. this, we believe, constitutes a very strong message with a profound impact whenever one tries to incorporate the temporal dimension into computational models of disease forecast. this chapter is organized as follows. in the following sect. 2, we introduce the standard disease models we shall employ, as well as the details of the temporal contact network model. section 3 is devoted to present and discuss the results, and sect. 4 contains a summary of the main conclusions of this work. in this section, we introduce the disease models we shall employ which, although well-known and widely studied already, are here introduced in the context of stochastic dynamics in finite populations, a formulation that has received less attention than the standard continuous model formulation in terms of coupled ordinary differential equations (odes). furthermore, we introduce and discuss in detail the temporal contact network model. here we introduce three standard models of disease transmission that we shall employ throughout the manuscript, using this section at profit to introduce also the appropriate notation associated with stochastic dynamics of finite populations and the markov chain techniques that we shall also employ in the remainder of this chapter. we shall start by discussing the models in the context of well-mixed populations, which will serve as a reference scenario for the disease dynamics, leaving for the next section the coupling of these disease models with the temporal network model described below. we investigate the popular susceptible-infected-susceptible (sis) model [2, 4] , the susceptible-infected (si) model [2] used to study, e.g., aids [2, 64] , and the susceptible-infected-recovered (sir) model [2, 65] , more appropriate to model, for instance, single season flu outbreaks [2] or computer virus spreading [7] . 
it is also worth pointing out that variations of these models have been used to successfully model virus dynamics and the interplay between virus dynamics and the response of the immune system [66] . in the sis model individuals can be in one of two epidemiological states: infected (i) or susceptible (s). each disease is characterized by a recovery rate (δ) and an infection rate (λ). in an infinite, well-mixed population, the fraction of infected individuals (x) changes in time according to the differential equation ẋ = λ⟨k⟩xy − δx, where y = 1 − x is the fraction of susceptible individuals and ⟨k⟩ the average number of contacts of each individual [4] . there are two possible equilibria (ẋ = 0): x = 0 and x = 1 − 1/r₀, where r₀ = λ⟨k⟩/δ denotes the basic reproductive ratio. the value of r₀ determines the stability of these two equilibria: x = 1 − 1/r₀ is stable when r₀ > 1 and unstable when r₀ < 1. let us now move to finite populations, and consider the well-mixed case where the population size is fixed and equal to n. we define a discrete stochastic markov process describing the disease dynamics associated with the sis model. each configuration of the population, which is defined by the number of infected individuals i, corresponds to one state of the markov chain. time evolves in discrete steps and two types of events may occur which change the composition of the population: infection events and recovery events. this means that, similar to computer simulations of the sis model on networked populations, at most one infection or recovery event will take place in each (discrete) time step. thus, the dynamics can be represented as a markov chain m with n+1 states [67, 68] -as many as the number of possible configurations -illustrated in fig. 13.1. in a finite, well-mixed population, the number i of infected will decrease at a rate given by t⁻(i) = (1/τ₀)(i/n)δ (13.1), where τ₀ denotes the recovery time scale, i/n the probability that a randomly selected individual is infected and δ the probability that this individual recovers. adopting τ₀ as a reference, we assume that the higher the average number of contacts ⟨k⟩, the smaller the time scale τ_inf at which infection update events occur (τ_inf = τ₀/⟨k⟩) [4] . consequently, the number of infected will also increase at a rate given by t⁺(i) = (1/τ_inf)((n − i)/n)(i/(n − 1))λ = (⟨k⟩/τ₀)((n − i)/n)(i/(n − 1))λ (13.2). equations (13.1) and (13.2) define the transitions between different states. this way, we obtain the transition matrix p of m (eq. 13.3), a tridiagonal matrix in which each element p_kj represents the probability of moving from state k to state j during one time step: p_{i,i−1} = t⁻(i), p_{i,i+1} = t⁺(i) and p_{i,i} = 1 − t⁻(i) − t⁺(i). the state without any infected individual (i = 0) is an absorbing state of m. in other words, the disease always dies out and will never re-appear, once this happens. at this level of approximation, it is possible to derive an analytical expression for the average time t_i it takes to reach the single absorbing state of the sis markov chain (i.e., the average time to absorption) starting from a configuration in which there are i infected individuals.
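the markov chain just defined is small enough to be handled directly; the following minimal sketch builds the transition matrix from eqs. (13.1)-(13.2) (with τ₀ = 1) and obtains the average time to absorption t_i by solving the associated linear system. parameter values are arbitrary illustrations and the code is not the authors' implementation.

```python
# minimal numerical sketch of the finite-population sis markov chain (eqs. 13.1-13.3)
# and of its average time to absorption. tau_0 = 1; parameters are arbitrary.
import numpy as np

N, k_avg, lam, delta = 100, 10, 0.03, 0.2   # population, <k>, infection, recovery

def T_minus(i):                              # eq. (13.1): recovery events
    return (i / N) * delta

def T_plus(i):                               # eq. (13.2): infection events
    return k_avg * ((N - i) / N) * (i / (N - 1)) * lam

# tridiagonal transition matrix (eq. 13.3): at most one event per time step
P = np.zeros((N + 1, N + 1))
for i in range(N + 1):
    if i > 0:
        P[i, i - 1] = T_minus(i)
    if i < N:
        P[i, i + 1] = T_plus(i)
    P[i, i] = 1.0 - T_minus(i) - T_plus(i)
assert np.allclose(P.sum(axis=1), 1.0) and P[0, 0] == 1.0   # i = 0 is absorbing

# average time to absorption: solve (I - Q) t = 1 over the transient states i >= 1,
# which is equivalent to the recurrence relation for t_i discussed in the text
Q = P[1:, 1:]
t = np.linalg.solve(np.eye(N) - Q, np.ones(N))
print("average time to extinction starting from 1 infected:", round(t[0], 1))
```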
denoting by p_i(t) the probability that the disease disappears at time t when starting with i infected individuals at time 0, we may write t_i in terms of p_i(t) [69] . using the properties of p_i(t) we obtain the recurrence relation t_i = 1 + t⁻(i) t_{i−1} + t⁺(i) t_{i+1} + (1 − t⁻(i) − t⁺(i)) t_i for 0 < i < n (with t_0 = 0), whereas for t_n we may write t_n = 1 + t⁻(n) t_{n−1} + (1 − t⁻(n)) t_n, since t⁺(n) = 0. defining suitable auxiliary variables, a little algebra yields a closed expression for t_1, such that t_i can then be written as a function of t_1. the intrinsic stochasticity of the model, resulting from the finiteness of the population, makes the disease disappear from the population after a certain amount of time. as such, the population size plays an important role in the average time to absorption associated with a certain disease, a feature we shall return to below. equations (13.1) and (13.2) define the markov chain m just characterized. the fraction of time the population spends in each state is given by the stationary distribution of m, which is defined as the eigenvector associated with eigenvalue 1 of the transition matrix of m [67, 68] . the fact that in the sis model the state without infected (i = 0) is an absorbing state of the markov chain implies that the standard stationary distribution will be completely dominated by this absorbing state, which precludes one from gathering information on the relative importance of other configurations. this makes the so-called quasi-stationary distribution of m [70] the quantity of interest. this quantity allows us to estimate the relative prevalence of the population in configurations other than the absorbing state, by computing the stationary distribution of the markov chain obtained from m by excluding the absorbing state i = 0 [70] . it provides information on the fraction of time the population spends in each state, assuming the disease does not go extinct. the markov process m defined before provides a finite population analogue of the well-known mean-field equations written at the beginning of sect. 2.1.1. indeed, in the limit of large populations, τ₀ g(i) = t⁺(i) − t⁻(i) provides the rate of change of the number of infected individuals. for large n, replacing i/n by x and (n − i)/n by y, the gradient of infection, which characterizes the rate at which the number of infected is changing in the population, is given by τ₀ g(i) ≈ λ⟨k⟩xy − δx. again, we obtain two roots: g(i) = 0 for i = 0 and for i_{r₀} = n − (n − 1)δ/(λ⟨k⟩). moreover, i_{r₀} becomes the finite population equivalent of an interior equilibrium for r₀ ≡ (λ/δ)⟨k⟩ n/(n − 1) > 1 (note that, for large n, we have n/(n − 1) ≈ 1). the disease will most likely expand whenever i < i_{r₀}, the opposite happening otherwise. the si model is mathematically equivalent to the sis model with δ = 0, and has been employed to study, for instance, the dynamics of aids. the markov chain representing the disease dynamics is therefore defined by the transition matrix of eq. (13.3), with t⁻(i) = 0 for all i. the remaining transition probabilities t⁺(i) (0 < i < n) are exactly the same as for the sis model. since all t⁻(i) equal zero, the markov chain has two absorbing states: the canonical one without any infected (i = 0) and the one without any susceptible (i = n). the disease will expand monotonically as soon as one individual in the population gets infected, ultimately leading to a fully infected population. the average amount of time after which this happens, which we refer to as the average infection time, constitutes the main quantity of interest.
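two further quantities introduced above can be computed in the same spirit: the quasi-stationary distribution of the sis chain (the stationary distribution of the chain with the absorbing state removed) and the average infection time of the si model (δ = 0). the sketch below is again only a numerical illustration with arbitrary parameters, rebuilding the chain from eqs. (13.1)-(13.2) with τ₀ = 1.

```python
# quasi-stationary distribution of the sis chain and average infection time of
# the si model (delta = 0); tau_0 = 1 and parameter values are arbitrary.
import numpy as np

N, k_avg, lam = 100, 10, 0.03

def build_P(delta):
    Tm = lambda i: (i / N) * delta                              # eq. (13.1)
    Tp = lambda i: k_avg * ((N - i) / N) * (i / (N - 1)) * lam  # eq. (13.2)
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        if i > 0:
            P[i, i - 1] = Tm(i)
        if i < N:
            P[i, i + 1] = Tp(i)
        P[i, i] = 1.0 - Tm(i) - Tp(i)
    return P

# quasi-stationary distribution: stationary distribution of the chain restricted
# to i >= 1, i.e. the leading left eigenvector of that sub-matrix
P_sis = build_P(delta=0.2)
Q = P_sis[1:, 1:]
vals, vecs = np.linalg.eig(Q.T)
qsd = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
qsd /= qsd.sum()
print("most likely number of infected, conditional on non-extinction:",
      1 + int(np.argmax(qsd)))

# si model (delta = 0): the only reachable absorbing state is i = N; the average
# infection time solves the same kind of linear system over the states 1..N-1
P_si = build_P(delta=0.0)
Q_si = P_si[1:N, 1:N]
t_inf = np.linalg.solve(np.eye(N - 1) - Q_si, np.ones(N - 1))
print("average time until the whole population is infected, from 1 infected:",
      round(t_inf[0], 1))
```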
this quantity can be calculated analytically [28] : the average number of time steps needed to reach 100% infection, starting from i infected individuals, is given by eq. (13.9). with sir one models diseases in which individuals acquire immunity after recovering from infection. we distinguish three epidemiological states to model the dynamics of such diseases: susceptible (s), infected (i) and recovered (r), the latter indicating those who have become immune to further infection. the sir model in infinite, well-mixed populations is defined by a recovery rate δ and an infection rate λ. the fraction of infected individuals x changes in time according to the differential equation ẋ = λ⟨k⟩xy − δx (13.10), where y denotes the fraction of susceptible individuals, which in turn changes according to ẏ = −λ⟨k⟩xy (13.11). finally, the fraction of individuals z in the recovered class changes according to ż = δx (13.12). to address the sir model in finite, well-mixed populations, we proceed in a way similar to what we have done so far with the sis and si models. the markov chain describing the disease dynamics becomes slightly more complicated and has states (i, r), where i is the number of infected individuals in the population and r the number of recovered (and immune) individuals (i + r ≤ n). a schematic representation of the markov chain is given in fig. 13.2. note that the states (0, r), with 0 ≤ r ≤ n, are absorbing states. each of these states corresponds to the number of individuals that are (or have become) immune at the time the disease goes extinct. consider a population of size n with average degree ⟨k⟩. the number of infected will increase at a rate t⁺(i, r) = (⟨k⟩/τ₀) λ ((n − i − r)/n)(i/(n − 1)) (13.13) and decrease at a rate t⁻(i, r) = (1/τ₀)(i/n)δ (13.14), where τ₀ denotes the recovery time scale. as before, the gradient of infection g(i, r), such that τ₀ g(i, r) = t⁺(i, r) − t⁻(i, r), measures the likelihood for the disease to either expand or shrink in a given state, and is given by τ₀ g(i, r) = (i/n)[λ⟨k⟩(n − i − r)/(n − 1) − δ]; note that we recover eq. (13.10) in the limit n → ∞. for a fixed number of recovered individuals r₀, we have that g(i, r₀) = 0 for i = 0 and for i_{r₀} = n − r₀ − (n − 1)δ/(λ⟨k⟩), which becomes the finite population analogue of an interior equilibrium. furthermore, one can show that the partial derivative ∂g(i, r)/∂i has at most one root in the interior, possibly located at î_{r₀} = i_{r₀}/2 ≤ i_{r₀}. hence, g(i, r₀) reaches a local maximum at î_{r₀} (given that at that point ∂²g(i, r)/∂i² = −2λ⟨k⟩/(n(n − 1)) < 0). the number of infected will therefore most likely increase for i < i_{r₀} (assuming r₀ immune individuals), and most likely decrease otherwise. the gradient of infection also determines the probability to end up in each of the different absorbing states of the markov chain. these probabilities can be calculated analytically [28] . to this end, let us use y^a_{i,r} to denote the probability that the population ends up in the absorbing state with a recovered individuals, starting from a state with i infected and r recovered. we obtain the following recurrence relationship for y^a_{i,r}: y^a_{i,r} = t⁻(i, r) y^a_{i−1,r+1} + t⁺(i, r) y^a_{i+1,r} + (1 − t⁻(i, r) − t⁺(i, r)) y^a_{i,r} (13.16), which reduces to y^a_{i,r} = [t⁻(i, r) y^a_{i−1,r+1} + t⁺(i, r) y^a_{i+1,r}] / [t⁻(i, r) + t⁺(i, r)] (13.17). the boundary conditions y^a_{0,r} = 1 if r = a and 0 otherwise (13.18) allow us to compute y^a_{i,r} for every a, i and r. our network model explicitly considers a finite and constant population of n individuals. its temporal contact structure allows, however, for a variable number of overall links between individuals, which in turn will depend on the incidence of disease in the population.
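before turning to the network model, note that the system (13.16)-(13.18) for the absorption probabilities can be solved numerically by processing states in order of decreasing i + r (recovery conserves i + r, infection increases it). the sketch below does exactly that; the transition rates follow the forms of eqs. (13.13) and (13.14) as reconstructed above with τ₀ = 1, and all parameter values are arbitrary illustrations.

```python
# minimal sketch: probability that the sir chain is absorbed with a given number
# of recovered individuals, by dynamic programming over the recurrence (13.16)-(13.18).
# rates follow eqs. (13.13)-(13.14) with tau_0 = 1; parameters are arbitrary.
import numpy as np

N, k_avg, lam, delta = 60, 10, 0.03, 0.2

def T_plus(i, r):                    # infection: (i, r) -> (i+1, r)
    s = N - i - r
    return k_avg * lam * (s / N) * (i / (N - 1))

def T_minus(i, r):                   # recovery: (i, r) -> (i-1, r+1)
    return delta * (i / N)

def absorption_profile(i0=1, r0=0):
    """probability of ending with a recovered individuals, for every a = 0..N."""
    y = {}                                        # (i, r) -> vector over a
    for r in range(N + 1):                        # boundary (13.18): no infected left
        v = np.zeros(N + 1)
        v[r] = 1.0
        y[(0, r)] = v
    for m in range(N, 0, -1):                     # m = i + r, processed downwards
        for i in range(1, m + 1):
            r = m - i
            tp, tm = T_plus(i, r), T_minus(i, r)
            up = y[(i + 1, r)] if i + r < N else 0.0   # when i + r = N, tp = 0 anyway
            y[(i, r)] = (tm * y[(i - 1, r + 1)] + tp * up) / (tm + tp)
    return y[(i0, r0)]

profile = absorption_profile()
print("expected final number of recovered (final epidemic size):",
      round(float(profile @ np.arange(N + 1)), 1))
```

the returned vector gives, for each possible final number a of recovered individuals, the probability that the epidemic ends in the absorbing state (0, a), from which the expected final epidemic size follows directly.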
this way, infection proceeds along the links of a contact network whose structure may change based on each individual's health status and the availability of information regarding the health status of others. we shall assume the existence of some form of local information about the health status of social contacts. information is local, in the sense that individual behavior will rely on the nature of their links in the contact network. moreover, this will influence the way in which individuals may be more or less effective in avoiding contact with those infected while remaining in touch with the healthy. suppose all individuals seek to establish links at the same rate c. for simplicity, we assume that new links are established and removed randomly, a feature which usually does not always apply in real cases, where the limited social horizon of individuals or the nature of their social ties may constrain part of their neighborhood structure (see below). let us further assume that links may be broken off at different rates, based on the nature of the links and the information available about the individuals they connect: let us denote these rates by b pq for links of type pq (p , q 2 fs, i, rg. we assume that links are bidirectional, which means that we have links of pq types si, sr, and ir. let l pq denote the number of links of type pq and l m pq the maximum possible number of links of that type, given the number of individuals of type s, i and r in the population. this allows us to write down (at a mean-field level) a system of odes [57, 58] for the time evolution of the number of links of pq-type (l pq ) [57, 58] which depends on the number of individuals in states p and q (l m pp d p .p 1/ =2 and l m pq d pq for p ¤ q) and thereby couples the network dynamics to the disease dynamics. in the steady state of the linking dynamics ( p l pq d 0), the number of links of each type is given by l pq d ' pq l m pq , with ' pq d c/(c c b pq ) the fractions of active pq-links, compared to the maximum possible number of links l m pq , for a given number of s, i and r. in the absence of disease only ss links exist, and hence ss determines the average connectivity of the network under disease free conditions, which one can use to characterize the type of the population under study. in the presence of i individuals, to the extent that s individuals manage to avoid contact with i, they succeed in escaping infection. thus, to the extent that individuals are capable of reshaping the contact network based on available information of the health status of other individuals, disease progression will be inhibited. in the extreme limit of perfect information and individual capacity to immediately break up contacts with infected, we are isolating all infected, and as such containing disease progression. our goal here, however, is to understand how and in which way local information, leading to a temporal reshaping of the network structure, affects overall disease dynamics. we investigate the validity of the approximations made to derive analytical results as well as their robustness by means of computer simulations. all individual-based simulations start from a complete network of size nd100. disease spreading and network evolution proceed together under asynchronous updating. disease update events take place with probability (1 c ) 1 , where d net / dis . we define dis as the time-scale of disease progression, whereas net is the time scale of network change. 
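before the simulation details that follow, a small numerical illustration of the steady state of the linking dynamics just described: the expected number of active pq-links is φ_pq·l^m_pq with φ_pq = c/(c + b_pq), so the average degree of the network becomes a function of how many individuals are currently infected. the rates c, b_h and b_i below are arbitrary values chosen only to show the effect, with b_i > b_h encoding information about infected contacts.

```python
# steady state of the linking dynamics: expected average degree <k> as a function
# of the disease incidence. rate values are arbitrary illustrations.
from itertools import combinations_with_replacement

def average_degree(counts, c, b):
    """counts: e.g. {'s': 90, 'i': 10, 'r': 0}; b: break-up rate per link type."""
    links = 0.0
    for p, q in combinations_with_replacement(sorted(counts), 2):
        L_max = counts[p] * (counts[p] - 1) / 2 if p == q else counts[p] * counts[q]
        phi = c / (c + b[frozenset((p, q))])      # fraction of active pq-links
        links += phi * L_max
    N = sum(counts.values())
    return 2 * links / N

c, b_h, b_i = 0.05, 0.05, 0.8                     # links to infected are short-lived
b = {frozenset(pq): (b_i if 'i' in pq else b_h)
     for pq in [('s', 's'), ('s', 'i'), ('s', 'r'),
                ('i', 'i'), ('i', 'r'), ('r', 'r')]}

for infected in (0, 10, 30, 60):
    counts = {'s': 100 - infected, 'i': infected, 'r': 0}
    print(infected, "infected -> <k> =", round(average_degree(counts, c, b), 2))
```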
the parameter d net / dis provides the relative time scale in terms of which we may interpolate between the limits when network adaptation is much slower than disease progression ( ! 0) and the opposite limit when network adaptation is much faster than disease progression ( ! 1). since d net / dis is the only relevant parameter, we can make, without loss of generality, dis d 1. for network update events, we randomly draw two nodes from the population. if connected, then the link disappears with probability given by the respective b pq . otherwise, a new link appears with probability c. when a disease update event occurs, a recovery event takes place with probability (1 c hki) 1 , an infection event otherwise. in both cases, an individual j is drawn randomly from the population. if j is infected and a recovery event has been selected then j will become susceptible (or recovered, model dependent) with probability •. if j is susceptible and an infection event occurs, then j will get infected with probability oe if a randomly chosen neighbor of j is infected. the quasi-stationary distributions are computed (in the case of the sis model) as the fraction of time the population spends in each configuration (i.e., number of infected individuals) during 10 9 disease event updates (10 7 generations; under asynchronous updating, one generation corresponds to n update events, where n is the population size; this means that in one generation, every individual has one chance, on average, to update her epidemic state). the average number of infected hii and the mean average degree of the network hki observed during these 10 7 generations are kept for further plotting. we have checked that the results reported are independent of the initial number of infected in the network. finally, for the sir and si models, the disease progression in time, shown in the following sections, is calculated from 10 4 independent simulations, each simulation starting with 1 infected individual. the reported results correspond to the average amount of time at which i individuals become infected. in this section we start by (i) showing that a quickly adapting community induces profound changes in the dynamics of disease spreading, irrespective of the underlying epidemic model; then, (ii) we resort to computer simulations to study the robustness of these results for intermediate time-scales of network adaptation; finally, (iii) we profit from the framework introduced above to analyze the impact of information on average time for absorption and disease progression in adaptive networks. empirically, it is well-known that often individuals prevent infection by avoiding contact with infected once they know the state of their contacts or are aware of the potential risks of such infection [31, 33, [42] [43] [44] [45] [46] [47] [48] [49] [50] : such is the case of many sexually transmitted diseases [42, [71] [72] [73] , for example, and, more recently, the voluntary use of face masks and the associated campaigns adopted by local authorities in response to the sars outbreak [40, [43] [44] [45] or even the choice of contacting or not other individuals based on information on their health status gathered from social media [41, 74, 75] . in the present study, individual decision is based on available local information about the health state of one's contacts. 
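the asynchronous update scheme just described can be condensed into a short event loop. the sketch below is a simplified re-implementation of the sis variant for illustration only; it is not the authors' code, and the parameter values are arbitrary.

```python
# minimal sketch of the asynchronous coupled simulation (sis variant): disease and
# linking events are interleaved according to the relative time scale tau.
import random
import networkx as nx

def simulate_sis(N=100, lam=0.05, delta=0.2, c=0.05, b_h=0.05, b_i=0.8,
                 tau=0.1, steps=200_000, seed=1):
    rng = random.Random(seed)
    G = nx.complete_graph(N)                         # simulations start fully connected
    infected = {0}                                   # one infected individual initially

    def break_rate(u, v):                            # links to infected break faster
        return b_i if (u in infected or v in infected) else b_h

    for _ in range(steps):
        if rng.random() < 1.0 / (1.0 + tau):         # disease update event
            k_avg = 2 * G.number_of_edges() / N
            j = rng.randrange(N)
            if rng.random() < 1.0 / (1.0 + k_avg):   # recovery event
                if j in infected and rng.random() < delta:
                    infected.discard(j)
            else:                                    # infection event
                neighbours = list(G[j])
                if j not in infected and neighbours:
                    if rng.choice(neighbours) in infected and rng.random() < lam:
                        infected.add(j)
        else:                                        # network (linking) update event
            u, v = rng.sample(range(N), 2)
            if G.has_edge(u, v):
                if rng.random() < break_rate(u, v):
                    G.remove_edge(u, v)
            elif rng.random() < c:
                G.add_edge(u, v)
        if not infected:                             # absorbing state reached
            break
    return len(infected), 2 * G.number_of_edges() / N

print(simulate_sis())                                # (final #infected, final <k>)
```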
thus, we can study analytically the limit in which the network dynamics -resulting from adaptation to the flow of local information -is much faster than disease dynamics, as in this case, one may separate the time scales between network adaptation and contact (disease) dynamics: the network has time to reach a steady state before the next contact takes place. consequently, the probability of having an infected neighbor is modified by a neighborhood structure which will change in time depending on the impact of the disease in the population and the overall rates of severing links with infected. let us start with the sir model. the amount of information available translates into differences mostly between the break-up rates of links that may involve a potential risk for further infection (b si , b ir , b ii ), and those that do not (b ss , b sr , b rr ). therefore, we consider one particular rate b i for links involving infected individuals (b i á b si d b ir d b ii ), and another one, b h , for links connecting healthy . in general, one expects b i to be maximal when each individual has perfect information about the state of her neighbors and to be (minimal and) equal to b h when no information is available, turning the ratio between these two rates into a quantitative measure of the efficiency with which links to infected are severed compared to other links. note that we reduce the model to two break-up rates in order to facilitate the discussion of the results. numerical simulations show that the general principles and conclusions remain valid when all break-up rates are incorporated explicitly. it is worth noticing that three out of these six rates are of particular importance for the overall disease dynamics: b ss , b sr and b si . these three rates, combined with the rate c of creating new links, define the fraction of active ss, sr and si links, and subsequent correlations between individuals [76] , and therefore determine the probability for a susceptible to become infected (see models and methods). this probability will increase when considering higher values of c (assuming b i > b h ). in other words, when individuals create new links more often, therefore increasing the likelihood of establishing connections to infected individuals (when present), they need to be better informed about the health state of their contacts in order to escape infection. in the fast linking limit, the other three break-up rates (b ii , b ir and b rr ) will also influence disease progression since they contribute to changing the average degree of the network. when the time scale for network update ( net ) is much smaller than the one for disease spreading ( dis ), we can proceed analytically using at profit the separation of times scales. in practice, this means that the network has time to reach a steady state before the next disease event takes place. consequently, the probability of having an infected neighbor is modified by a neighborhood structure which will change in time depending on the impact of the disease in the population and the overall rates of severing links with infected individuals. for a given configuration (i,r) of the population, the stationary state of the network is characterized by the parameters ' ss , ' si and ' sr . consequently, the number of infected increases at a rate [28] where we made 0 d 1. the effect of the network dynamics becomes apparent in the third factor, which represents the probability that a randomly selected neighbor of a susceptible is infected. 
in addition, eq. (13.14) remains valid, as the linking dynamics does not affect the rate at which the number of infected decreases. it is noteworthy that we can write eq. (13.19) in the form which is formally equivalent to eq. (13.13) and shows that disease spreading in a temporal adaptive network is equivalent to that in a well-mixed population with (i) a frequency dependent average degree hki and (ii) a transmission probability that is rescaled compared to the original according to note that this expression remains valid for both sir, sis (r d 0) and si (ı d 0, r d 0) models. since the lifetime of a link depends on its type, the average degree hki of the network depends on the number of infected in the population, and hence becomes frequency (and time) dependent, as hki depends on the number of infected (through l m pq ) and changes in time. note that á scales linearly with the frequency of infected in the population, decreasing as the number of infected increases (assuming ss ı si > 1); moreover, it depends implicitly (via the ratio ss ı si ) on the amount of information available. it is important to stress the distinction between the description of the disease dynamics at the local level (in the vicinity of an infected individual) and that at the population wide level. strictly speaking, a dynamical network does not change the disease dynamics at the local level, meaning that infected individuals pass the disease to their neighbors with probability intrinsic to the disease itself. at the population level, on the other hand, disease progression proceeds as if the infectiousness of the disease effectively changes, as a result of the network dynamics. consequently, analyzing a temporal network scenario at a population level can be achieved via a renormalization of the transmission probability, keeping the (mathematically more attractive) well-mixed scenario. in this sense, from a well-mixed perspective, dynamical networks contribute to changing the effective infectiousness of the disease, which becomes frequency and information dependent. note further that this information dependence is a consequence of using a single temporal network for spreading the disease and information. interestingly, adaptive networks have been shown to have a similar impact in social dilemmas [63] . from a global, population-wide perspective, it is as if the social dilemma at stake differs from the one every individual actually plays. as in sect. 2, one can define a gradient of infection g, which measures the tendency of the disease to either expand or shrink in a population with given configuration (defined by the number of individuals in each of the states s, i and r). to do so, we study the partial derivative @g.i;r/ @i at i d 0 this quantity exceeds zero whenever note that taking r d 0 yields the basic reproductive ratio r a 0 for both sir and sis: on the other hand, whenever r a 0 < 1, eradication of the disease is favored in the sis model (g(i)<0), irrespective of the fraction of infected, indicating how the presence of information (b h < b i ) changes the basic reproductive ratio. in fig. 13.3 we illustrate the role of information in the sis model by plotting g for different values of b i (assuming b h < b i ) and a fixed transmission probability . the corresponding quasi-stationary distributions are shown in the right panel and clearly reflect the sign of g. whenever g(i) is positive (negative), the dynamics will act to increase (decrease), on average, the number of infected. 
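as a numerical illustration of the fast-linking limit for the sis case (r = 0), the sketch below evaluates the gradient of infection when the neighbourhood of a susceptible is weighted by the steady-state link fractions φ_pq. the precise weighting used here is a plausible reconstruction of eq. (13.19) and should be treated as an assumption rather than the chapter's exact expression; all parameter values are arbitrary.

```python
# gradient of infection for the sis model in the fast-linking limit (r = 0).
# the phi-weighted neighbour probability is an assumed reconstruction of eq. (13.19).
import numpy as np

N, lam, delta = 100, 0.05, 0.2
c, b_h, b_i = 0.05, 0.05, 0.8
PHI_SS, PHI_SI = c / (c + b_h), c / (c + b_i)

def gradient(i, phi_ss=PHI_SS, phi_si=PHI_SI):
    s = N - i
    if i == 0 or s == 0:
        return 0.0
    # frequency-dependent average degree of the adaptive network
    links = phi_ss * s * (s - 1) / 2 + phi_si * s * i + phi_si * i * (i - 1) / 2
    k_avg = 2 * links / N
    # probability that a randomly chosen neighbour of a susceptible is infected
    p_inf_neighbour = phi_si * i / (phi_si * i + phi_ss * (s - 1))
    T_plus = k_avg * lam * (s / N) * p_inf_neighbour
    T_minus = delta * (i / N)
    return T_plus - T_minus

static = [gradient(i, phi_ss=1.0, phi_si=1.0) for i in range(N + 1)]
adaptive = [gradient(i) for i in range(N + 1)]
for label, g in (("static complete graph", static), ("adaptive network", adaptive)):
    root = next((i for i in range(1, N) if g[i] > 0 and g[i + 1] <= 0), None)
    print(label, "| invasion favoured:", g[1] > 0, "| interior root near i =", root)
```

with φ_ss/φ_si > 1 the infected term in the neighbour probability is down-weighted, which gives a concrete sense of why the effective infectiousness decreases as the amount of information available to susceptible individuals increases.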
figure 13 population and, once again, allows us to identify when disease expansion will be favored or not. figure 13 .4 gives a complete picture of the gradient of infection, using the appropriate simplex structure in which all points satisfy the relation icrcsdn. the dashed line indicates the boundary g(i, r) d 0 in case individuals do not have any information about the health status of their contacts, i.e., links that involve infected individuals disappear at the same rate as those that do not (b i d b h ). disease expansion is more likely than disease contraction (g(i, r) > 0) when the population is in a configuration above the line, and less likely otherwise. similarly, the solid line indicates the boundary g(i, r) d 0 when individuals share information about their health status, and use it to avoid contact with infected. once again, the availability of information modifies the disease dynamics, inhibiting disease progression for a broad range of configurations. up to now we have assumed that the network dynamics proceeds much faster than disease spreading (the limit ! 0). this may not always be the case, and hence it is important to assess the domain of validity of this limit. in the following, we use computer simulations to verify to which extent these results, obtained analytically via time scale separation, remain valid for intermediate values of the relative timescale for the linking dynamics. we start with a complete network of size n, in which initially one individual is infected, the rest being susceptible. as stated before, disease spreading and network evolution proceed simultaneously under asynchronous updating. network update events take place with probability (1 c ) 1 , whereas a disease model (si, sis or sir) state update event occurs otherwise. for each value of , we run 10 4 simulations. for the si model, the quantity of interest to calculate is the average number of generations after which the population becomes completely infected. these values are depicted in fig. 13 .5. the lower dashed line indicates the analytical prediction of the infection time in the limit ! 1 (the limit when networks remain static), which we already recover in the simulations for > 10 2 . when is smaller than 10 2 , the average infection time significantly increases, and already reaches the analytical prediction for the limit ! 0 (indicated by the upper dashed line) when < 1. hence, the validity of the time scale separation does again extend well beyond the limits one might expect. for the sir model, we let the simulations run until the disease goes extinct, and computed the average final fraction of individuals that have been affected by is given by eqs. (13.17) and (13.18) . one observes that linking dynamics does not affect disease dynamics for > 10. once drops below ten, a significantly smaller fraction of individuals is affected by the disease. this fraction reaches the analytical prediction for ! 0 as soon as < 0.1. hence, and again, results obtained via separation of time scales remain valid for a wide range of intermediate time scales. we finally investigate the role of intermediate time scales in the sis model. we performed computer simulations in the conditions discussed already, and computed several quantities that we plot in fig. 13.7 . figure 13 .7 shows the average hii of the quasi-stationary distributions obtained via computer simulations (circles) as a function of the relative time scale of network update. whenever ! 
1, we can characterize the disease dynamics analytically, assuming a well-mixed population (complete graph), whereas for ! 0 we recover the analytical results obtained in the fast linking limit. at intermediate time scales, fig. 13 .7 shows that as long as is smaller than ten, network dynamics contributes to inhibit disease spreading by effectively increasing the critical infection rate. overall, the validity of the time scale separation extends well beyond the limits one might anticipate based solely on the time separation ansatz. as long as the time scale for network update is smaller than the one for disease spreading ( < 1), the analytical prediction for the limit ! 0, indicated by the lower dashed line in fig. 13.7 , remains valid. the analytical result in the extreme opposite limit ( ! 1), indicated by the upper dashed line in fig. 13 .7, holds as long as > 10 5 . moreover, it is noteworthy that the network dynamics influences the disease dynamics both by reducing the frequency of interactions between susceptible and infected, and by reducing the average degree of the network. these complementary effects are disentangled in intermediate regimes, in which the network dynamics is too slow to warrant sustained protection of susceptible individuals from contacts with infected, despite managing to reduce the average degree (not shown). in fact, for > 10 the disease dynamics is mostly controlled by the average degree, as shown by the solid lines in fig. 13.7 . here, the average stationary distribution was determined by replacing, in the analytic expression for static networks, hki by the time-dependent average connectivity hki computed numerically. this, in turn, results from the frequency dependence of hki. when b i > b h , the network will reshape into a configuration with smaller hki as soon as the disease expansion occurs. for < 1, hki reflects the lifetime of ss links, as there are hardly any infected in the population. for 10 0 < < 10 3 , the network dynamics proceeds fast enough to reduce hki, but too slowly to reach its full potential in hindering disease progression. given the higher fraction of infected, and the fact that si and ii links have a shorter lifetime than ss links, the average degree drops when increasing from 1 to 10 3 . any further increase in leads to a higher average degree, as the network approaches its static limit. contrary to the deterministic sis model, the stochastic nature of disease spreading in finite populations ensures that the disease disappears after some time. however, this result is of little relevance given the times required to reach the absorbing state (except, possibly, in very small communities). indeed, the characteristic time scale of the dynamics plays a determinant role in the overall epidemiological process and constitutes a central issue in disease spreading. figure 13 .8 shows the average time to absorption t 1 in adaptive networks for different levels of information, illustrating the spectacular effect brought about by the network dynamics on t 1 . while on networks without information (b i d b h ) t 1 rapidly increases with the rate of infection oe, adding information moves the fraction of infected individuals rapidly to the absorbing state, and, therefore, to the disappearance of the disease. moreover, the size of the population can have a profound effect on t 1 . with increasing population size, the population spends most of the time in the vicinity of the state associated with the interior root of g(i). 
for large populations, this acts to reduce the intrinsic stochasticity of the dynamics, dictating a very slow extinction of the disease, as shown in fig. 13.9 . when recovery from the disease is impossible, a situation captured by the si model, the population will never become disease-free again once it acquires at least one infected individual. the time to reach absorbing state in which all individuals are infected, again depends on the presence of information. when information prevails, susceptible individuals manage to resist infection for a long time, thereby delaying the rapid progression of the disease, as shown in the inset of fig. 13.10 . naturally, the average number of generations needed to reach a fully infected population increases with the availability of information, as illustrated in the main panel of fig. 13.10 . making use of three standard models of epidemics involving a finite population in which infection takes place along the links of a temporal graph, the nodes of which are occupied by individuals, we have shown analytically that the bias introduced into the graph dynamics resulting from the availability of information about the health status of others in the population induces fundamental changes in the overall dynamics of disease progression. the network dynamics employed here differs from those used in most other studies [29, [32] [33] [34] [35] [36] [51] [52] [53] [54] [55] . we argue, however, that the differences obtained stem mostly from the temporal aspect of the network, and not so much from the detailed dynamics that is implemented. importantly, temporal network dynamics leads to additional changes in r 0 compared to those already obtained when moving from the well-mixed assumption to static networks [77] . an important ingredient of our model, however, is that the average degree of the network results from the selforganization of the network structure, and co-evolves with the disease dynamics. a population suffering from high disease prevalence where individuals avoid contact in order to escape infection will therefore exhibit a lower average degree than a population with hardly any infected individuals. such a frequency-dependent average degree further prevents that containment of infected individuals would result in the formation of cliques of susceptible individuals, which are extremely vulnerable to future infection, as reported before [36, 51, 54] . the description of disease spreading as a stochastic contact process embedded in a markov chain constitutes a second important ingredient of the present model. this approach allows for a direct comparison between analytical predictions and individual-based computer simulations, and for a detailed analysis of finite-size effects and convergence times, whose exponential growth will signal possible bistable disease scenarios. in such a framework, we were able to show that temporal adaptive networks in which individuals may be informed about the health status of others lead to a disease whose effective infectiousness depends on the overall number of infected in the population. in other words, disease propagation on temporal adaptive networks can be seen as mathematically equivalent to disease spreading on a well-mixed population, but with a rescaled effective infectiousness. 
in accord with the intuition advanced in the introduction, as long as individuals react promptly and consistently to accurate available information on whether their acquaintances are infected or not, network dynamics effectively weakens the disease burden the population suffers. last but not least, if recovery from the disease is possible, the time for disease eradication drastically reduces whenever individuals have access to accurate information about the health state of their acquaintances and use it to avoid contact with those infected. if recovery or immunity is impossible, the average time needed for a disease to spread increases significantly when such information is being used. in both cases, our model clearly shows how availability of information hinders disease progression (by means of quick action on infected, e.g., their containment via link removal), which constitutes a crucial factor to control the development of global pandemics. finally, it is also worth mentioning that knowledge about the health state of others may not always be accurate or available in time. this is for instance the case for diseases where recently infected individuals remain asymptomatic for a substantial period. the longer the incubation period associated with the disease, the less successful individuals will be in escaping infection, which in our model translates into a lower effective rate of breaking si links, with the above mentioned consequences. moreover, different (social) networks through which awareness of the health status of others proceeds may lead to different rates of information spread. one may take these features into account by modeling explicitly the spread of information through a coupled dynamics between disease expansion and individuals' awareness of the disease [31, 33] . creation and destruction of links may for instance not always occur randomly, as we assumed here, but in a way that is biased by a variety of factors such as social and genetic distance, geographical proximity, family ties, etc. the resulting contact network may therefore become organized in a specific way, promoting the formation of particular structures, such as networks characterized by long-tailed degree distributions or with strong topological correlations among nodes [3, [78] [79] [80] which, in turn, may influence the disease dynamics. the impact of combining such effects, resulting from specific disease scenarios, with those reported here will depend on the prevalence of such additional effects when compared to linkrewiring dynamics. a small fraction of non-random links, or of ties which cannot be broken, will likely induce small modifications on the average connectivity of the contact network, which can be incorporated in our analytic expressions without compromising their validity regarding population wide dynamics. on the other hand, when the contact network is highly heterogeneous (e.g., exhibiting pervasive long-tail degree distributions), non-random events may have very distinct effects, from being almost irrelevant (and hence can be ignored) to inducing hierarchical cascades of infection [81] , in which case our results will not apply. modeling infectious diseases in humans and animals infectious diseases in humans evolution of networks: from biological nets to the internet and www dynamical processes in complex networks epidemic processes in complex networks small worlds: the dynamics of networks between order and randomness epidemiology. 
how viruses spread among computers and people epidemic spreading and cooperation dynamics on homogeneous small-world networks network structure and the biology of populations how to estimate epidemic risk from incomplete contact diaries data? quantifying social contacts in a household setting of rural kenya using wearable proximity sensors epidemic risk from friendship network data: an equivalence with a non-uniform sampling of contact networks spatiotemporal spread of the 2014 outbreak of ebola virus disease in liberia and the effectiveness of non-pharmaceutical interventions: a computational modelling analysis the basic reproduction number as a predictor for epidemic outbreaks in temporal networks information content of contact-pattern representations and predictability of epidemic outbreaks birth and death of links control disease spreading in empirical contact networks influenza a (h7n9) and the importance of digital epidemiology predicting and controlling infectious disease epidemics using temporal networks. f1000 prime reports localization and spreading of diseases in complex networks the global obesity pandemic: shaped by global drivers and local environments a highresolution human contact network for infectious disease transmission dynamics and control of diseases in networks with community structure modelling the influence of human behaviour on the spread of infectious diseases: a review a guide to temporal networks temporal networks empirical temporal networks of face-to-face human interactions exploiting temporal network structures of human interaction to effectively immunize populations adaptive contact networks change effective disease infectiousness and dynamics rewiring for adaptation adaptive networks: coevolution of disease and topology endemic disease, awareness, and local behavioural response contact switching as a control strategy for epidemic outbreaks the spread of awareness and its impact on epidemic outbreaks infection spreading in a population with evolving contacts fluctuating epidemics on adaptive networks adaptive coevolutionary networks: a review long-standing influenza vaccination policy is in accord with individual self-interest but not with the utilitarian optimum modeling the worldwide spread of pandemic influenza: baseline case and containment interventions forecast and control of epidemics in a globalized world public health measures to control the spread of the severe acute respiratory syndrome during the outbreak in toronto digital epidemiology the responsiveness of the demand for condoms to the local prevalence of aids influenza pandemic: perception of risk and individual precautions in a general population impacts of sars on health-seeking behaviors in general population in hong kong capturing human behaviour knowledge of malaria, risk perception, and compliance with prophylaxis and personal and environmental preventinve measures in travelers exiting zimbabwe from harare and victoria falls international airport meta-analysis of the relationship between risk perception and health behavior: the example of vaccination risk compensation and vaccination: can getting vaccinated cause people to engage in risky behaviors? 
public perceptions, anxiety, and behaviour change in relation to the swine flu outbreak: cross sectional telephone survey early assessment of anxiety and behavioral response to novel swineorigin influenza a(h1n1) epidemic dynamics on an adaptive network susceptible-infected-recovered epidemics in dynamic contact networks disease spreading with epidemic alert on small-world networks robust oscillations in sis epidemics on adaptive networks: coarse graining by automated moment closure coevolutionary cycling of host sociality and pathogen virulence in contact networks cooperation prevails when individuals adjust their social ties coevolution of strategy and structure in complex networks with dynamical linking active linking in evolutionary games repeated games and direct reciprocity under active linking reacting differently to adverse ties promotes cooperation in social networks selection pressure transforms the nature of social dilemmas in adaptive networks origin of peer influence in social networks linking individual and collective behavior in adaptive social networks uses and abuses of mathematics in biology a contribution to the mathematical theory of epidemics production of resistant hiv mutants during antiretroviral therapy a first course in stochastic processes stochastic processes in physics and chemistry fixation of strategies for an evolutionary game in finite populations on the quasi-stationary distribution of the stochastic logistic epidemic men's behavior change following infection with a sexually transmitted disease an examination of the social networks and social isolation in older and younger adults living with hiv/aids social stigmatization and hepatitis c virus infection assessing vaccination sentiments with online social media: implications for infectious disease dynamics and control social and news media enable estimation of epidemiological patterns early in the 2010 haitian cholera outbreak the effects of local spatial structure on epidemiological invasions epidemic processes over adaptive state-dependent networks classes of small-world networks statistical mechanics of complex networks the structure and function of complex networks velocity and hierarchical spread of epidemic outbreaks in scale-free networks key: cord-225177-f7i0sbwt authors: pastor-escuredo, david; tarazona, carlota title: characterizing information leaders in twitter during covid-19 crisis date: 2020-05-14 journal: nan doi: nan sha: doc_id: 225177 cord_uid: f7i0sbwt information is key during a crisis such as the current covid-19 pandemic as it greatly shapes people opinion, behaviour and even their psychological state. it has been acknowledged from the secretary-general of the united nations that the infodemic of misinformation is an important secondary crisis produced by the pandemic. infodemics can amplify the real negative consequences of the pandemic in different dimensions: social, economic and even sanitary. for instance, infodemics can lead to hatred between population groups that fragment the society influencing its response or result in negative habits that help the pandemic propagate. on the contrary, reliable and trustful information along with messages of hope and solidarity can be used to control the pandemic, build safety nets and help promote resilience and antifragility. we propose a framework to characterize leaders in twitter based on the analysis of the social graph derived from the activity in this social network. 
centrality metrics are used to identify relevant nodes that are further characterized in terms of users' parameters managed by twitter. we then assess the resulting topology of clusters of leaders. although this tool may be used for surveillance of individuals, we propose it as the basis for a constructive application to empower users with a positive influence in the collective behaviour of the network and the propagation of information. misinformation and fake news are a recurrent problem of our digital era [1] [2] [3] . the volume of misinformation and its impact grows during large events, crises and hazards [4] . when misinformation turns into a systemic pattern it becomes an infodemic [5, 6] . infodemics are frequent especially in social networks, which are distributed systems of information generation and spreading. for this to happen, the content is not the only variable: the structure of the social network and the behavior of relevant people greatly contribute [6] . during a crisis such as the current covid-19 pandemic, information is key as it greatly shapes people's opinion, behaviour and even their psychological state [7] [8] [9] . however, the greater the impact the greater the risk [10] . it has been acknowledged by the secretary-general of the united nations that the infodemic of misinformation is an important secondary crisis produced by the pandemic. during a crisis, time is critical, so people need to be informed at the right time [11, 12] . furthermore, information during a crisis leads to action, so the population needs to be properly informed to act right [13] . thus, infodemics can amplify the real negative consequences of the pandemic in different dimensions: social, economic and even sanitary. for instance, infodemics can lead to hatred between population groups [14] that fragments the society, influencing its response, or result in negative habits that help the pandemic propagate. on the contrary, reliable and trustful information along with messages of hope and solidarity can be used to control the pandemic, build safety nets and help promote resilience and antifragility. to fight misinformation and hate speech, content-based filtering is the most common approach taken [6, [15] [16] [17] . the availability of deep learning tools makes this task easier and scalable [18] [19] [20] . also, positioning in search engines is key to ensure that misinformation does not dominate the most relevant results of the searches. however, in social media, besides content, people's individual behavior and network properties, dynamics and topology are other relevant factors that determine the spread of information through the network [21] [22] [23] . we propose a framework to characterize leaders in twitter based on the analysis of the social graph derived from the activity in this social network [24] . centrality metrics are used to identify relevant nodes that are further characterized in terms of users' parameters managed by twitter [25] [26] [27] [28] [29] . although this tool may be used for surveillance of individuals, we propose it as the basis for a constructive application to empower users with a positive influence in the collective behaviour of the network and the propagation of information [27, 30] . tweets were retrieved using the real-time streaming api of twitter. two concurrent filters were used for the streaming: location and keywords.
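the exact streaming client and credentials are not given in the text, so rather than guessing the twitter api surface, the sketch below only illustrates the dual location/keyword filter applied to already-retrieved tweet objects; the field names follow the conventional tweet json layout, and the bounding-box and keyword values are placeholders rather than values from the paper (the actual bounding box is specified next).

```python
def in_bbox(lon, lat, bbox):
    """bbox = (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

def passes_filters(tweet, bbox, keywords):
    """keep a tweet if it is geolocated inside the bounding box
    or if its text mentions any of the tracked keywords."""
    coords = (tweet.get("coordinates") or {}).get("coordinates")  # [lon, lat]
    if coords and in_bbox(coords[0], coords[1], bbox):
        return True
    text = (tweet.get("text") or "").lower()
    return any(kw.lower() in text for kw in keywords)

# illustrative values only, not taken from the paper
MADRID_BBOX = (-3.89, 40.31, -3.52, 40.56)
KEYWORDS = ["covid", "coronavirus"]

def filter_stream(tweets):
    """yield the tweets that satisfy either of the two concurrent filters."""
    for tw in tweets:
        if passes_filters(tw, MADRID_BBOX, KEYWORDS):
            yield tw
```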
location was restricted to a bounding box enclosing the city of madrid [-3.7475842804. each tweet was analyzed to extract mentioned users, retweeted users, quoted users or replied users. for each of these events the corresponding nodes were added to an undirected graph, as well as a corresponding edge initializing the edge property "flow". if the edge already existed, the property "flow" was incremented. this procedure was repeated for each tweet registered. the network was completed by adding the property "inverse flow", that is 1/flow, to each edge. the resulting network featured 107544 nodes and 116855 edges. to compute centrality metrics the network described above was filtered. first, users with a node degree (number of edges connected to the node) less than a given threshold (experimentally set to 3) were removed from the network, as well as the edges connected to those nodes. the reason for this filtering was to reduce computation cost, as algorithms for centrality metrics are computationally expensive, and also to remove poorly connected nodes, since the network built comes from sparse data (retweets, mentions and quotes). however, it is desirable to minimize the amount of filtering performed in order to study large-scale properties within the network. the resulting network featured 15845 nodes and 26837 edges. additionally, the network was filtered to be connected, which is a requirement for the computation of several of the centrality metrics described below. for this purpose the connected subnetworks were identified, and the largest connected subnetwork was selected as the target network for analysis. the resulting network featured 12006 nodes and 25316 edges. several centrality metrics were computed: cfbetweenness, betweenness, closeness, cfcloseness, eigenvalue, degree and load. each of these centrality metrics highlights a specific relevance property of a node with regard to the whole flow through the network. descriptor explanations are summarized in table 1 . besides the network-based metrics, twitter users' parameters were collected: followers, following and favorites, so that the relationships with the relevance metrics could be assessed. we applied several statistical tools to characterize users in terms of the relevance metrics. we also implemented visualizations of different variables and of the network for a better understanding of leading-node characterization and topology. we compared the relevance in the network derived from the centrality metrics with the users' profile variables of twitter: number of followers, number of following and retweet count. figure 1 shows a scatter plot matrix among all variables. the principal diagonal of the figure shows the distribution of each variable, which is typically characterized by a high concentration at low values and a very long tail. these distributions imply that a few nodes concentrate most of the relevance within the network. more surprisingly, the same distributions are observed for twitter users' parameters such as number of followers or friends (following). the load centrality of a node is the fraction of all shortest paths that pass through that node. load centrality is slightly different from betweenness. the scatter plots show that there is no significant correlation between variables, except for the pair betweenness and load centralities, as expected because they have similar definitions.
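a sketch of the pipeline described above (interaction-graph construction with a "flow" edge property, degree filtering, restriction to the largest connected component, and the seven centrality metrics) using networkx; the mapping of the paper's metric names to networkx functions, and whether the current-flow metrics should be weighted by flow or inverse flow, are assumptions, as the text does not say.

```python
import networkx as nx

def build_interaction_graph(interactions):
    """interactions: iterable of (user_a, user_b) pairs extracted from
    mentions, retweets, quotes and replies in each tweet."""
    g = nx.Graph()
    for a, b in interactions:
        if g.has_edge(a, b):
            g[a][b]["flow"] += 1
        else:
            g.add_edge(a, b, flow=1)
    for _, _, d in g.edges(data=True):
        d["inverse_flow"] = 1.0 / d["flow"]
    return g

def filter_graph(g, min_degree=3):
    """drop weakly attached nodes, then keep the largest connected component."""
    keep = [n for n, deg in g.degree() if deg >= min_degree]
    h = g.subgraph(keep).copy()
    giant = max(nx.connected_components(h), key=len)
    return h.subgraph(giant).copy()

def centrality_table(g):
    """one dictionary per relevance metric, keyed by user id."""
    return {
        "degree": nx.degree_centrality(g),
        "betweenness": nx.betweenness_centrality(g, weight="inverse_flow"),
        "closeness": nx.closeness_centrality(g, distance="inverse_flow"),
        "eigenvector": nx.eigenvector_centrality(g, max_iter=1000),
        "load": nx.load_centrality(g),
        "cf_betweenness": nx.current_flow_betweenness_centrality(g, weight="flow"),
        "cf_closeness": nx.current_flow_closeness_centrality(g, weight="flow"),
    }
```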
this fact is remarkable as different centrality metrics provide a different perspective of leading nodes within the network and it does not necessarily correlates with the amount of related users, but also in the content dynamics. users were ranked using on variable as the reference. figure 2 shows the ranking resulting from using the eigenvalue centrality as the reference. the values were saturated to the percentile 95 of the distribution to improve visualization and avoid the effect of single values with very out of range values. this visualization confirms the lack of correlation between variables and the highly asymmetric distribution of the descriptors. figure 3 summarizes the values of each leader for each descriptor showing that even within the top ranked leaders there is a very large variability. this means that some nodes are singular events within the network that require further analysis to be interpreted, as they could be leaders in society or just a product of the network dynamics. figure 4 shows the ranking resulting from using current flow betweenness centrality as the reference. in this cases, the distribution of this reference variable is smoother and shows a more gradual behavior of leaders. to assess how the nodes with high relevance are distributed with projected the network into graphs by selecting the subgraph of nodes with a certain level of relevance (threshold on the network). the resulting network graphs may not be therefore connected. the eigenvalue-ranked graph shows high connectivity and very big nodes (see fig. 5 ). this is consistent with the definition of eigenvalue centrality that highlights how a node is connected to nodes that are also highly connected. this structure has implications in the reinforcement of specific messages and information within high connected clusters which can act as promoters of solutions or may become lobbies of information. the current flow betweenness shows an unconnected graph which is very interesting as decentralized nodes play a key role in transporting information through the network (see fig. 6 ). the current flow closeness shows also an unconnected graph which means that the social network is rather homogeneously distributed overall with parallel communities of information that do not necessarily interact with each other (see fig. 7 ). by increasing the size of the graph more clusters can be observed, specially in the eigenvalue-ranked network (fig. 8) . some clusters also appear for the current flow betweenness and current flow closeness (see fig.9 and 10). these clusters may have a key role in establishing bridges between different communities of practice, knowledge or region-determined groups. as the edges of the network are characterized in terms of flows between users, these bridges can be understood in terms of volume of information between communities. the distributions of the centrality metrics indicate that there are some nodes with massive relevance. these nodes can be seen as events within the flow of communication through the network [23] that require further contextualization to be interpreted. these nodes can propagate misinformation or make news or messages viral. further research is required to understand the cause of this massive relevance events, for instance, if it is related to a relevant concept or message or whether it is an emerging event of the network dynamics and topology. another way to assess these nodes is if they are consistently behaving this way along time or they are a temporal event. 
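a small helper, following the projection idea above, for inducing the subgraph of the top-ranked nodes under one relevance metric and inspecting whether the leaders form a single connected cluster or several; the cut-off of 100 nodes is an arbitrary assumption, and scores can be any of the centrality dictionaries computed in the earlier sketch.

```python
import networkx as nx

def leader_subgraph(g, scores, top_k=100):
    """induce the subgraph of the top_k nodes of a relevance ranking and
    report the sizes of its connected components."""
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    sub = g.subgraph(top).copy()
    comps = sorted(nx.connected_components(sub), key=len, reverse=True)
    return sub, [len(c) for c in comps]

# usage sketch:
# sub, sizes = leader_subgraph(g, centrality_table(g)["eigenvector"], top_k=100)
# print("component sizes of the eigenvector-ranked leader graph:", sizes)
```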
also, it may be necessary to contextualize with the type of content they normally spread to understand their exceptional relevance. besides the existence of massive relevance nodes, the quantification and understanding of the distribution of high relevant nodes has a lot of potential applications to spread messages to reach a wide number of users within the network. current flow betweenness particularly seems a good indicator to identify nodes to create a safety net in terms of information and positive messages. the distribution of the nodes could be approached for the general network or for different layers or subnetworks, isolated depending on several factors: type of interaction, type of content or some other behavioral pattern. experimental work is needed to test how a message either positive or negative spreads when started at one of the relevant nodes or close to the relevant nodes. for this purpose we are working towards integrating a network of concepts and the network of leaders. understanding the dynamics of narratives and concept spreading is key for a responsible use of social media for building up resilience against crisis. we also plan to make interactive graph visualization to browse the relevance of the network and dynamically investigate how relevant nodes are connected and how specific parts of the graph are ranked to really understand the distribution of the relevance variables as statistical parameters are not suitable to characterize a common pattern. it is necessary to make a dynamic ethical assessment of the potential applications of this study. understanding the network can be used to control purposes. however, we consider it is necessary that social media become the basis of pro-active response in terms of conceptual content and information. digital technologies must play a key role on building up resilience and tackle crisis. fake news detection on social media: a data mining perspective the science of fake news fake news and the economy of emotions: problems, causes, solutions. digital journalism social media and fake news in the 2016 election viral modernity? epidemics, infodemics, and the 'bioinformational'paradigm how to fight an infodemic. the lancet the covid-19 social media infodemic corona virus (covid-19)"infodemic" and emerging issues through a data lens: the case of china infodemic": leveraging high-volume twitter data to understand public sentiment for the covid-19 outbreak infodemic and risk communication in the era of cov-19 information flow during crisis management: challenges to coordination in the emergency operations center the signal code: a human rights approach to information during crisis quantifying information flow during emergencies measuring political polarization: twitter shows the two sides of venezuela false news on social media: a data-driven survey hate speech detection: challenges and solutions an emotional analysis of false information in social media and news articles declare: debunking fake news and false claims using evidence-aware deep learning csi: a hybrid deep model for fake news detection a deep neural network for fake news detection dynamical strength of social ties in information spreading impact of human activity patterns on the dynamics of information diffusion efficiency of human activity on information spreading on twitter multiple leaders on a multilayer social media the ties that lead: a social network approach to leadership. 
the leadership quarterly detecting opinion leaders and trends in online social networks exploring the potential for collective leadership in a newly established hospital network who takes the lead? social network analysis as a pioneering tool to investigate shared leadership within sports teams discovering leaders from community actions analyzing world leaders interactions on social media we would like to thank the center of innovation and technology for development at technical university madrid for support and valuable input, specially to xose ramil, sara romero and mónica del moral. thanks also to pedro j. zufiria, juan garbajosa, alejandro jarabo and carlos garcía-mauriño for collaboration. key: cord-025838-ed6itb9u authors: aljubairy, abdulwahab; zhang, wei emma; sheng, quan z.; alhazmi, ahoud title: siotpredict: a framework for predicting relationships in the social internet of things date: 2020-05-09 journal: advanced information systems engineering doi: 10.1007/978-3-030-49435-3_7 sha: doc_id: 25838 cord_uid: ed6itb9u the social internet of things (siot) is a new paradigm that integrates social network concepts with the internet of things (iot). it boosts the discovery, selection and composition of services and information provided by distributed objects. in siot, searching for services is based on the utilization of the social structure resulted from the formed relationships. however, current approaches lack modelling and effective analysis of siot. in this work, we address this problem and specifically focus on modelling the siot’s evolvement. as the growing number of iot objects with heterogeneous attributes join the social network, there is an urgent need for identifying the mechanisms by which siot structures evolve. we model the siot over time and address the suitability of traditional analytical procedures to predict future relationships (links) in the dynamic and heterogeneous siot. specifically, we propose a framework, namely siotpredict, which includes three stages: i) collection of raw movement data of iot devices, ii) generating temporal sequence networks of the siot, and iii) predicting relationships among iot devices which are likely to occur. we have conducted extensive experimental studies to evaluate the proposed framework using real siot datasets and the results show the better performance of our framework. crawling the internet of things (iot) to discover services and information in a trusted-oriented way remains a prolonged challenge [21] . many solutions have been introduced to overcome the challenge. however, due to the increasing number of iot objects in a tremendous rate, these solutions do not scale up. integrating social networking features into the internet of things (iot) paradigm has received an unprecedented amount of attention for the purpose of overcoming issues related to iot. there have been many attempts to integrate iot devices in social loops such as smart-its friend procedure [8] , blog-jects [3] , things that twitter [9] , and ericson project 1 . a new paradigm has emerged from this, called social internet of things (siot), and the key idea of this paradigm is to allow iot objects to establish relationships with each other independently with respect to the heuristics set by the owners of these objects [1, 2, 15, 18] . the perspective of siot is to incorporate the social behaviour of intelligent iot objects and allow them to have their own social networks autonomously. there are several benefits to the siot paradigm. 
first, siot can foster resource availability and enhance services discovery easily in a distributed manner using friends and friends of friends [1] , unlike traditional iot where search engines are employed to find services in a centralized way. second, the centralized manner of searching iot objects raises scalability issue, and siot overcomes the issue because each iot object can navigate the network structure of siot to reach other objects in a distributed way [2, 20, 21] . third, based on the social structure established among iot objects, things can inquire local neighbourhood for other objects to assess the reputation of these objects. fourth, siot enables objects to start new acquaintance where they can exchange information and experience. many research efforts have been devoted to realizing the siot paradigm. however, the majority of the research activities focused on identifying possible policies, methods and techniques for establishing relationships between smart devices autonomously and without any human intervention [1, 2] . in addition, several siot architectures have been proposed [2, 5, 17] . in spite of the intensive research attempts on siot, there are insufficient considerations to model and analyze the resulted siot networks. the nature of siot is dynamic because it can grow and change quickly over time where nodes (iot objects) and edges (relationships) appear or disappear. therefore, there is a growing interest in developing models that allow studying and understanding this evolving network, in particular, predicting the establishment of future links (relationships) [1, 15] . predicting future relationships among iot objects can be utilized for several applications such as service recommendation and service discovery. thus, there is a need for identifying the mechanisms by which siot structures evolve. this is a fundamental research question that has not been addressed in siot yet, and it forms the motivation for this work. however, the size and complexity of the siot network create a number of technical challenges. firstly, the nature of the resulted network structure is dynamic because smart devices can appear and disappear overtime and the existed relationships may vanish and new relationships may establish. secondly, siot is naturally structured as a heterogeneous graph with different types of entities and various relationships [2, 18] . finally, the size of siot network is mas-sive, and hence, it requires efficient and scalable methods. therefore, this paper focuses on modelling the siot network and study, in particular, the problem of predicting future relationships among iot objects. we study the possibility of relationship establishment among iot objects when there is co-occurrence meeting in time and space. our research question centers on how likely two iot objects could create a relationship between each other when they have been approximately on the same geographical location at the same time on multiple occasions. in our work, we develop the siotpredict framework, which includes three stages: i) collecting the raw movement data of iot devices, ii) generating temporal sequence networks of siot, and iii) predicting future relationships that may be established among things. the salient contributions of our study are summarized as follows: -designing and implementing the siotpredict framework for studying the siot network. 
the siotpredict framework consists of three main stages for i) collecting raw movement data of iot devices, ii) generating temporal sequence networks, and iii) predicting future relationships among things. to the best of our knowledge, our framework is the first on siot relationship prediction. -generating temporal sequence networks of siot. we develop two novel algorithms in the second stage of our framework. the first algorithm identifies the stays of iot objects and extracts the corresponding locations. the second algorithm, named sweep line time overlap, discovers when and where any two iot objects have met. -developing a bayesian nonparametric prediction model. we adopt the bayesian nonparametirc learning to build our prediction model. this model can adapt the new incoming observations due to the power representation and flexibility of bayesian nonparametric learning. -conducting comprehensive experiments to assess our framework. siotpredict has been evaluated by extensive experiments using real-world siot datasets [12] . the results demonstrate that our framework outperforms the existing methods. the rest of this paper is organized as follows. section 2 discusses the related works. section 3 presents heterogeneous graph modeling for social iot and introduces the siotpredict framework. the experimental results on real siot datasets are presented in sect. 4, and finally sect. 5 concludes the paper. siot is still in the infancy stage, and several efforts have been devoted to realizing the siot paradigm. most of the current research activities focused on identifying possible policies, methods and techniques for establishing relationships between smart devices autonomously and without any human intervention [2] . atzori et al. [2] proposed several relationships that can be established between iot objects as shown in table 1 . some of these relationships are static such as por and oor, which can usually be defined in advance. other relationships are dynamic and can be established when the conditions of the relationship are met. roopa et al. [18] defined more relationships that may be established among iot objects. nevertheless, current siot research lacks effective modelling and analysis of siot networks. however, in the context of iot, there are a few attempts to exploiting the relationships among smart devices and users for recommending things to users. yao et al. [23] proposed a hyper-graph based on users' social networks. they used existing relationships among users and their things to infer relationships among iot objects. they leveraged this resulted network for recommending things of interest to users. mashal et al. [14] modelled the relationships among users, objects, and services as a tripartite graph with hyper-edges between them. then they explored existing recommendation algorithms to recommend third-party services. nevertheless, these works are mainly based on users' existing relationships and the things they own. atzori et al. [1] emphasized to modelling and analyzing the resulted social graphs (uncorrelated to human social networks) among smart objects in order to introduce proper network analysis algorithms. therefore, our work aims to model the siot network in order to allow studying relationships prediction (link prediction) among iot objects that may form in the future. link prediction is considered as one of the most essential problems that have received much attention in network analysis, and in particular, when anticipating the network structure at a future time. 
a large body of work has investigated link prediction from various angles, including similarity-based measures, algorithmic methods, and probabilistic and statistical methods [11, 13] . recently, there has been growing interest in developing probabilistic network models using bayesian nonparametric learning. bayesian nonparametric learning is capable of capturing the network evolution over different time steps by finding latent structure in observed data. latent class models such as stochastic blockmodels (sbs) [7] and mixed membership stochastic blockmodels (mmsb) [19] depend on the vertex-exchangeability perspective, where nodes are the target unit to assign into clusters. however, these models suffer from generating dense networks, while most real-world networks tend to be sparse. to overcome this limitation, edge-exchangeable models have been proposed to deal with sparse networks [4, 22] . in this perspective, edges are the main units to assign into clusters. in this section, we describe the dynamic heterogeneous siot graph modelling and then present the details of our siotpredict framework. a dynamic, heterogeneous siot graph is composed of nodes and edges, where nodes represent iot devices and edges represent relationships that could be of multiple different types. the formal definition is as follows: an siot network can be considered as a temporal sequence of networks (as depicted in fig. 1a), where the edge set e^t = {e^t_1, . . . , e^t_n} contains n edges observed at time t, the set of vertices v^t is the set of vertices that have participated in at least one edge up to t, and x^t represents the feature matrix at time t, where x_i is the attribute vector of node v_i. throughout the paper, we consider the dynamic relationship establishment using siot data as a case study. figure 1b shows an example of the siot heterogeneous network. for this application, we assume that a heterogeneous siot graph has been obtained at time t from the siot data. given these data, we will predict the likelihood of a relationship (edge) creation between any two iot devices (nodes). this section explains our siotpredict framework for predicting future relationships in siot. figure 2 gives an overview of the siotpredict framework. the framework includes three stages, namely: stage 1: collection of the raw movement data of iot devices, stage 2: generating the temporal sequence networks of siot, and stage 3: predicting future relationships in the siot. in the following, we provide more details on these three stages. in the first stage of our framework, we collect the raw movement data of iot devices. we distinguish two types of iot devices: mobile and static. the coordinates of a static device (e.g., a light pole) are stationary and known, whereas the coordinates of a mobile device (e.g., a bus) are dynamic and changing while the device is moving. we assume that mobile devices include gps technology which provides the location coordinates of these devices along with the timestamp. we also assume mobile iot devices send their location history records continuously (e.g., every 60 seconds). each record contains some important fields: (device id, latitude, longitude, and timestamp). definition 1. a location history record is represented by a point on the earth (latitude, longitude) and a timestamp.
this record tells where an iot object is at a specific time (as illustrated in fig. 3 ). phase 1) identifying "stays" from the raw movement data and extracting "locations". we are interested in knowing where objects meet. therefore, we first need to identify the stays for all objects using their raw movement data. then, we extract locations from these stays. this enables us to identify where and when iot objects have stayed. a stay is a sequence of n location history records, which can be represented by (longitude, latitude, start-time, end-time). longitude and latitude represent the average of the longitude and latitude values in the sequence. start-time indicates the smallest timestamp in the sequence, and end-time represents the largest timestamp (see fig. 3 ). definition 4. a location can be the latitude and longitude of one stay, or it can be the average of the longitude and latitude of a group of stays. the stays in this group are separated by less than or equal to a distance r (see fig. 3 ). we develop algorithm 1 for identifying stays, extracting locations, and then labelling the identified stays by the extracted locations. the input of this algorithm is the raw movement data of iot objects. the output is a list of identified stays labelled by extracted locations. the time complexity of this algorithm is quadratic, since it is required to calculate the distances between each pair of observations in the raw movement data. the first step focuses on stay identification (line ). this step is to identify the stays for each iot object from the given raw movement data. first, we define the time period of the stay (for example, when the value of stay period = 10, it means that we need to identify if an object stays at a place for 10 min). according to our assumption, each record in the raw movement data is sent every one minute, so 10 records represent 10 min. the algorithm calculates the distance among the raw movement data records according to eq. 1, where d is the distance between the two location history records, r is the radius of the sphere (earth), θ_1, θ_2 are the latitudes of the two location history records, and λ_1, λ_2 are the longitudes of the two location history records. then, it groups them if their distances are less than or equal to a threshold r. from lines 18-27, the algorithm checks each group. if a group has a number of records larger than or equal to the value of the stay period, the algorithm takes the average of the latitude and longitude of the group. it also takes the smallest timestamp of the group to be the starting time of the stay, and the largest timestamp to be the ending time of the stay. d = 2r arcsin( sqrt( sin^2((θ_2 − θ_1)/2) + cos θ_1 cos θ_2 sin^2((λ_2 − λ_1)/2) ) ) (1) the second step targets location extraction (line 28-40). the algorithm extracts the list of locations out of the identified stays. since stays are represented by latitude and longitude, the algorithm calculates the distance among these stays, and groups them using a threshold r. the algorithm takes the average of the latitude and longitude of these stays to represent one location. if there is a stay which has not been grouped with any other stays, this stay can represent a location. finally, the third step focuses on labelling stays with locations. in this step, we label the identified stays by one of the extracted locations. time overlap algorithm. we develop algorithm 2 to detect and report all overlapping periods that occurred among the given set of stays produced by algorithm 1.
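before turning to algorithm 2, the following is a simplified sketch of the stay-identification and location-extraction logic of algorithm 1 described above; it groups consecutive records greedily rather than computing all pairwise distances as the text describes, so it is an approximation of the procedure, and the thresholds r and stay_period are illustrative.

```python
import math

EARTH_RADIUS_M = 6371000.0

def haversine(lat1, lon1, lat2, lon2):
    """great-circle distance in metres between two (lat, lon) points, as in eq. (1)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def _summarise(group):
    lat = sum(g[0] for g in group) / len(group)
    lon = sum(g[1] for g in group) / len(group)
    return {"lat": lat, "lon": lon,
            "start": min(g[2] for g in group), "end": max(g[2] for g in group)}

def identify_stays(records, r=50.0, stay_period=10):
    """records: time-ordered (lat, lon, timestamp) tuples for one device.
    consecutive records within distance r form a group; groups with at
    least stay_period records become a stay (average position, start, end)."""
    stays, group = [], []
    for rec in records:
        if not group or haversine(group[-1][0], group[-1][1], rec[0], rec[1]) <= r:
            group.append(rec)
        else:
            if len(group) >= stay_period:
                stays.append(_summarise(group))
            group = [rec]
    if len(group) >= stay_period:
        stays.append(_summarise(group))
    return stays

def extract_locations(stays, r=50.0):
    """greedily merge stays that lie within distance r into labelled locations."""
    locations = []
    for s in stays:
        for loc in locations:
            if haversine(loc["lat"], loc["lon"], s["lat"], s["lon"]) <= r:
                loc["members"].append(s)
                loc["lat"] = sum(m["lat"] for m in loc["members"]) / len(loc["members"])
                loc["lon"] = sum(m["lon"] for m in loc["members"]) / len(loc["members"])
                s["location"] = loc["id"]
                break
        else:
            loc = {"id": len(locations), "lat": s["lat"], "lon": s["lon"], "members": [s]}
            s["location"] = loc["id"]
            locations.append(loc)
    return stays, locations
```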
the purpose is to determine if any two iot objects have met in a location at a particular time. this novel algorithm is named sweep line time overlap (slto). the slto algorithm is inspired by the sweep-line algorithm in geometry which finds intersections between a group of line segments. however, the slto algorithm identifies whether there are overlapping periods among the stays of the objects in a location. the idea of this algorithm is to run a virtual sweep-line parallel to the y-axis and move it from left to right in order to scan intervals on the x-axis. when this sweep-line detects overlaps among stays (they look like line segments which represent the stay periods of objects), it starts calculating whether there is an overlap and reporting these overlaps. there are two main steps. the first step focuses on storing all the intervals of the stays (line 2-3). since our goal is to find the overlapping periods among the set of stays, the algorithm initializes two data structures: i) a priority queue q to store all the intervals of the stays we got from algorithm 1 in sorted order, and ii) a sweep-line status s to scan the stays from left to right. the second step runs the sweep line s (line 4-10). we get the interval end points from q one by one to allow the sweep line s to scan them. the sweep line detects the start of a stay in the space and adds it to s. also, when it detects the end of a stay (in this case, slto has finished scanning the stay), the algorithm checks the last element in s. if it is the start of this stay, then the algorithm removes the stay from s with no action. if the last element in s is not the start of the finished stay, then one or more overlaps have been detected between this stay and the other active stays in s. algorithm 3 reports the overlaps discovered by slto. it calculates the length of each detected overlap, and the results are reported: the overlap period between two stays l1 and l2 is computed as overlapperiod ← min(l1.end, l2.end) − max(l1.start, l2.start), and the overlap is reported together with the ids of the objects and the overlap period. the time complexity of slto is o(n log n + l), since the stay time is only calculated when overlaps between objects exist (fig. 4) . phase 3) generating the temporal networks of siot. after obtaining the time overlapping periods of stays among iot objects, we are able to know the count of meetings that occurred between any two iot objects. according to this, we check the rules of the targeted relationship, such as how many times they have met and the length of the interval period. if the rules are met, then we build the temporal sequence of the siot composed of this relationship to be used in our prediction model stage. after generating the temporal sequence networks in the second stage, our next step is to model each one of them using the bayesian non-parametric model [22] . this model allows combining structure elucidation with predictive performance by clustering links (edges) rather than nodes. our aim here is to predict links (relationships) between iot objects that are likely to occur in the subsequent snapshot of the network. therefore, the siot network is modelled as an exchangeable sequence of observed links (relationships), which allows adapting to the growth of the network over time. we assume that the siot network clusters into groups, and for this, we model each community using a mixture of dirichlet network distributions.
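returning to the meeting-detection step, the sketch below implements the sweep-line time overlap idea described above for the stays observed at one location; it is a compact reinterpretation of algorithms 2 and 3, not their exact pseudocode, and it assumes each stay has been tagged with the id of the device that produced it.

```python
import heapq
from collections import defaultdict

def overlaps_at_location(stays):
    """stays: list of dicts with keys 'device', 'start', 'end' for one location.
    sweep the interval end points in time order and report, for every pair of
    devices whose stays overlap, the total overlapping time."""
    events = []  # (time, kind, index); kind 0 = start, 1 = end
    for idx, s in enumerate(stays):
        heapq.heappush(events, (s["start"], 0, idx))
        heapq.heappush(events, (s["end"], 1, idx))
    active, meetings = set(), defaultdict(float)
    while events:
        _, kind, idx = heapq.heappop(events)
        if kind == 0:                      # sweep line enters a stay
            active.add(idx)
        else:                              # sweep line leaves a stay
            active.discard(idx)
            for other in active:           # every still-active stay overlaps it
                a, b = stays[idx], stays[other]
                overlap = min(a["end"], b["end"]) - max(a["start"], b["start"])
                if overlap > 0 and a["device"] != b["device"]:
                    key = tuple(sorted((a["device"], b["device"])))
                    meetings[key] += overlap
    return dict(meetings)
```

the per-pair overlap totals returned here are what the third phase checks against the rules of the targeted relationship (number of meetings and length of the meeting period) before adding an edge to the temporal siot network.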
the description of the model is as follows: we model the relationships of the siot network using a dirichlet distribution g = ∑_{i=1}^∞ π_i δ_{θ_i}, where δ_{θ_i} is a delta function centered on θ_i, and π_i is the corresponding probability of an edge existing at θ_i, with ∑_{i=1}^∞ π_i = 1. the parameter γ controls the total number of nodes in the network. to model the size and number of clusters, we use a stick-breaking distribution gem(α) with concentration parameter α that controls the number of clusters. the model places a distribution over all clusters, and it places a per-cluster distribution over the nodes. to generate an edge, first, a cluster is picked according to d. then, two nodes (devices) are sampled according to g. the probability of predicting a link between any two objects is proportional to the product of the degrees of these two devices. for the inference part, which is based on bayes' rule (eq. 3), we follow the same steps conducted in [22] to compute the distribution over the cluster assignment using the chinese restaurant process and evaluate the predictive distribution over the n-th link, given the previous n − 1 links. we perform inference using a markov chain monte carlo (mcmc) scheme [22] . we evaluated the effectiveness and efficiency of the siotpredict framework based on comprehensive experiments. in this section, we discuss the experimental design and report the results. we used the siot datasets to evaluate the siotpredict framework. these datasets are based on real iot objects available in the city of santander and contain a description of the iot objects. each object is represented by fields such as (device id, id user, device type, device brand, device model). the total number of iot objects is 16,216. 14,600 objects are from private users and 1,616 are from public services. the dataset includes the raw movement data of devices that are owned by users and by the smart city. there are two kinds of devices: static devices and mobile devices. static devices are represented by fixed latitudes and longitudes. mobile devices are represented by latitudes, longitudes, and timestamps. the latitude and longitude values of mobile devices are dynamic. in addition, the dataset includes an adjacency matrix for siot relationships produced with some defined parameters. in table 2 , we only depict the sor and sor2 relationships and their parameters to be used in our experiments. in this section, we explain the common metrics and the comparison methods. performance metrics. our performance metrics used in the experiments include accuracy, precision, recall, and f1 score. following the work on information diffusion in [6] , we define the accuracy as the ratio of correctly predicted edges to the total edges in the true network, precision as the fraction of edges in the predicted network that are also present in the true network, recall as the fraction of edges of the true network that are also present in the predicted network, and finally the f1 score as the weighted average of precision and recall. comparison methods. the comparison methods include the stochastic blockmodel (sb) [7] and the mixed membership stochastic blockmodel (mmsb) [19] . although the aforementioned models are not explicitly designed for link prediction, they can be modified for the prediction task using the above procedure of selecting the n highest probability edges [22] . in addition, these models suffer from the limitation of assuming a fixed number of vertices. furthermore, we also compared our approach with common link prediction methods [10] : resource allocation, adamic adar index, jaccard coefficient, xgboost, and common neighbor.
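the following is a heavily truncated generative sketch of the edge-clustering model described at the beginning of this section (stick-breaking weights over clusters, a per-cluster distribution over nodes, and edges drawn cluster-first), together with the degree-product link score quoted above; the truncation levels and the use of a finite dirichlet in place of a dirichlet process are simplifying assumptions, self-loops are not excluded, and the actual inference in the paper is a crp-based mcmc scheme.

```python
import numpy as np

def sample_edge_clustered_network(n_edges=2000, alpha=3.0, gamma=1.0,
                                  max_clusters=50, max_nodes=2000, seed=0):
    """truncated sketch of the generative story: cluster weights from a
    stick-breaking gem(alpha) prior, per-cluster node weights from a
    (truncated) dirichlet with concentration gamma, and each edge drawn
    by picking a cluster and then two nodes."""
    rng = np.random.default_rng(seed)
    # stick-breaking weights over clusters
    v = rng.beta(1.0, alpha, size=max_clusters)
    d = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    d /= d.sum()
    # per-cluster distribution over (potential) nodes
    g = rng.dirichlet(np.full(max_nodes, gamma / max_nodes), size=max_clusters)
    edges = []
    for _ in range(n_edges):
        k = rng.choice(max_clusters, p=d)
        u, w = rng.choice(max_nodes, size=2, p=g[k])
        edges.append((int(u), int(w), int(k)))
    return edges

def degree_based_scores(edges):
    """the predictive rule quoted above: the score of a new link between two
    devices is proportional to the product of their observed degrees."""
    deg = {}
    for u, w, _ in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[w] = deg.get(w, 0) + 1
    return lambda a, b: deg.get(a, 0) * deg.get(b, 0)
```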
(a) roc curve using nodes in the training set. (b) roc curve using nodes outside the training set. we modeled the siot network to predict future interactions among devices, and that enabled us to have a better understanding on the resulted network. based on the existing bayesian models, nodes are assigned to clusters, and these clusters control the way of how these nodes establish relationships. figure 5 and fig. 6 show the performance of siotpredict against other methods. we used a small network, and the reason of that is due to the nature of sb and mmsb, which do not scale very well on large networks [16] . we experimented the performance of these models and methods in two ways. in the first experiment, we used the same nodes (i.e., iot objects) in the training set and the test set. that means there were no nodes in the test set outside the training set. we performed experiments in this way because sb, mmsb and other methods assume nodes in the test set are not outside the training set. in the second experiment, the nodes in the test set are outside the training set. for the overall performance, sb does not perform well against the mmsb and our model. the reason is that the assumption of sb states that nodes can only belong to one cluster whereas we see mmsb performs better than sb because it relaxes this assumption by allowing the nodes to belong to more than one clusters. however, sb, mmsb and other common methods do not perform well compared to our model on both settings as illustrated in fig. 5 and fig. 6 . in particular, these methods perform poorly in the second setting (i.e., the nodes in the test set are outside the training set) due to their limitation on dealing with new nodes. in contrast, our model delivers the similar performance. social internet of things (siot) can foster and enhance resource availability, discovering services, assessing object reputations, composing services, exchanging information and experience. in addition, siot enables establishing new acquaintances, collaborating to achieve common goals, and exploiting other object capabilities. therefore, instead of relying on centralized search engine, social structure resulted from the created relationships can be utilized in order to find the desired services. in this paper, we take the research line of siot to a new dimension by proposing the siotpredict framework that addresses the link prediction problem in the siot paradigm. this framework contains three stages: i) collecting raw data movement of iot devices, ii) generating temporal sequence networks of siot, and iii) predicting the links that are likely form between iot objects in the future. ongoing work includes further assessment of the siotpredict framework, and enhancement of the relationship prediction by considering the features of iot objects (e.g., services offered by the objects). 
from smart objects to social objects: the next evolutionary step of the internet of things the social internet of things (siot) -when social networks meet the internet of things: concept, architecture and network characterization a manifesto for networked objects -cohabiting with pigeons, arphids and aibos in the internet of things edge-exchangeable graphs and sparsity lysis: a platform for iot distributed applications over socially connected objects inferring networks of diffusion and influence stochastic blockmodels: first steps smart-its friends: a technique for users to easily establish connections between smart artefacts things that twitter: social networks and the internet of things the link-prediction problem for social networks link prediction in complex networks: a survey a dataset for performance analysis of the social internet of things a survey of link prediction in complex networks analysis of recommendation algorithms for internet of things friendship selection in the social internet of things: challenges and possible strategies bayesian models of graphs, arrays and other exchangeable random structures the cluster between internet of things and social networks: review and research challenges social internet of things (siot): foundations, thrust areas, systematic review and future directions estimation and prediction for stochastic blockmodels for graphs with latent block structure searching the web of things: state of the art, challenges, and solutions internet of things search engine nonparametric network models for link prediction things of interest recommendation by leveraging heterogeneous relations in the internet of things key: cord-134926-dk28wutc authors: dasgupta, anirban; sengupta, srijan title: scalable estimation of epidemic thresholds via node sampling date: 2020-07-28 journal: nan doi: nan sha: doc_id: 134926 cord_uid: dk28wutc infectious or contagious diseases can be transmitted from one person to another through social contact networks. in today's interconnected global society, such contagion processes can cause global public health hazards, as exemplified by the ongoing covid-19 pandemic. it is therefore of great practical relevance to investigate the network trans-mission of contagious diseases from the perspective of statistical inference. an important and widely studied boundary condition for contagion processes over networks is the so-called epidemic threshold. the epidemic threshold plays a key role in determining whether a pathogen introduced into a social contact network will cause an epidemic or die out. in this paper, we investigate epidemic thresholds from the perspective of statistical network inference. we identify two major challenges that are caused by high computational and sampling complexity of the epidemic threshold. we develop two statistically accurate and computationally efficient approximation techniques to address these issues under the chung-lu modeling framework. the second approximation, which is based on random walk sampling, further enjoys the advantage of requiring data on a vanishingly small fraction of nodes. we establish theoretical guarantees for both methods and demonstrate their empirical superiority. infectious diseases are caused by pathogens, such as bacteria, viruses, fungi, and parasites. many infectious diseases are also contagious, which means the infection can be transmitted from one person to another when there is some interaction (e.g., physical proximity) between them. 
today, we live in an interconnected world where such contagious diseases could spread through social contact networks to become global public health hazards. a recent example of this phenomenon is the covid-19 outbreak caused by the so-called novel coronavirus (sars-cov-2) that has spread to many countries zhu et al., 2020; wang et al., 2020; sun et al., 2020) . this recent global outbreak has caused serious social and economic repercussions, such as massive restrictions on movement and share market decline (chinazzi et al., 2020) . it is therefore of great practical relevance to investigate the transmission of contagious diseases through social contact networks from the perspective of statistical inference. consider an infection being transmitted through a population of n individuals. according to the susceptible-infected-recovered (sir) model of disease spread, the pathogen can be transmitted from an infected person (i) to a susceptible person (s) with an infection rate given by β, and an infected individual becomes recovered (r) with a recovery rate given by µ. this can be modeled as a markov chain whose state at time t is given by a vector (x t 1 , . . . , x t n ), where x t i denotes the state of the i th individual at time t, i.e., x t i ∈ {s, i, r}. for the population of n individuals, the state space of this markov chain becomes extremely large with 3 n possible configurations, which makes it impractical to study the exact system. this problem was addressed in a series of three seminal papers by kermack and mckendrick (kermack and mckendrick, 1927 , 1932 , 1933 . instead of modeling the disease state of each individual at at a given point of time, they proposed compartmental models, where the goal is to model the number of individuals in a particular disease state (e.g., susceptible, infected, recovered) at a given point of time. since their classical papers, there has been a tremendous amount of work on compartmental modeling of contagious diseases over the last ninety years (hethcote, 2000; van den driessche and watmough, 2002; brauer et al., 2012) . compartmental models make the unrealistic assumption of homogeneity, i.e., each individual is assumed to have the same probability of interacting with any other individual. in reality, individuals interact with each other in a highly heterogeneous manner, depending upon various factors such as age, cultural norms, lifestyle, weather, etc. the contagion process can be significantly impacted by heterogeneity of interactions rocha et al., 2011; galvani and may, 2005; woolhouse et al., 1997) , and therefore compartmental modeling of contagious diseases can lead to substantial errors. in recent years, contact networks have emerged as a preferred alternative to compartmental models (keeling, 2005) . here, a node represents an individual, and an edge between two nodes represent social contact between them. an edge connecting an infected node and a susceptible node represents a potential path for pathogen transmission. this framework can realistically represent the heterogeneous nature of social contacts, and therefore provide much more accurate modeling of the contagion process than compartmental models. notable examples where the use of contact networks have led to improvements in prediction or understanding of infectious diseases include bengtsson et al. (2015) and kramer et al. (2016) . consider the scenario where a pathogen is introduced into a social contact network and it spreads according to an sir model. 
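as a concrete reference point for this scenario, the following is a minimal discrete-time sir simulation on a given contact network; it is a sketch for small networks with a dense adjacency matrix, not the stochastic framework analysed in the paper, and all rates and sizes are illustrative.

```python
import numpy as np

def simulate_sir(adj, beta=0.05, mu=0.1, i0=1, steps=200, seed=0):
    """discrete-time sir dynamics on a contact network given by a dense
    0/1 adjacency matrix adj (real urban networks would need sparse storage)."""
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    state = np.zeros(n, dtype=int)          # 0 = S, 1 = I, 2 = R
    state[rng.choice(n, size=i0, replace=False)] = 1
    history = []
    for _ in range(steps):
        infected = np.flatnonzero(state == 1)
        susceptible = np.flatnonzero(state == 0)
        # each susceptible node is exposed once per infected neighbour
        n_inf_neighbours = adj[np.ix_(susceptible, infected)].sum(axis=1)
        p_infection = 1.0 - (1.0 - beta) ** n_inf_neighbours
        newly_infected = susceptible[rng.random(len(susceptible)) < p_infection]
        recovered = infected[rng.random(len(infected)) < mu]
        state[newly_infected] = 1
        state[recovered] = 2
        history.append((np.sum(state == 0), np.sum(state == 1), np.sum(state == 2)))
        if not np.any(state == 1):
            break
    return history
```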
it is of particular interest to know whether the pathogen will die out or lead to an epidemic. this is dictated by a set of boundary conditions known as the epidemic threshold, which depends on the sir parameters β and µ as well as the network structure itself. above the epidemic threshold, the pathogen invades and infects a finite fraction of the population. below the epidemic threshold, the prevalence (total number of infected individuals) remains infinitesimally small in the limit of large networks (pastor-satorras et al., 2015) . there is growing evidence that such thresholds exist in real-world host-pathogen systems, and intervention strategies are formulated and executed based on estimates of the epidemic threshold. (dallas et al., 2018; shulgin et al., 1998; wallinga et al., 2005; pourbohloul et al., 2005; meyers et al., 2005) . fittingly, the last two decades have seen a significant emphasis on studying epidemic thresholds of contact networks from several disciplines, such as computer science, physics, and epidemiology (newman, 2002; wang et al., 2003; colizza and vespignani, 2007; chakrabarti et al., 2008; gómez et al., 2010; wang et al., 2016 . see leitch et al. (2019) for a complete survey on the topic of epidemic thresholds. concurrently but separately, network data has rapidly emerged as a significant area in statistics. over the last two decades, a substantial amount of methodological advancement has been accomplished in several topics in this area, such as community detection (bickel and chen, 2009; zhao et al., 2012; rohe et al., 2011; sengupta and chen, 2015) , model fitting and model selection (hoff et al., 2002; handcock et al., 2007; krivitsky et al., 2009; wang and bickel, 2017; yan et al., 2014; bickel and sarkar, 2016; sengupta and chen, 2018) , hypothesis testing (ghoshdastidar and von luxburg, 2018; tang et al., 2017a,b; bhadra et al., 2019) , and anomaly detection (zhao et al., 2018; sengupta, 2018; komolafe et al., 2019) , to name a few. the state-of-the-art toolbox of statistical network inference includes a range of random graph models and a suite of estimation and inference techniques. however, there has not been any work at the intersection of these two areas, in the sense that the problem of estimating epidemic thresholds has not been investigated from the perspective of statistical network inference. furthermore, the task of computing the epidemic threshold based on existing results can be computationally infeasible for massive networks. in this paper, we address these gaps by developing a novel sampling-based method to estimate the epidemic threshold under the widely used chung-lu model (aiello et al., 2000) , also known as the configuration model. we prove that our proposed method has theoretical guarantees for both statistical accuracy and computational efficiency. we also provide empirical results demonstrating our method on both synthetic and real-world networks. the rest of the paper is organized as follows. in section 2, we formally set up the prob-lem statement and formulate our proposed methods for approximating the epidemic threshold. in section 3, we desribe the theoretical properties of our estimators. in section 4, we report numerical results from synthetic as well as real-world networks. we conclude the paper with discussion and next steps in section 5. 
notation (definition and description):
λ(a): spectral radius of the matrix a
d_i: degree of node i of the network
δ_i: expected degree of node i of the network
s(t), i(t), r(t): number of susceptible (s), infected (i), and recovered/removed (r) individuals in the population at time t
β: infection rate, i.e., the probability of transmission of a pathogen from an infected individual to a susceptible individual per effective contact (e.g., contact per unit time in continuous-time models, or per time step in discrete-time models)
µ: recovery rate, i.e., the probability that an infected individual will recover per unit time (in continuous-time models) or per time step (in discrete-time models)
consider a set of n individuals labelled 1, . . . , n, and an undirected network (with no self-loops) representing interactions between them. this can be represented by an n-by-n symmetric adjacency matrix a, where a(i, j) = 1 if individuals i and j interact and a(i, j) = 0 otherwise. consider a pathogen spreading through this contact network according to an sir model. from existing work (chakrabarti et al., 2008; gómez et al., 2010; prakash et al., 2010; wang et al., 2016), we know that the boundary condition for the pathogen to become an epidemic is given by β/µ > 1/λ(a), (1) where λ(a) is the spectral radius of the adjacency matrix a. the left hand side of equation (1) is the ratio of the infection rate to the recovery rate, which is purely a function of the pathogen and independent of the network. as this ratio grows larger, an epidemic becomes more likely, as new infections outpace recoveries. the right hand side of equation (1) is the inverse of the spectral radius of the adjacency matrix, which is purely a function of the network and independent of the pathogen. the larger the spectral radius, the more connected the network, and therefore an epidemic becomes more likely. thus, the boundary condition in equation (1) connects the two aspects of the contagion process: the pathogen transmissibility, which is quantified by β/µ, and the social contact network, which is quantified by the spectral radius. if β/µ < 1/λ(a), the pathogen dies out, and if β/µ > 1/λ(a), the pathogen becomes an epidemic. given a social contact network, the inverse of the spectral radius of its adjacency matrix represents the epidemic threshold for the network. any pathogen whose transmissibility ratio is greater than this threshold is going to cause an epidemic, whereas any pathogen whose transmissibility ratio is less than this threshold is going to die out. therefore, a key problem in network epidemiology is to compute the spectral radius of the social contact network. realistic urban social networks that are used in modeling contagion processes have millions of nodes (eubank et al., 2004; barrett et al., 2008). to compute the epidemic threshold of such networks, we need to find the largest (in absolute value) eigenvalue of the adjacency matrix a. this is challenging for two reasons. first, from a computational perspective, eigenvalue algorithms have computational complexity of Ω(n^2) or higher. for massive social contact networks with millions of nodes, this can become too burdensome. second, from a statistical perspective, eigenvalue algorithms require the entire adjacency matrix for the full network of n individuals. it can be challenging or expensive to collect interaction data on n individuals of a massive population (e.g., an urban metropolis).
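for concreteness, the short sketch below evaluates the threshold condition in equation (1) for a moderately sized network, computing λ(a) with a sparse eigenvalue routine; the helper name and the scipy/networkx usage are assumptions of this example rather than the authors' code.

```python
import networkx as nx
from scipy.sparse.linalg import eigsh

def epidemic_threshold(graph):
    """Return 1 / lambda(A), the epidemic threshold of the contact network."""
    a = nx.to_scipy_sparse_array(graph, dtype=float, format="csr")
    # largest eigenvalue of the symmetric, non-negative adjacency matrix
    lam = eigsh(a, k=1, which="LA", return_eigenvectors=False)[0]
    return 1.0 / lam

g = nx.barabasi_albert_graph(5000, 5, seed=1)
beta, mu = 0.05, 0.2
tau = epidemic_threshold(g)
print(f"beta/mu = {beta / mu:.3f}, threshold = {tau:.3f}")
print("epidemic expected" if beta / mu > tau else "pathogen dies out")
```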
furthermore, eigenvalue algorithms typically require the full matrix to be stored in the random-access memory of the computer, which can be infeasible for massive social contact networks which are too large to be stored. the first issue could be resolved if we could compute the epidemic threshold in a computationally efficient manner. the second issue could be resolved if we could compute the epidemic threshold using data on only a small subset of the population. in this paper, we aim to resolve both issues by developing two approximation methods for computing the spectral radius. to address these problems, let us look at the spectral radius, λ(a), from the perspective of random graph models. the statistical model is given by a ∼ p, which is short-hand for a(i, j) ∼ bernoulli(p(i, j)) for 1 ≤ i < j ≤ n. then λ(a) converges to λ(p) in probability under some mild conditions (chung and radcliffe, 2011; benaych-georges et al., 2019; bordenave et al., 2020). to make a formal statement regarding this convergence, we reproduce below a slightly paraphrased version (for notational consistency) of an existing result in this context. lemma 1 (theorem 1 of chung and radcliffe (2011)). let ∆ = max_i δ_i be the maximum expected degree, and suppose that for some ε > 0, ∆ > (4/9) log(2n/ε) for sufficiently large n. then with probability at least 1 − ε, for sufficiently large n, |λ(a) − λ(p)| ≤ 2 √(∆ log(2n/ε)). to make note of a somewhat subtle point: from an inferential perspective it is tempting to view the above result as a consistency result, where λ(p) is the population quantity or parameter of interest and λ(a) is its estimator. however, in the context of epidemic thresholds, we are interested in the random variable λ(a) itself, as we want to study the contagion spread conditional on a given social contact network. therefore, in the present context, the above result should not be interpreted as a consistency result. rather, we can use the convergence result in a different way. for massive networks, the random variable λ(a), which we wish to compute but find infeasible to do so, is close to the parameter λ(p). suppose we can find a random variable t(a) which also converges in probability to λ(p), and is computationally efficient. since t(a) and λ(a) both converge in probability to λ(p), we can use t(a) as an accurate proxy for λ(a). this would address the first of the two issues described at the beginning of this subsection. furthermore, if t(a) can be computed from a small subset of the data, that would also solve the second issue. this is our central heuristic, which we are going to formalize next. so far, we have not made any structural assumptions on p; we have simply considered the generic inhomogeneous random graph model. under such a general model, it is very difficult to formulate a statistic t(a) which is cheap to compute and converges to λ(p). therefore, we now introduce a structural assumption on p, in the form of the well-known chung-lu model that was introduced by aiello et al. (2000) and subsequently studied in many papers (chung and lu, 2002; chung et al., 2003; decreusefond et al., 2012; pinar et al., 2012; zhang et al., 2017). for a network with n nodes, let δ = (δ_1, . . . , δ_n) be the vector of expected degrees. then under the chung-lu model, p(i, j) = δ_i δ_j / ∑_k δ_k. (2) this formulation preserves e[d_i] = δ_i, where d_i is the degree of the i-th node, and is very flexible with respect to degree heterogeneity.
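the following sketch illustrates this heuristic numerically under the chung-lu model of equation (2): it samples a network from a given expected-degree sequence and compares λ(a) with the degree-moment ratio ∑_i δ_i^2 / ∑_i δ_i that characterizes λ(p) under this model (discussed next). the sampling helper and the chosen degree sequence are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

def sample_chung_lu(delta, rng):
    """Sample an undirected Chung-Lu graph with P(i,j) = delta_i * delta_j / sum(delta)."""
    n = len(delta)
    p = np.clip(np.outer(delta, delta) / delta.sum(), 0.0, 1.0)
    upper = np.triu(rng.random((n, n)) < p, k=1)   # sample each pair i < j once
    a = upper | upper.T                            # symmetrize, no self-loops
    return csr_matrix(a.astype(float))

rng = np.random.default_rng(0)
n = 5000
delta = rng.uniform(5.0, 60.0, size=n)             # illustrative expected degrees
a = sample_chung_lu(delta, rng)
lam_a = eigsh(a, k=1, which="LA", return_eigenvectors=False)[0]
lam_p = (delta ** 2).sum() / delta.sum()
print(f"lambda(A) = {lam_a:.2f}, lambda(P) = {lam_p:.2f}")
```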
under model (2), note that rank(p) = 1, and we have λ(p) = ∑_i δ_i^2 / ∑_i δ_i. (3) recall that we are looking for some computationally efficient t(a) which converges in probability to λ(p). we now know that under the chung-lu model, λ(p) is equal to the ratio of the second moment to the first moment of the degree distribution. therefore, a simple estimator of λ(p) is given by the sample analogue of this ratio, i.e., t_1(a) = ∑_i d_i^2 / ∑_i d_i. we now want to demonstrate that approximating λ(a) by t_1(a) provides us with very substantial computational savings with little loss of accuracy. the approximation error can be quantified as e_1(a) = |t_1(a) − λ(a)| / λ(a), (4) and our goal is to show that e_1(a) → 0 in probability, while the computational cost of t_1(a) is much smaller than that of λ(a). we will show this both from a theoretical perspective and an empirical perspective. we next describe the empirical results from a simulation study, and we postpone the theoretical discussion to section 3 for organizational clarity. we used n = 5000, 10000, and constructed a chung-lu random graph model where p(i, j) = θ_i θ_j. the model parameters θ_1, . . . , θ_n were uniformly sampled from (0, 0.25). then, we randomly generated 100 networks from the model, and computed λ(a) and t_1(a). the results are reported in table 2. the average runtime for the moment based estimator, t_1(a), is only 0.07 seconds for n = 5000 and 0.35 seconds for n = 10000, whereas for the spectral radius, λ(a), it is 78.2 seconds and 606.44 seconds respectively, which makes the latter 1100-1700 times more computationally burdensome. the average error for t_1(a) is very small, and so is the sd of errors. thus, even for moderately sized networks where n = 5000 or n = 10000, using t_1(a) as a proxy for λ(a) can reduce the computational cost to a great extent, and the corresponding loss in accuracy is very small. for massive networks where n is in the millions, this advantage of t_1(a) over λ(a) is even greater; however, the computational burden for λ(a) becomes so large that this case is difficult to illustrate using standard computing equipment. thus, t_1(a) provides us with a computationally efficient and statistically accurate method for finding the epidemic threshold. the first approximation, t_1(a), provides us with a computationally efficient method for finding the epidemic threshold. this addresses the first issue pointed out at the beginning of section 2.1. however, computing t_1(a) requires data on the degrees of all n nodes of the network. therefore, this does not solve the second issue pointed out at the beginning of section 2.1. we now propose a second alternative, t_2, to address the second issue. the idea behind this approximation is based on the same heuristic that was laid out in section 2.2. since λ(p) is a function of degree moments, we can estimate these moments using observed node degrees. in defining t_1(a), we used the observed degrees of all n nodes in the network. however, we can also estimate the degree moments by considering a small sample of nodes, based on random walk sampling. the algorithm for computing t_2 is given in algorithm 1.
algorithm 1: randomwalkestimate
1: procedure estimate(g, r, t*)
2:   x ← 1, t ← 0
3:   while t ≤ t* do
4:     x ← random neighbor of x, chosen uniformly; t ← t + 1
5:   v ← 0, i ← 0
6:   while i ≤ r do
7:     x ← random neighbor of x, chosen uniformly
8:     v ← v + d_x; i ← i + 1
9:   return t_2 = v/r
note that we only use (t* + r) randomly sampled nodes for computing t_2, which implies that we do not need to collect or store data on all n individuals.
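a direct python translation of algorithm 1 might look as follows; the function and variable names are ours, and the degree-accumulation step reflects the fact that a simple random walk visits nodes proportionally to their degree, so the average degree of the visited nodes estimates ∑_i d_i^2 / ∑_i d_i.

```python
import random
import networkx as nx

def random_walk_estimate(g, r, t_star, start=None, rng=random):
    """Algorithm 1 sketch: estimate T_2 ~ sum(d_i^2)/sum(d_i) via random walk sampling."""
    x = start if start is not None else next(iter(g.nodes))
    # burn-in phase: let the walk approach its (degree-biased) stationary distribution
    for _ in range(t_star):
        x = rng.choice(list(g.neighbors(x)))
    # estimation phase: average the degrees of the visited nodes
    v = 0.0
    for _ in range(r):
        x = rng.choice(list(g.neighbors(x)))
        v += g.degree(x)
    return v / r

g = nx.barabasi_albert_graph(50_000, 5, seed=1)
t2 = random_walk_estimate(g, r=2000, t_star=500)
degrees = [d for _, d in g.degree()]
t1 = sum(d * d for d in degrees) / sum(degrees)
print(f"T_1 = {t1:.2f}, T_2 = {t2:.2f}")
```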
therefore this method overcomes the second issue pointed out at the beginning of section 2.1. the approximation error arising from this method can be defined as e_2(a) = |t_2(a) − λ(a)| / λ(a), (5) and we want to show that e_2(a) → 0 in probability, while the data-collection cost of t_2(a) is much less than that of t_1(a). in the next section, we are going to formalize this. in this section, we are going to establish that the approximation errors e_1(a) and e_2(a), defined in equations (4) and (5), converge to zero in probability. from theorem 2.1 of chung et al. (2003), we know that when (6) holds, then for any ε > 0, |λ(a)/λ(p) − 1| ≤ ε with probability tending to one. therefore, under (6), it suffices to show that, for any ε > 0, the ratio t_1(a)/λ(p) also lies within (1 − ε, 1 + ε) with probability tending to one (7). we will show that, for any ε > 0, the observed degree moments concentrate around their expectations, i.e., ∑_i d_i / ∑_i δ_i and ∑_i d_i^2 / ∑_i δ_i^2 both lie within (1 − ε, 1 + ε) with probability tending to one (8). we first prove that (8) implies (7). writing m_1 = ∑_i d_i and m_2 = ∑_i d_i^2, note that m_2/m_1 is a strictly increasing function of m_2 and a strictly decreasing function of m_1. therefore, for outcomes belonging to the event in (8), the ratio t_1(a)/λ(p) lies within ((1 − ε)/(1 + ε), (1 + ε)/(1 − ε)). note that 1 − (1 − ε)/(1 + ε) = 2ε/(1 + ε) < 2ε, and (1 + ε)/(1 − ε) − 1 = 2ε/(1 − ε) < 4ε, given that ε < 1/2. now, fix ε' > 0 and let ε = ε'/4. then (8) with ε implies (7) with ε'; thus, proving (8) is sufficient for proving (7). next, we state and prove the theorem which will establish (8). theorem 2. if the average of the expected degrees goes to infinity, i.e., (1/n) ∑_i δ_i → ∞, and the spectral radius dominates log^2(n), i.e., ∑_i δ_i^2 / ∑_i δ_i = ω(log^2 n), then (8) holds for any ε > 0. proof. we will use hoeffding's inequality (hoeffding, 1994) for the first part, and we begin by stating the inequality for the sum of bernoulli random variables. let b_1, . . . , b_m be m independent (but not necessarily identically distributed) bernoulli random variables, and s_m = ∑_{i=1}^m b_i. then for any t > 0, pr(|s_m − e[s_m]| ≥ t) ≤ 2 exp(−2t^2/m). in our case, ∑_i d_i = 2 ∑_{i<j} a(i, j), and we know that {a(i, j) : 1 ≤ i < j ≤ n} are independent bernoulli random variables. fix ε > 0 and note that e[∑_i d_i] = ∑_i δ_i, so the above inequality bounds the probability that ∑_i d_i deviates from ∑_i δ_i by more than a factor of (1 ± ε). for the random walk estimator, using the bound on λ_min(l) of the laplacian of g, it follows from the above that the spectral gap satisfies ε(q) = 1 − λ_2(q) = 1 − λ_2(d^{−1/2} a d^{−1/2}) = λ_{n−1}(i − d^{−1/2} a d^{−1/2}) = 1 − o(1). putting these together, we get the following corollary on the total number of node queries. corollary 6.1. for a graph generated from the expected degrees model, with probability 1 − 1/n, algorithm 1 needs to query at most 6 d_max/d_min ... nodes; this is a loose bound, and better bounds can be derived for power law degree distributions, for instance. thus, we have proved that the approximation error for t_2(a) goes to zero in probability. in addition, corollary 6.1 shows that the number of nodes that we need to query in order to have an accurate approximation is much smaller than n. furthermore, computing t_2 only requires node sampling and counting degrees, and therefore the runtime is much smaller than that of eigenvalue algorithms. therefore, t_2(a) is a computationally efficient and statistically accurate approximation of the epidemic threshold, while also requiring a much smaller data budget compared to t_1(a). in this section, we characterize the empirical performance of our sampling algorithm on two synthetic networks, one generated from the chung-lu model and the second generated from the preferential attachment model. our first dataset is a graph generated from the chung-lu model of expected degrees. we generated a power-law sequence (i.e., the fraction of nodes with degree d is proportional to d^{−β}) with exponent β = 2.5 and then generated a graph with this sequence as the expected degrees.
table 3: statistics of the two synthetic datasets used.
data | nodes | edges | λ(a) | t_1(a)
chung-lu | 50k | 72k | 43.83 | 48.33
pref-attach | 50k | 250k | 37 | 32.8
table 3 shows that, as expected, the first eigenvalue λ(a) is closely approximated by t_1(a) for the chung-lu graph. the second dataset is generated from the preferential attachment model, where each incoming node adds 5 edges to the existing nodes, the probability of choosing a specific node as neighbor being proportional to the current degree of that node. while the preferential attachment model naturally gives rise to a directed graph, we convert the graph to an undirected one before running our algorithm. it is interesting to note that even though in this case the chung-lu model does not hold, our first approximation, t_1(a), is close to λ(a). in each of the networks, the random walk algorithm presented in algorithm 1 was used for sampling. the random walk was started from an arbitrary node and every 10th node was sampled from the walk (to account for the mixing time). these samples were then used to calculate t_2(a). this experiment was repeated 10 times, giving estimates t_2^1, . . . , t_2^10. for each i ∈ {1, 2, . . . , 10}, we then calculate two relative errors, (t_1(a) − t_2^i)/t_1(a) and (λ(a) − t_2^i)/λ(a), and we plot the averages of these errors against the actual number of nodes seen by the random walk. note that the x-axis accurately reflects how many times the algorithm actually queried the network, not just the number of samples used. measuring the cost of uniform node sampling in this setting, for instance, would need to keep track of how many nodes are touched by a metropolis-hastings walk that implements the uniform distribution. figure 1 demonstrates the results. for the two synthetic networks, the algorithm is able to get a 10% approximation to the statistic t_1(a) by exploring at most 10% of the network. with more samples from the random walk, the mean relative errors settle to around 4-5%. however, once we measure the mean relative errors with respect to λ(a), it becomes clearer that the estimator t_2(a) does better when the graph is closer to the assumed (i.e. chung-lu) model. for the chung-lu graph, the mean error with respect to λ(a) is essentially very similar to that with respect to t_1(a), which is to be expected. for the preferential attachment graph too, it is clear that the estimate t_2 is able to achieve a better than 10% relative error approximation of λ(a). note that, if we were instead counting only the nodes whose degrees were actually used for estimation, the fraction of the network used would be roughly 1-2% in all the cases; the majority of the node cost actually goes into making the random walk mix. in this work, we investigated the problem of computing sir epidemic thresholds of social contact networks from the perspective of statistical inference. we considered the two challenges that arise in this context, due to the high computational and data-collection complexity of the spectral radius. for the chung-lu network generative model, the spectral radius can be characterized in terms of the degree moments. we utilized this fact to develop two approximations of the spectral radius. the first approximation is computationally efficient and statistically accurate, but requires data on the observed degrees of all nodes. the second approximation retains the computational efficiency and statistical accuracy of the first approximation, while also reducing the number of queries or the sample size quite substantially. the results seem very promising for networks arising from the chung-lu and preferential attachment generative models. there are several interesting and important future directions.
the methods proposed in this paper have provable guarantees only under the chung-lu model, although the approach works very well under the preferential attachment model. this seems to indicate that the degree based approximation might be applicable to a wider class of models. on the other hand, this leaves open the question of developing a better "model-free" estimator, as well as asking similar questions about other network features. in this work we only considered the problem of accurate approximation of the epidemic threshold. from a statistical as well as a real-world perspective, there are several related inference questions. these include uncertainty quantification, confidence intervals, one-sample and two-sample testing, etc. social interaction patterns vary dynamically over time, and such network dynamics can have significant impacts on the contagion process (leitch et al., 2019). in this paper we only considered static social contact networks, and in future work we hope to study epidemic thresholds for time-varying or dynamic networks. we do realize that in the face of the current pandemic, while it is important to pursue research relevant to it, it is also important to be responsible in following the proper scientific process. we would like to state that in this work, the question of epidemic threshold estimation has been formalized from a theoretical viewpoint in a much used, but simple, random graph model. we are not yet in a position to give any guarantees about the performance of our estimator in real social networks. we do hope, however, that the techniques developed here can be further refined to give reliable estimators in practical settings.
references:
a random graph model for massive graphs
emergence of scaling in random networks
episimdemics: an efficient algorithm for simulating the spread of infectious disease over large realistic social networks
largest eigenvalues of sparse inhomogeneous erdős-rényi graphs
using mobile phone data to predict the spatial spread of cholera
a bootstrap-based inference framework for testing similarity of paired networks
a nonparametric view of network models and newman-girvan and other modularities
hypothesis testing for automated community detection in networks
spectral radii of sparse random matrices (annales de l'institut henri poincare (b) probability and statistics)
mathematical models in population biology and epidemiology
epidemic thresholds in real networks
the effect of travel restrictions on the spread of the 2019 novel coronavirus
the average distances in random graphs with given expected degrees
eigenvalues of random power law graphs
on the spectra of general random graphs (the electronic journal of combinatorics)
invasion threshold in heterogeneous metapopulation networks
experimental evidence of a pathogen invasion threshold
large graph limit for an sir process in random network with heterogeneous connectivity
modelling disease outbreaks in realistic urban social networks
dimensions of superspreading
practical methods for graph two-sample testing
discrete-time markov chain approach to contact-based disease spreading in complex networks
model-based clustering for social networks
the mathematics of infectious diseases
probability inequalities for sums of bounded random variables
latent space approaches to social network analysis
clinical features of patients infected with 2019 novel coronavirus in wuhan, china (the lancet)
the implications of network structure for epidemic dynamics
a contribution to the mathematical theory of epidemics (proceedings of the royal society of london, series a, containing papers of a mathematical and physical character)
contributions to the mathematical theory of epidemics. ii. the problem of endemicity
contributions to the mathematical theory of epidemics. iii. further studies of the problem of endemicity
statistical evaluation of spectral methods for anomaly detection in static networks
spatial spread of the west africa ebola epidemic
representing degree distributions, clustering, and homophily in social networks with latent cluster random effects models
toward epidemic thresholds on temporal networks: a review and open questions
chernoff-type bound for finite markov chains
network theory and sars: predicting outbreak diversity
spread of epidemic disease on networks
epidemic processes in complex networks
the similarity between stochastic kronecker and chung-lu graph models
modeling control strategies of respiratory pathogens
got the flu (or mumps)? check the eigenvalue!
simulated epidemics in an empirical spatiotemporal network of 50,185 sexual contacts
spectral clustering and the high-dimensional stochastic blockmodel
anomaly detection in static networks using egonets
spectral clustering in heterogeneous networks (statistica sinica)
a block model for node popularity in networks with community structure
pulse vaccination strategy in the sir epidemic model
early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population-level observational study (the lancet digital health)
a semiparametric two-sample hypothesis testing problem for random graphs
a nonparametric two-sample hypothesis testing problem for random graphs
reproduction numbers and subthreshold endemic equilibria for compartmental models of disease transmission
a measles epidemic threshold in a highly vaccinated population
a novel coronavirus outbreak of global health concern
predicting the epidemic threshold of the susceptible-infected-recovered model
unification of theoretical approaches for epidemic spreading on complex networks
epidemic spreading in real networks: an eigenvalue viewpoint
likelihood-based model selection for stochastic block models
heterogeneities in the transmission of infectious agents: implications for the design of control programs
model selection for degree-corrected block models
random graph models for dynamic networks
performance evaluation of social network anomaly detection using a moving window-based scan method
consistency of community detection in networks under degree-corrected stochastic block models
a novel coronavirus from patients with pneumonia in china
key: cord-200147-ans8d3oa title: neural networks and value at risk date: 2020-05-04 journal: nan doi: nan sha: doc_id: 200147 cord_uid: ans8d3oa
utilizing a generative regime switching framework, we perform monte-carlo simulations of asset returns for value at risk threshold estimation. using equity markets and long term bonds as test assets in the global, us, euro area and uk setting over an up to 1,250 weeks sample horizon ending in august 2018, we investigate neural networks along three design steps relating (i) to the initialization of the neural network, (ii) its incentive function according to which it has been trained and (iii) the amount of data we feed.
first, we compare neural networks with random seeding with networks that are initialized via estimations from the best-established model (i.e. the hidden markov). we find the latter to outperform in terms of the frequency of var breaches (i.e. the realized return falling short of the estimated var threshold). second, we balance the incentive structure of the loss function of our networks by adding a second objective to the training instructions so that the neural networks optimize for accuracy while also aiming to stay in empirically realistic regime distributions (i.e. bull vs. bear market frequencies). in particular, this design feature enables the balanced incentive recurrent neural network (rnn) to outperform the single incentive rnn as well as any other neural network or established approach by statistically and economically significant levels. third, we halve our training data set of 2,000 days. we find our networks, when fed with substantially less data (i.e. 1,000 days), to perform significantly worse, which highlights a crucial weakness of neural networks in their dependence on very large data sets ... while leading papers on machine learning in asset pricing focus predominantly on returns and stochastic discount factors (chen, pelger & zhu 2020; gu, kelly & xiu 2020), we are motivated by the global covid-19 virus crisis and the subsequent stock market crash to investigate if and how machine learning methods can enhance value at risk (var) threshold estimates. in line with gu, kelly & xiu (2020: 7), we like to open by acknowledging that "[m]achine learning methods on their own do not identify deep fundamental associations" without human scientists designing hypothesized mechanisms into an estimation problem. 1 nevertheless, measurement errors can be reduced based on machine learning methods. hence, machine learning methods employed as means to an end instead of as ends in themselves can significantly support researchers in challenging estimation tasks. 2 in their already legendary paper, gu, kelly & xiu (gkx in the following, 2020) apply machine learning to a key problem in the academic finance literature: 'measuring asset risk premia'. they observe that machine learning improves the description of expected returns relative to traditional econometric forecasting methods based on (i) better out-of-sample r-squared and (ii) forecasts earning larger sharpe ratios. more specifically, they compare four 'traditional' methods (ols, glm, pcr/pca, pls) with regression trees (e.g. random forests) and a simple 'feed forward neural network' based on 30k stocks over 720 months, using 94 firm characteristics, 74 sectors and 900+ baseline signals. crediting inter alia (i) the flexibility of functional form and (ii) the enhanced ability to prioritize vast sets of baseline signals, they find the feed forward neural networks (ffnn) to perform best. contrary to results reported from computer vision, gkx further observe that "'shallow' learning outperforms 'deep' learning" (p.47), as their neural network with 3 hidden layers excels beyond neural networks with more hidden layers. they interpret this result as a consequence of a relatively much lower signal to noise ratio and much smaller data sets in finance. interestingly, the outperformance of nns over the other 5 methods widens at portfolio compared to stock level, another indication that an understanding of the signal to noise ratio in financial markets is crucial when training neural networks.
that said, while classic ols is statistically significantly weaker than all other models, nn3 beats all others but not always at statistically significant levels. gkx finally confirm their results via monte carlo simulations. they show that if one generated two hypothetical security price datasets, one linear and un-interacted and one nonlinear and interactive, ols and glm would dominate in the former, while nns dominate in the latter. they conclude by attributing the "predictive advantage [of neural networks] to accommodation of nonlinear interactions that are missed by other methods" (p.47). following gkx, an extensive literature on machine learning in finance is rapidly emerging. chen, pelger and zhu (cpz in the following, 2020) introduce more advanced (i.e. recurrent) neural networks and estimate a (i) non-linear asset pricing model (ii) regularized under no-arbitrage conditions operationalized via a stochastic discount factor (iii) while considering economic conditions. in particular, they attribute the time varying dependency of the stochastic discount factor of about ten thousand us stocks to macroeconomic state processes via a recurrent long short term memory (lstm) network. in cpz's (2020: 5) view, "it is essential to identify the dynamic pattern in macroeconomic time series before feeding them into a machine learning model". avramov et al. (2020) replicate the approaches of gkx (2020), cpz (2020), and two conditional factor pricing models, kelly, pruitt, and su's (2019) linear instrumented principal component analysis (ipca) and gu, kelly, and xiu's (2019) nonlinear conditional autoencoder, in the context of real-world economic restrictions. while they find strong fama french six factor (ff6) adjusted returns in the original setting without real world economic constraints, these returns reduce by more than half if microcaps or firms without credit ratings are excluded. in fact, avramov et al. (2020: 3) find that "[e]xcluding distressed firms, all deep learning methods no longer generate significant (value-weighted) ff6-adjusted return at the 5% level." they confirm this finding by showing that the gkx (2020) and cpz (2020) machine learning signals perform substantially weaker in economic conditions that limit arbitrage (i.e. low market liquidity, high market volatility, high investor sentiment). curiously though, avramov et al. (2020: 5) find that the only linear model they analyse, kelly et al.'s (2019) ipca, "stands out … as it is less sensitive to market episodes of high limits to arbitrage." their finding as well as the results of cpz (2020) imply that economic conditions have to be explicitly accounted for when analysing the abilities and performance of neural networks. furthermore, avramov et al. (2020) as well as gkx (2020) and cpz (2020) make anecdotal observations that machine learning methods appear to reduce drawdowns. 1 while their manuscripts focused on return predictability, we devote our work to risk predictability in the context of market wide economic conditions. the covid-19 crisis as well as the density of economic crises in the previous three decades imply that catastrophic 'black swan' type risks occur more frequently than predicted by symmetric economic distributions. consequently, underestimating tail risks can have catastrophic consequences for investors.
hence, the analysis of risks with the ambition to avoid underestimations deserves, in our view, attention equivalent to the analysis of returns with its ambition to identify investment opportunities resulting from mispricing. more specifically, since a symmetric approach such as the "mean-variance framework implicitly assumes normality of asset returns, it is likely to underestimate the tail risk for assets with negatively skewed payoffs" (agarwal & naik, 2004: 85). empirically, equity market indices usually exhibit, not only since covid-19, negative skewness in their return payoffs (albuquerque, 2012, kozhan et al. 2013). consequently, it is crucial for a post covid-19 world with its substantial tail risk exposures (e.g. second pandemic wave, climate change, cyber security) that investors are provided with tools which avoid the underestimation of risks as far as possible. naturally, neural networks with their near unlimited flexibility in modelling non-linearities appear suitable candidates for such conservative tail risk modelling that focuses on avoiding underestimation. we regard giglio & xiu (2019) and kozak, nagel & santosh (2020) as also noteworthy, as are efforts by fallahgouly and franstiantoz (2020) and horel and giesecke (2019) to develop significance tests for neural networks. our paper investigates if basic and/or more advanced neural networks have the capability of underestimating tail risk less often at common statistical significance levels. we operationalize tail risk as value at risk, which is the most used tail risk measure in both commercial practice and the academic literature (billio et al. 2012, billio and pellizon, 2000, jorion, 2005, nieto & ruiz, 2015). specifically, we estimate var thresholds using classic methods (i.e. mean/variance, hidden markov model) 1 as well as machine learning methods (i.e. feed forward, convolutional, recurrent), which we advance via initialization of input parameters and regularization of the incentive function. recognizing the importance of economic conditions (avramov et al. 2020, chen et al. 2020), we embed our analysis in a regime-based asset allocation setting. specifically, we perform monte-carlo simulations of asset returns for value at risk threshold estimation in a generative regime switching framework. using equity markets and long term bonds as test assets in the global, us, euro area and uk setting over an up to 1,250 weeks sample horizon ending in august 2018, we investigate neural networks along three design steps relating (i) to the initialization of the neural network's input parameters, (ii) its incentive function according to which it has been trained and which can lead to extreme outputs if it is not regularized, as well as (iii) the amount of data we feed. first, we compare neural networks with random seeding with networks that are initialized via estimations from the best-established model (i.e. the hidden markov). we find the latter to outperform in terms of the frequency of var breaches (i.e. the realized return falling short of the estimated var threshold). second, we balance the incentive structure of the loss function of our networks by adding a second objective to the training instructions so that the neural networks optimize for accuracy while also aiming to stay in empirically realistic regime distributions (i.e. bull vs. bear market frequencies). this design feature leads to better regularization of the neural network, as it substantially reduces the extreme outcomes that can result from a single incentive function.
in particular, this design feature enables the balanced incentive recurrent neural network (rnn) to outperform the single incentive rnn as well as any other neural network or established approach by statistically and economically significant levels. third, we halve our training data set of 2,000 days. we find our networks, when fed with substantially less data (i.e. 1,000 days), to perform significantly worse, which highlights a crucial weakness of neural networks in their dependence on very large data sets. our contributions are fivefold. first, we extend the currently return focused literature of machine learning in finance (avramov et al. 2020, chen et al. 2020, gu et al. 2020) to also focus on the estimation of risk thresholds. assessing the advancements that machine learning can bring to risk estimation potentially offers valuable innovation to asset owners such as pension funds and can better protect the retirement savings of their members. 2 second, we advance the design of our three types of neural networks by initializing their input parameters with the best established model. while initializations are a common research topic in core machine learning fields such as image classification or machine translation (glorot & bengio, 2010), we are not aware of any systematic application of initialized neural networks in the field of finance. hence, demonstrating the statistical superiority of an initialized neural network over its non-initialized self appears a relevant contribution to the community. third, while cpz (2020) regularize their neural networks via no arbitrage conditions, we regularize via balancing the incentive function of our neural networks on multiple objectives (i.e. estimation accuracy and empirically realistic regime distributions). this prevents any single objective from leading to extreme outputs and hence balances the computational power of the trained neural network in desirable directions. in fact, our results show that amendments to the incentive function may be the strongest tool available to us in engineering neural networks. fourth, we also hope to make a marginal contribution to the literature on value at risk estimation. whereas our paper is focused on advancing machine learning techniques and is therefore, following billio and pellizon (2000), anchored in a regime based asset allocation setting 1 to account for time varying economic states (cpz, 2020), we still believe that the nonlinearity and flexible form especially of recurrent neural networks may be of interest to the var (forecasting) literature (billio et al. 2012, nieto & ruiz, 2015, patton et al. 2019). fifth, our final contribution lies in the documentation of weaknesses of neural networks as applied to finance. while avramov et al. (2020) subject neural networks to real world economic constraints and find these to substantially reduce their performance, we expose our neural networks to data scarcity and document just how much data these new approaches need to advance the estimation of risk thresholds. naturally, such a long data history may not always be available in practice when estimating asset management var thresholds, and therefore established methods and neural networks are likely to be used in parallel for the foreseeable future. in section two, we will describe our testing methodology including all five competing models (i.e. mean/variance, hidden markov model, feed forward neural network, convolutional neural network, recurrent neural network).
section three describes the data, model training, monte carlo simulations and baseline results. section four then advances our neural networks via initialization and balancing of the incentive functions and discusses the results of both features. section five conducts robustness tests and sensitivity analyses before section six concludes. 1 we acknowledge that most recent statistical advances in value at risk estimation have concentrated on jointly modelling value at risk and expected shortfall and were therefore naturally less focused on time varying economic states (patton et al. 2019, taylor 2019, 2020).
value at risk estimation with the mean/variance approach
when modelling financial time series related to investment decisions, the asset return of portfolio p at time t as defined in equation (1) below is the focal point of interest instead of the asset price, since investors earn on the difference between the price at which they bought and the price at which they sold: r_{p,t} = (price_{p,t} − price_{p,t−1}) / price_{p,t−1}. (1)
in asset manager's risk budgeting it is advantageous to know about the current market phase (regime) and estimate the probability that the regime changes (schmeding et al., 2019) . the most common way of modelling market regimes is by distinguishing between bull markets and bear markets. unfortunately, market regimes are not directly observable, but are rather to be derived indirectly from market data. regime switching models based on hidden markov models are an established tool for regime based modelling. hidden markov models (hmm)which are based on markov chains -are models that allow for analysing and representing characteristics of time series such as negative skewness (ang and bekaert, 2002; timmerman, 2000) . we employ the hmm for the special case of two economic states called 'regimes' in the hmm context. specifically, we model asset returns y t ∈ n (we are looking at n ≥ 1 assets) at time t to follow an n-dimensional gaussian process with hidden states s ∈ {1, 2} as shown in equation (3): the returns are modelled to have state dependent expected returns μ ∈ as well as covariance σ ∈ . the dynamic of is following a homogenous markov chain with transition probability matrix with = ( = 1 | | −1 = 1 ) and = ( = 2 | | −1 = 2 ) . this definition describes if and how states are changing over time. it is also important to note the 'markov property' that the probability of being in any state at the next point in time only depends on the present state, not the sequence of states that preceded it. furthermore, the probability of being in a state at a certain point in time is given as π = ( = 1) and (1 − π ) = ( = 2). this is also called smoothed state probability. by estimating the smoothed probability πt of the last element of the historical window as the present regime probability, we can use the model to start from there and perform monte-carlo simulations of future asset returns for the next days. 1 this is outlined for the two-regimes case in figure 1 below. 2 figure 1 : algorithm for the hidden markov monte-carlo simulation (for two regimes) 1: estimate = ( 0 , , , σ) from history when graves [13] successfully made use of a long short-term memory (lstm) based recurrent neural network to generate realistic sequences of handwriting, he followed the idea of using a mixture density network (mdn) to parametrize a gaussian mixture predictive distribution (bishop, 1995) . compared to standard neural networks (multi-layer perceptron) as used by gkx (2020), this network does not only predict the conditional average of the target variable as point estimate (in gkx' case expected risk premia), but rather estimates the conditional distribution of the target variable. given the autoregressive nature of graves' approach, the output distributions are not assumed to be static over time, but dynamically conditioned on previous outputs, thus capturing the temporal context of the data. we consider both characteristics as being beneficial for modelling financial market returns, which experience a low signal to noise ratio as highlighted by gkx' results due to inherently high levels of intertemporal uncertainty. the core of the proposed neural network regime switching framework is a (swappable) neural network architecture, which takes as input the historical sequence of daily asset returns. at the output level, the framework computes regime probabilities and provides learnable gaussian mixture distribution parameters, which can be used to sample new asset returns for monte-carlo simulation. 
a multivariate gaussian mixture model (gmm) is a weighted sum of k different components, each following a distinct multivariate normal distribution, as shown in equation (5): p(y_t) = ∑_{i=1}^{k} φ_i n(y_t; µ_i, σ_i). (5) a gmm by its nature does not assume a single normal distribution, but naturally models a random variable as the interleave of different (multivariate) normal distributions. in our model, we interpret k as the number of regimes, and φ_i explains how much each regime contributes to the current output. in other words, φ_i can be seen as the probability that we are in regime i. in this sense the gmm output provides a suitable level of interpretability for the use case of regime based modelling. with regard to the neural network regime switching model, we extend the notion of a gaussian mixture by conditioning φ_i, via a yet undefined neural network f, on the historic asset returns within a window of a certain size. we call this window the receptive field and denote its size by r: φ_{i,t} = f_i(y_{t−r+1}, . . . , y_t). (6) this extension makes the gaussian mixture weights dependent on the (recent) history of the time varying asset returns. note that we only condition φ on the historical returns. the other parameters of the gaussian mixture, (µ_i, σ_i), are modelled as unconditioned, yet optimizable parameters of the model. this basically means we assume the parameters of the gaussians to be constant over time (per regime). this is in contrast to the standard mdn, where (µ_i, σ_i) are also conditioned on the input and therefore can change over time. 1 keeping these remaining parameters unconditional is crucial to allow for a fair comparison between the neural networks and the hmm, which also exhibits time invariant parameters (µ_s, σ_s) alongside its regime shift probabilities. following graves (2013), we define the probability assigned by the network to a sequence and the corresponding sequence loss as shown in equations (7) and (8), respectively: pr(y) = ∏_t pr(y_{t+1} | y_1, . . . , y_t) (7) and l(y) = −∑_t log pr(y_{t+1} | y_1, . . . , y_t). (8) since financial markets operate in weekly cycles, with many investors shying away from exposure to substantial leverage during the illiquid weekend period, we are not surprised to observe that model training is more stable when choosing the predictive distribution to be responsible not only for the next day, but for the next 5 days (hann and steuer, 1995). we call this forward looking window the lookahead. this is also practically aligned with the overall investment process, in which we want to appropriately model the upcoming allocation period, which usually spans multiple days. it also fits with the intuition that regimes do not switch daily but have stability at least for a week. the extended sequence probability and sequence loss are denoted accordingly in equations (9) and (10): pr(y) = ∏_t pr(y_{t+1}, . . . , y_{t+5} | y_1, . . . , y_t) (9) and l(y) = −∑_t log pr(y_{t+1}, . . . , y_{t+5} | y_1, . . . , y_t). (10) an important feature of the neural network regime model is how it simulates future returns. we follow graves' (2013) approach and conduct sequential sampling from the network. when we want to simulate a path of returns for the next n business days, we do this according to the algorithm displayed in figure 2. in accordance with gkx (2020), we first focus our analysis on traditional "feed-forward" neural networks before engaging in more sophisticated neural network architectures for time series analysis within the neural network regime model. the traditional model of neural networks, also called the multi-layer perceptron, consists of an "input layer" which contains the raw input predictors, one or more "hidden layers" that combine input signals in a nonlinear way, and an "output layer", which aggregates the output of the hidden layers into a final predictive signal.
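before turning to the individual architectures, the pytorch sketch below illustrates the type of output layer described above: a small feed-forward body maps the receptive field of past returns to regime probabilities φ, while the regime means and (cholesky factors of the) covariances are kept as unconditioned, learnable parameters. this is our own simplified illustration rather than the authors' code, and the 5-day lookahead loss of equations (9)-(10) is reduced here to a next-step log-likelihood.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegimeMixtureNet(nn.Module):
    """Feed-forward regime network with a Gaussian mixture output (K regimes, n assets)."""
    def __init__(self, n_assets, receptive_field, n_regimes=2, hidden=(32, 16, 8)):
        super().__init__()
        layers, width = [], n_assets * receptive_field
        for h in hidden:
            layers += [nn.Linear(width, h), nn.Tanh()]
            width = h
        self.body = nn.Sequential(*layers)
        self.to_logits = nn.Linear(width, n_regimes)
        # unconditioned, learnable mixture parameters (time invariant per regime)
        self.mu = nn.Parameter(torch.zeros(n_regimes, n_assets))
        self.chol_raw = nn.Parameter(torch.eye(n_assets).repeat(n_regimes, 1, 1) * 0.01)

    def forward(self, window):                       # window: (batch, receptive_field, n)
        phi = F.softmax(self.to_logits(self.body(window.flatten(1))), dim=-1)
        tril = torch.tril(self.chol_raw, diagonal=-1) + torch.diag_embed(
            F.softplus(torch.diagonal(self.chol_raw, dim1=-2, dim2=-1)))
        comps = torch.distributions.MultivariateNormal(self.mu, scale_tril=tril)
        return phi, comps

    def loss(self, window, next_return):             # negative log-likelihood of next return
        phi, comps = self.forward(window)
        log_p = comps.log_prob(next_return.unsqueeze(1))   # broadcasts to (batch, K)
        return -torch.logsumexp(torch.log(phi + 1e-8) + log_p, dim=-1).mean()

net = RegimeMixtureNet(n_assets=2, receptive_field=10)
x = torch.randn(64, 10, 2) * 0.01
print(net.loss(x, torch.randn(64, 2) * 0.01))
```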
the nonlinearity of the hidden layers arises from the application of nonlinear "activation functions" on the combined signals. we visualise the traditional feed forward neural network and its input layers in figure 4. we set up our network structure in alignment with gkx's (2020) best performing neural network 'nn3'. the setup of our network is thus given with 3 hidden layers with a decreasing number of hidden units (32, 16, 8). since we want to capture the temporal aspect of our time series data, we condition the network output on a receptive field of at least 10 days. even though the receptive field of the network is not very high in this case, the dense structure of the network results in a very high number of parameters (1698 in total, including the gmm parameters). in between layers, we make use of the activation function tanh. convolutional neural networks (cnns) can also be applied within the proposed neural network regime switching model. recently, cnns gained popularity for time series analysis, as for example van den oord et al. (2015) successfully applied convolutional neural networks on time series data for generating audio waveforms, achieving state-of-the-art text-to-speech and music generation. their adaption of convolutional neural networks, called wavenet, has shown to be able to capture long ranging dependencies on sequences very well. in its essence, a wavenet consists of multiple layers of stacked convolutions along the time axis. crucial features of these convolutions are that they have to be causal and dilated. causal means that the output of a convolution only depends on past elements of the input sequence. dilated convolutions are ones that exhibit "holes" in their respective kernel, which effectively means that their filter size increases while being dilated with zeros in between. a wavenet is typically constructed with an increasing dilation factor (doubling in size) in each (hidden) layer. by doing so, the model is capable of capturing an exponentially growing number of elements from the input sequence depending on the number of hidden convolutional layers in the network. the number of captured sequence elements is called the receptive field of the network (and in this sense is equal to the receptive field defined for the neural network regime model). 1 the convolutional neural network (cnn), due to its structure of stacked dilated convolutions, has a much greater receptive field than the simple feed forward network and needs far fewer weights to be trained. in the illustration we restricted the number of hidden layers to 3 to convey the idea, while our network structure has 7 hidden layers; each hidden layer furthermore exhibits a number of channels, which are not visualized here. figure 5 illustrates the network's basic structure as a combination of stacked causal convolutions with a dilation factor of d = 2. the backing model presented in this investigation is inspired by wavenet; we restrict the model to the basic layout, using a causal structure and increasing dilation between layers. the output layer comprises the regime predictive distributions by applying a softmax function to the hidden layers' outputs. our network consists of 6 hidden layers, each layer having 3 channels. the convolutions each have a kernel size of 3. in total, the network exhibits 242 weights (including gmm parameters), and the receptive field has a size of 255 days. as graves (2013) was very successful in applying lstms for generating sequences, we also adapt this approach for the neural network regime switching model.
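a minimal version of the causal, dilated convolution stack described above could be written as follows (our own illustration; the 6 layers, 3 channels and kernel size of 3 mirror the description, while the exact receptive field of the sketch depends on these settings and is reported by the code itself):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedCNN(nn.Module):
    """Stacked causal convolutions with doubling dilation, mapping returns to regime probabilities."""
    def __init__(self, n_assets, n_regimes=2, n_layers=6, channels=3, kernel_size=3):
        super().__init__()
        self.convs = nn.ModuleList()
        in_ch = n_assets
        for layer in range(n_layers):
            self.convs.append(nn.Conv1d(in_ch, channels, kernel_size, dilation=2 ** layer))
            in_ch = channels
        self.head = nn.Conv1d(channels, n_regimes, kernel_size=1)
        # receptive field grows as 1 + (kernel_size - 1) * sum(dilations)
        self.receptive_field = 1 + (kernel_size - 1) * sum(2 ** l for l in range(n_layers))

    def forward(self, returns):                     # returns: (batch, n_assets, time)
        x = returns
        for conv in self.convs:
            pad = (conv.kernel_size[0] - 1) * conv.dilation[0]
            x = torch.tanh(conv(F.pad(x, (pad, 0))))    # left-pad only, i.e. causal
        return F.softmax(self.head(x), dim=1)           # regime probabilities per time step

model = CausalDilatedCNN(n_assets=2)
print(model.receptive_field)                            # 127 with these illustrative settings
print(model(torch.randn(8, 2, 300) * 0.01).shape)       # (8, 2, 300)
```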
originally introduced by hochreiter and schmidhuber (1997), a main characteristic of lstms, which are a sub class of recurrent neural networks, is their purpose-built memory cells, which allow them to capture long range dependencies in the data. from a model perspective, lstms differ from other neural network architectures in that they are applied recurrently (see figure 6). the output from the previous sequence element of the network function serves, in combination with the next sequence element, as input for the next application of the network function. in this sense, the lstm can be interpreted as being similar to an hmm, in that there is a hidden state which conditions the output distribution. however, the lstm hidden state not only depends on its previous states, but it also captures long term sequence dependencies through its recurrent nature. perhaps most notably, the receptive field size of an lstm is not bound architecture-wise as in the case of the simple feed forward network and the cnn. instead, the lstm's receptive field depends solely on the lstm's ability to memorize the past input. in our architecture we have one lstm layer with a hidden state size of 5. in total, the model exhibits 236 parameters (including the gmm parameters). the potential of lstms was noted by cpz (2020: 6), who note that "lstms are designed to find patterns in time series data and … are among the most successful commercial ais".
3 assessment procedure
we obtain daily price data for stock and bond indices globally and for three major markets (i.e. eu, uk, us) to study the presented regime based neural network approaches on a variety of stock markets and bond markets. for each stock market, we focus on one major stock index. for bond markets, we further distinguish between long term bond indices (7-10 years) and short term bond indices (1-3 years). the markets in scope thus cover the global, euro area, uk and us settings. the data dates back to at least january 1990 and ends in august 2018, which means covering almost 30 years of market development. hence, the data also accounts for crises like the dot-com bubble in the early 2000s as well as the financial crisis of 2008. this is especially important for testing the regime based approaches. the price indices are given as total return indices (i.e. dividends are treated as being reinvested) to properly reflect market development. the data is taken from refinitiv's datastream. descriptive statistics are displayed in table 1, whereby panel a displays a daily frequency and panel b a weekly frequency. mean returns for equities exceed the returns for bonds, whereby the longer bonds return more than the shorter ones. equities naturally have a much higher standard deviation and a far worse minimum return. in fact, equity returns in all four regions lose substantially more money than bond returns even at the 25th percentile, which highlights that the holy grail of asset allocation is the ability to predict equity market drawdowns. furthermore, equity markets tend to be quite negatively skewed as expected, while short bonds experience a positive skewness, which reflects previous findings (albuquerque, 2012, kozhan et al. 2013) and the inherent differential in the riskiness of both assets' payoffs. [insert table 1 about here] the back testing is done on a weekly basis via a moving window approach. at each point in time, the respective model is fitted by providing the last 2,000 days (which is roughly 8 years) as training data.
we choose this long range window because neural networks are known to need big datasets as inputs and it is reasonable to assume that eight years simultaneously include times of (at least relative) crisis and times of market growth. covering both bull and bear markets in the training sample is crucial to allow the model to "learn" these types of regimes. 1 for all our models we set the number of regimes to k = 2. as we back test an allocation strategy with a weekly re-allocation, we set the lookahead for the neural network regime models to 5 days. we further configured the back testing dates to always align with the end of a business week (i.e. fridays). the classic approach does not need any configuration; model fitting is the same as computing the sample mean and sample covariance of the asset returns within the respective window. the hmm also does not need any further configuration; the baum-welch algorithm is guaranteed to converge the parameters to a local optimum with respect to the likelihood function (baum, 1970). for the neural network regime models, additional data processing is required to learn network weights that lead to meaningful regime probabilities and distribution parameters. an important pre-processing step is input normalization, as it is considered good practice for neural network training (bishop, 1995). for this purpose, we normalize the input data by y' = (y − mean(y)) / var(y). in other words, we demean the input data and scale them by their variance, but without removing the interactions between the assets. we train the network by using the adamax optimizing algorithm (kingma & ba, 2014) while at the same time applying weight decay to reduce overfitting (krogh & hertz, 1992). the learning rate and number of epochs configured for training vary depending on the model. in general, estimating the parameters of a neural network model is a non-convex optimization problem. thus, the optimization algorithm might become stuck in an infeasible local optimum. in order to mitigate this problem, it is common practice to repeat the training multiple times, starting off from different (usually randomly chosen) parameter initializations, and then averaging over the resulting models or picking the best in terms of loss. in this paper, we follow a best-out-of-5 approach, which means each training is done five times with varying initialization and the best one is selected for simulation. the initialization strategy, which we will show in chapter 4.1, further mitigates this problem by starting off from an economically reasonable parameter set. we observe that the in-sample regime probabilities learned by the neural network regime switching models, as compared to those estimated by the hmm based regime switching model, generally show comparable results in terms of distribution and temporal dynamics. when we set k = 2, the model fits two regimes, nearly invariably with one having a positive corresponding equity mean and low volatility, and the other experiencing a low or negative equity mean and high volatility. these regimes can be interpreted as bull and bear market, respectively. the respective in-sample regime probabilities over time also show strong alignment with growth and drawdown phases. this holds true for the vast majority of seeds and hence indicates that the neural network regime model is a valid practical alternative for regime modelling when compared to a hidden markov model.
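the training protocol described above (input normalization, adamax with weight decay, and a best-out-of-5 restart strategy) could be sketched as follows; the helper names and all hyperparameter values are placeholders of this illustration rather than the authors' settings, and the model factory is assumed to return a module with a `loss` method such as the regime network sketched earlier.

```python
import torch

def normalize(returns):
    """Demean and scale by variance, as described in the text (interactions are kept)."""
    return (returns - returns.mean(dim=0)) / returns.var(dim=0)

def train_best_of_5(make_model, windows, targets, epochs=200, lr=1e-3, weight_decay=1e-4):
    """Repeat training five times from different initializations and keep the best run."""
    best_model, best_loss = None, float("inf")
    for restart in range(5):
        model = make_model()
        opt = torch.optim.Adamax(model.parameters(), lr=lr, weight_decay=weight_decay)
        for _ in range(epochs):
            opt.zero_grad()
            loss = model.loss(windows, targets)
            loss.backward()
            opt.step()
        if loss.item() < best_loss:
            best_model, best_loss = model, loss.item()
    return best_model, best_loss
```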
after training the model for a specific point in time, we start a monte carlo simulation of asset returns for the next 5 days (one week -monday to friday). for the purpose of calculating statistically solid quantiles of the resulting distribution, we simulate 100,000 paths for each model. we do this for at least 1093 (emu), and at most 1250 (globally) points in time within the back-test history window. as soon as we have simulated all return paths, we calculate a total (weekly) return for each path. the generated weekly returns follow a non-trivial distribution, which arises from the respective model and its underlying temporal dynamics. based on the simulations we compute quantiles for value at risk estimations. for example, the 0.01 and 0.05 percentile of the resulting distribution represent the 99% and 95% -5 day -var metric, respectively. we evaluate the quality of our value at risk estimations by counting the number of breaches of the asset returns. in case, the actual return is below the estimated var threshold, we count this as a breach. assuming an average performing model, it is e.g. reasonable to expect 5% breaches for a 95% var measurement. we compared the breaches of all models with each other. we classify a model as being superior to another model, if the number of var breaches is less than those from the compared model. a value comparison comp = 1.0(= 0.0) indicates that the row model is superior (inferior) to the column model. we performed significance tests by applying paired t-tests. we further evaluated a dominance value which is defined as shown in equation (11): in our view the three most crucial design features of neural networks in finance, where the sheer number of hidden layers appears less helpful due to the low signal to noise ratio (gkx, 2020), are: amount of input data, initializing information and incentive function. big input data is important for neural networks, as they need to consume sufficient evidence also of rarer empirical features to ensure that their nonlinear abilities in fitting virtually any functional form are used in a relevant instead of an exotic manner. similarly, the initialization of input parameters should be as much as possible based on empirically established estimates to ensure that the gradient descent inside the neural network takes off from a suitable point of departure, thereby substantially reducing the risks that a neural network confuses itself into irrelevant local minima. on the output side, every neural network is trained according to an incentive (i.e. loss) function. it is this particular loss function which determines the direction of travel for the neural network, which has no other ambitions than to minimize its loss best possible. hence, if the loss function only represents one of several practically relevant parameters, the neural network may come to results with bizarre outcomes for those parameters not included in its incentive function. in our case, for instance, the baseline incentive is just estimation accuracy which could lead to forecasts dominated much more by a single regime than ever observed in practice. in other words, after a long bull market, the neural network could "conclude" that bear markets do not exist. metaphorically spoken, a unidimensional loss function in a neural network has little decency (marcus, 2018) . 
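the var back-test just described can be summarized in a few lines; the sketch below assumes a simulate_weekly_returns sampler provided by whichever model is being tested (a hypothetical name, not the authors' code) and simply computes the empirical quantiles and the breach rate.

import numpy as np

def weekly_var(simulate_weekly_returns, n_paths=100_000, levels=(0.01, 0.05)):
    # simulate_weekly_returns(n) -> array of n simulated total weekly returns;
    # the 0.01 and 0.05 quantiles give the 99% and 95% 5-day var thresholds
    sims = simulate_weekly_returns(n_paths)
    return {lvl: np.quantile(sims, lvl) for lvl in levels}

def breach_rate(realized_weekly_returns, var_thresholds):
    # a breach occurs when the realized return falls below the estimated threshold
    realized = np.asarray(realized_weekly_returns)
    thresholds = np.asarray(var_thresholds)
    return np.mean(realized < thresholds)

# toy usage with a gaussian sampler (illustrative only, not a model from the paper)
rng = np.random.default_rng(0)
print(weekly_var(lambda n: rng.normal(0.001, 0.02, n)))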
commencing with the initialization and the incentive functions, we will assess our three neural networks in the following vis-a-vis the classic and hmm approaches, where each of the three networks is once displayed with an advanced design feature and once with a naïve design feature. if no specific initialization strategy for neural networks is defined, initialization occurs entirely at random, normally via computer-generated random numbers. where established econometric approaches use naïve priors (i.e. the mean), neural networks originally relied on brute-force computing power and a bit of luck. hence, it is unsurprising that initializations are nowadays a common research topic in core machine learning fields such as image classification or machine translation (glorot & bengio, 2010). however, we are not aware of any systematic application of initialized neural networks in the field of finance. hence, we compare naïve neural networks, which are not initialized, with neural networks that have been initialized with the best available prior. in our case, the best available prior for μ, σ of the model is the equivalent hmm estimation based on the same window. 1 such initialization is feasible, since the structure of the neural network - due to its similarity with respect to μ, σ - is broadly comparable with the hmm. in other words, we make use of already trained parameters from hmm training as starting parameters for the neural network training. in this sense, initialized neural networks are not only flexible in their functional form, they are also adaptable to "learn" from the best established model in the field if suitably supervised by the human data scientists. metaphorically speaking, our neural networks can stand on the shoulders of the giant that the hmm is for regime-based estimations. table 2 presents the results by comparing breaches between the two classic approaches (mean/variance, hmm) and the non-initialized and hmm-initialized neural networks across all four regions. panels a and b display the 1% var threshold for equities and long bonds, respectively, while panels c and d show the equivalent comparison for 5% var thresholds. 2 note that for model training we apply a best-out-of-5 strategy as described in section 3.2. that means we repeat the training five times, starting off with random parameter initializations each time. in case of the presented hmm-initialized model, we apply the same strategy, with the exception that μ, σ of the model are initialized the same for each of the five iterations. all residual parameters are initialized randomly as fits best according to the neural network part of the model. three findings are observable: first, not a single var threshold estimation process in a single region and in either of the two asset classes was able to uphold its promise that an estimated 1% var threshold should be breached no more than 1% of the time. this is very disappointing and quite alarming for institutional investors such as pension funds and insurers, since it implies that all approaches - established and machine learning based - fail to sufficiently capture downside tail risks and hence underestimate 1% var thresholds. the vast majority of approaches estimate var thresholds that are breached in more than 2% of the cases, and the lstm fails entirely if not initialised. in fact, even the best method, the hmm for us equities, estimates var thresholds which are breached in 1.34% of the cases.
second, when inspecting the ability of our eight methods to estimate 5% var thresholds, the result remains bad but is less catastrophic. the mean/variance approach, the hmm and the initialised lstm display cases where their var thresholds were breaches in less than the expected 5%. the mean/variance and hmm approach make their thresholds in 3 out of 8 cases and the initialised lstm in 1 out of 8. overall, this is still a disappointing performance, especially for the feed forward neural network and the cnn. 1 even though we initialize , σ from hmm parameters, we still have weights to be initialized arising from the temporal neural network part of the model. we do this on a per layer level by sampling uniformly as where i is the number of input units for this layer. 2 we focus our discussion of results on the equities and long bonds since these have more variation, lower skewness and hence risk. results for the short bonds are available upon request from the contact author. third, when comparing the initialised with the non-initialised neural networks, the performance is like day vs. night. the non-initialised neural networks perform always worse and the lstm performs entirely dismal without a suitable prior. when comparing across all eight approaches, the hmm appears most competitive which means that we either have to further advance the design of our neural networks or their marginal value add beyond classic econometric approaches appears inexistent. to advance the design of our neural networks further, we aim to balance its utility function to avoid extreme unrealistic results possible in the univariate case. [insert table 2 about here] whereas cpz (2020) regularize their neural networks via no arbitrage conditions, we regularize via balancing the incentive function of our neural networks on multiple objectives. specifically, we extend the loss function to not only focus on accuracy of point estimates but also give some weight to eventually achieving empirically realistic regime distributions (i.e. in our data sample across all four regions no regimes display more than 60% frequency on a weekly basis). this balanced extension of the loss function prevents the neural networks from arriving at bizarre outcomes such as the conclusion that bear markets (or even bull markets) barely exist. technically, such bizarre outcomes result from cases where the regime probabilities φi(t) tend to converge globally either into 0 or 1 for all t, which basically means the neural network only recognises one-regime. to balance the incentive function of the neural network and facilitate balancing between regime contributions, we introduced an additional regularization term reg into the loss function which penalizes unbalanced regime probabilities. the regularization term is displayed in equation (13) below. if bear and bull market have equivalent regime probabilities the term converges to 0.5, while it converges towards 1 the larger the imbalance between the two regimes. substituting equation (13) into our loss function of equation (10), leads to equation (14) below, which doubles the point estimation based standard loss function in case of total regime balance inaccuracy but adds only 50% of the original loss function in case of full balance. conditioning the extension of the loss function on its origin is important to avoid biases due to diverging scales. setting the additional incentive function to initially have half the marginal weight of the original function also seems appropriate for comparability. 
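the balancing idea and the hmm-based initialization can be sketched compactly. since equations (13) and (14) themselves are not reproduced in this extract, the regularizer below is only one simple form with the limiting behaviour described above (0.5 under full balance, 1 under total imbalance, doubling the base loss in the worst case), and the hmmlearn-based initialization of μ, σ is our own illustrative reading of the procedure, not the authors' implementation.

import torch

def balance_reg(probs):
    # probs: tensor of shape (time, 2) with regime probabilities phi_i(t);
    # equals 0.5 when the two regimes are perfectly balanced and tends to 1
    # the more a single regime dominates
    return probs.max(dim=-1).values.mean()

def regularized_loss(base_loss, probs):
    # doubles the base loss under total regime imbalance, adds only 50% of it
    # under full balance, mirroring the limiting behaviour described for (14)
    return base_loss * (1.0 + balance_reg(probs))

def init_from_hmm(model, returns):
    # initialize the regime means/variances of a model like the earlier sketch
    # from a fitted gaussian hmm (assuming the hmmlearn package is available);
    # all remaining weights stay randomly initialized
    from hmmlearn.hmm import GaussianHMM
    hmm = GaussianHMM(n_components=2, covariance_type="diag").fit(returns)
    with torch.no_grad():
        model.mu.copy_(torch.tensor(hmm.means_, dtype=torch.float32))
        model.log_var.copy_(torch.tensor(hmm.covars_, dtype=torch.float32).log())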
the outcome of balancing the incentive functions of our neural networks are displayed in table 3 , where panels a-d are distributed as previously in table 2 . the results are very encouraging, especially with regards to the lstm. the regularized lstm is in all 32 cases (i.e. 2 thresholds, 2 asset classes, 4 regions) better than the non-regularized lstm. for the 5% var thresholds, it reaches realized occurrences of less than 4% in half the cases. this implies that the regularized lstm can even be more cautious than required. the regularized lstm also sets a new record for the 1% [insert table 4 about here] to measure how much value the regularized lstm can add compared to alternative approaches, we compute the annual accumulated costs of breaches as well as the average cost per breach. they are displayed in table 5 for the 5% var threshold. the regularized lstm is for both numbers in any case better than the classic approaches (mean/variance ad hmm) and the difference is economically meaningful. for equities the regularized lstm results in annual accumulated costs of 97-130 basis points less than the classic mean/variance approach, which would be up to over one billion us$ avoid loss per annum for a > us$100 billion equity portfolios of pension fund such as calpers or pggm. compared to the hmm approach, the regularized lstm avoids annual accumulated costs of 44-88 basis points, which is still a substantial amount of money for the vast majority of asset owners. with respect to long bonds, where total returns are naturally lower, the regularized lstm's avoided annual costs against the mean/variance and the hmm approach range between 23-30 basis points, which is high for bond markets. [insert table 5 about here] these statistically and economically attractive results have been achieved, however, based on 2,000 days of training data. such "big" amounts of data may not always be available for newer investment strategies. hence, it is natural to ask if the performance of the regularized neural networks drop when fed with just half the data (i.e. 1,000 days). apart from reducing statistical power, a period of over 4 years also may comprise less information on downside tail risks. indeed, the results displayed in table 6 show that in all context of var thresholds and asset classes, the regularized networks trained on 2,000 days substantially outperform and usually dominate their equivalently designed neural networks with half the training data. hence, the attractive risk management features for hmm initialised, balanced incentive lstms are likely only available for established discretionary investment strategies where sufficient historical data is available or for entirely rules-based approaches whose history can be replicated ex-post with sufficient confidence. [insert table 6 about here] we further conduct an array of robustness tests and sensitivity analysis to challenge our results and the applicability of neural network based regime switching models. as first robustness test, we extend the regularization in a manner that the balancing incentive function of equation (13) has the same marginal weight than the original loss function instead of just half the marginal weight. the performance of both types of regularized lstms is essentially equivalent second, we study higher var thresholds such as 10% and find the results to be very comparable to the 5% var results. third, we estimate monthly instead of weekly var. 
accounting for the loss of statistical power in comparison tests due to the lower number of observations, the results are equivalent again. we conduct two sensitivity analysis. first, we set up our neural networks to be generalized by two balancing incentive functions but without hmm initialisation. the results show the regularization enhances performance compared to the naïve non-regularized and non-initialized models but that both design features are needed to achieve the full performance. in other words, initialization and regularization seem additive design features in terms of neural network performance. second, we run analytical approaches with k > 2 regimes. adding a third or even fourth regime when asset prices only know two directions leads to substantial instability in the neural networks and tends to depreciate the quality of results. inspired by gkx (2020)'s and cpz (2020) to outperform the single incentive rnn as well as any other neural network or established approach by statistically and economically significant levels. third, we half our training data set of 2,000 days. we find our networks when fed with substantially less data (i.e. 1,000 days) to perform significantly worse which highlights a crucial weakness of neural networks in their dependence on very large data sets. hence, we conclude that well designed neural networks, i.e. a recurrent lstm neural network initialized with best current evidence and balanced incentivescan potentially advance the protection offered to institutional investors by var thresholds through a reduction in threshold breaches. however, such advancements rely on the availability of a long data history, which may not always be available in practice when estimating asset management var thresholds. descriptive statistics of the daily returns of the main equity index (equity), the main sovereign bond with (short) 1-3 years maturity (sb1-3y) and the main sovereign bond (long) with 7-10 year maturity (sb7-10). descriptive statistics include sample length, the first three moments of the return distribution and 11 thresholds along the return distribution. risks and portfolio decisions involving hedge funds skewness in stock returns: reconciling the evidence on firm versus aggregate returns can machines learn capital structure dynamics? working paper international asset allocation with regime shifts machine learning, human experts, and the valuation of real assets machine learning versus economic restrictions: evidence from stock return predictability a maximization technique occurring in the statistical analysis of probabilistic functions of markov chains bond risk premia with machine learning value-at-risk: a multivariate switching regime approach econometric measures of connectedness and systemic risk in the finance and insurance sectors neural networks for pattern recognition deep learning in asset pricing subsampled factor models for asset pricing: the rise of vasa microstructure in the machine age towards explaining deep learning: significance tests for multi-layer perceptrons asset pricing with omitted factors how to deal with small data sets in machine learning: an analysis on the cat bond market understanding the difficulty of training deep feedforward neural networks generating sequences with recurrent neural networks autoencoder asset pricing models much ado about nothing? exchange rate forecasting: neural networks vs. linear models using monthly and weekly data j. 
long short-term memory towards explainable ai: significance tests for neural networks improving earnings predictions with machine learning. working paper jorion, p.. value at risk characteristics are covariances: a unified model of risk and return adam: a method for stochastic optimization shrinking the cross-section the skew risk premium in the equity index market a simple weight decay can improve generalization advances in financial machine learning deep learning: a critical appraisal frontiers in var forecasting and backtesting dynamic semiparametric models for expected shortfall (and value-at-risk) maschinelles lernen bei der entwicklung von wertsicherungsstrategien. zeitschrift für das gesamte kreditwesen deep learning for mortgage risk forecasting value at risk and expected shortfall using a semiparametric approach based on the asymmetric laplace distribution forecast combinations for value at risk and expected shortfall moments of markov switching models verstyuk, s. 2020. modeling multivariate time series in economics: from auto-regressions to recurrent neural networks. working paper fixup initialization: residual learning without normalization. interantional conference on learning representations (iclr) paper acknowledgments: we are grateful for comments from theodor cojoianu, james hodson, juho kanniainen, qian li, yanan, andrew vivian, xiaojun zeng and participants at 2019 financial data science association conference in san francisco the international conference on fintech and financial data science at university college dublin (ucd). the views expressed in this manuscript are not necessarily shared by sociovestix labs, the technical expert group of dg fisma or warburg invest ag. authors are listed in alphabetical order, whereby hoepner serves as the contact author (andreas.hoepner@ucd.ie). any remaining errors are our own. key: cord-198449-cru40qp4 authors: carballosa, alejandro; mussa-juane, mariamo; munuzuri, alberto p. title: incorporating social opinion in the evolution of an epidemic spread date: 2020-07-09 journal: nan doi: nan sha: doc_id: 198449 cord_uid: cru40qp4 attempts to control the epidemic spread of covid19 in the different countries often involve imposing restrictions to the mobility of citizens. recent examples demonstrate that the effectiveness of these policies strongly depends on the willingness of the population to adhere them. and this is a parameter that it is difficult to measure and control. we demonstrate in this manuscript a systematic way to check the mood of a society and a way to incorporate it into dynamical models of epidemic propagation. we exemplify the process considering the case of spain although the results and methodology can be directly extrapolated to other countries. both the amount of interactions that an infected individual carries out while being sick and the reachability that this individual has within its network of human mobility have a key role on the propagation of highly contagious diseases. if we picture the population of a given city as a giant network of daily interactions, we would surely find highly clustered regions of interconnected nodes representing families, coworkers and circles of friends, but also several nodes that interconnect these different clustered regions acting as bridges within the network, representing simple random encounters around the city or perhaps people working at customer-oriented jobs. 
it has been shown that the most effective way to control the virulent spread of a disease is to break down the connectivity of these networks of interactions, by means of imposing social distancing and isolation measures to the population [1] . for these policies to succeed however, it is needed that the majority of the population adheres willingly to them since frequently these contention measures are not mandatory and significant parts of the population exploit some of the policies gaps or even ignore them completely. in diseases with a high basic reproduction number, i.e., the expected number of new cases directly generated by one infected case, such is the case of covid19, these individuals represent an important risk to control the epidemic as they actually conform the main core of exposed individuals during quarantining policies. in case of getting infected, they can easily spread the disease to their nearest connections in their limited but ongoing everyday interactions, reducing the effectiveness of the social distancing constrains and helping on the propagation of the virus. measures of containment and estimating the degree of adhesion to these policies are especially important for diseases where there can be individuals that propagate the virus to a higher number of individuals than the average infected. these are the so-called super-spreaders [2, 3] and are present in sars-like diseases such as the covid19. recently, a class of super-spreaders was successfully incorporated in mathematical models [4] . regarding the usual epidemiological models based on compartments of populations, a viable option is to introduce a new compartment to account for confined population [5] . again, this approach would depend on the adherence of the population to the confinement policies, and taking into account the rogue individuals that bypass the confinement measures, it is important to accurately characterize the infection curves and the prediction of short-term new cases of the disease, since they can be responsible of a dramatic spread. here, we propose a method that quantitatively measures the state of the public opinion and the degree of adhesion to an external given policy. then, we incorporate it into a basic epidemic model to illustrate the effect of changes in the social network structure in the evolution of the epidemic. the process is as follows. we reconstruct a network describing the social situation of the spanish society at a given time based on data from social media. this network is like a radiography of the social interactions of the population considered. then, a simple opinion model is incorporated to such a network that allows us to extract a probability distribution of how likely the society is to follow new opinions (or political directions) introduced in the net. this probability distribution is later included in a simple epidemic model computed along with different complex mobility networks where the virus is allowed to spread. the framework of mobility networks allows the explicit simulation of entire populations down to the scale of single individuals, modelling the structure of human interactions, mobility and contact patterns. these features make them a promising tool to study an epidemic spread (see [6] for a review), especially if we are interested in controlling the disease by means of altering the interaction patterns of individuals. 
at this point, we must highlight the difference between the two networks considered: one is collected from real data from social media and it is used to feel the mood of the collective society, while the other is completely in-silico and proposed as a first approximation to the physical mobility of a population. the study case considered to exemplify our results considers the situation in spain. this country was hard-hit by the pandemic with a high death-toll and the government reacted imposing a severe control of the population mobility that it is still partially active. the policy worked and the epidemic is controlled, nevertheless it has been difficult to estimate the level of adherence to those policies and the repercussions in the sickness evolution curve. this effect can also be determinant during the present transition to the so-called 'new normal'. the manuscript is organized as follows. in section 2 we describe the construction of the social network from scratch using free data from twitter, the opinion model is also introduced here and described its coupling to the epidemiological model. section 3 contains the main findings and computations of the presented models, and section 4 a summary and a brief discussion of the results, with conclusions and future perspectives. in order to generate a social network, we use twitter. we downloaded several networks of connections (using the tool nodexl [7] ). introducing a word of interest, nodexl brings information of users that have twitted a message containing the typed word and the connections between them. the topics of the different searches are irrelevant. in fact, we tried to choose neutral topics with potentiality to engage many people independently of political commitment, age, or other distinctions. the importance of each subnet is that it reveals who is following who and allows us to build a more complete network of connections once all the subnets are put together. each one of the downloaded networks will have approximately 2000 nodes [8] . in this way, downloading as many of such subnets as possible gives us a more realistic map of the current situation of the spanish twitter network and, we believe, a realistic approximation to the social interactions nationwide. we intended to download diverse networks politically inoffensive. 'junction' accounts will be needed to make sure that all sub-networks overlap. junction accounts are these accounts that are part of several subnets and warrant the connection between them. if these junction accounts did not exist, isolated local small networks may appear. go to supplementary information to see the word-of-interest networks downloaded and overlapped. twitter, as a social network, changes in time [9] , [10] , [11] and it is strongly affected by the current socio-political situation, so important variations in its configuration are expected with time. specifically, when a major crisis, such as the current one, is ongoing. taking this into consideration, we analyze two social neworks corresponding to different moments in time. one represents the social situation in october 2019 (with = 17665 accounts) which describes a pre-epidemic social situation and another from april 2020 (with = 24337 accounts) which describes the mandatory-confinement period of time. the networks obtained are directed and the links mark which nodes are following which. so, a node with high connectivity means it is following the opinions of many other nodes. 
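as an illustration of how such word-of-interest subnetworks can be merged into one directed follower graph through shared ("junction") accounts, a short networkx sketch is given below; the edge lists are placeholders, not the actual nodexl exports used in the paper.

import networkx as nx

def build_social_network(edge_lists):
    # edge_lists: one iterable of (follower, followed) pairs per downloaded topic;
    # junction accounts are merged automatically because they share the same id
    full = nx.DiGraph()
    for edges in edge_lists:
        sub = nx.DiGraph()
        sub.add_edges_from(edges)
        full = nx.compose(full, sub)   # union of nodes and edges across subnets
    return full

# toy example with two subnets sharing the junction account "c"
g = build_social_network([[("a", "b"), ("a", "c")], [("c", "d"), ("e", "c")]])
print(g.number_of_nodes(), g.number_of_edges())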
the two social networks obtained with this protocol are illustrated in figure 1 . a first observation of their topologies demonstrate that they fit a scale free network with a power law connectivity distribution and exponents = 1.39 for october'19 and = 1.77 for april'20 network [12] . the significantly different exponents demonstrate the different internal dynamics of both networks. we generate the graphs in (a) and (b) using the algorithm force atlas 2 from gephi [13] . force atlas 2 is a forced-directed algorithm that stimulates the physical system to get the network organized through the space relying on a balance of forces; nodes repulse each other as charged particles while links attract their nodes obeying a hooke's law. so that, nodes that are more distant exchange less information. we consider a simple opinion model based on the logistic equation [14] but that has proved to be of use in other contexts [15, 16] . it is a two variable dynamical model whose nonlinearities are given by: where and account for the two different opinions. as + remains constant, we can use the normalization equation + = 1, and, thus, the system reduces to a single equation: is a time rate that modifies the rhythm of evolution of the variable , is a coupling constant and controls the stationary value of . this system has two fixed points ( 0 = 0 and 0 = + ⁄ + being the latest stable and 0 = 0 unstable. we now consider that each node belongs to a network and the connections between nodes follow the distribution measured in the previous section. the dynamic equation becomes [17] , each of the nodes obey the internal dynamic given by ( ) while being coupled with the rest of the nodes with a strength / where is a diffusive constant and is the connectivity degree for node (number of nodes each node is interacting with, also named outdegree). note that this is a directed non-symmetrical network where means that node is following the tweets from nodes. is the laplacian matrix, the operator for the diffusion in the discrete space, = 1, … , . we can obtain the laplacian matrix from the connections established within the network as = − , being the adjacency matrix notice that the mathematical definition in some references of the laplacian matrix has the opposite sign. we use the above definition given by [17] in parallelism with fick's law and in order to keep a positive sign in our diffusive system. now, we proceed as follows. we consider that all the accounts (nodes in our network) are in their stable fixed point 0 = + + , from equation (6), with a 10% of random noise. then a subset of accounts is forced to acquire a different opinion, = 1 with a 10% of random noise, ∀ / = 1, . . and we let the system to evolve following the dynamical equations (3) . in this case, accounts are sorted by the number of followers that it is easily controllable. therefore, some of the nodes shift their values to values closer to 1 that, in the context of this simplified opinion model, means that those nodes shifted their opinion to values closer to those leading the shift in opinion. this process is repeated in order to gain statistical significance and, as a result, it provides the probability distribution of nodes eager to change the opinion and adhere to the new politics. our epidemiological model is based on the classic sir model [18] and considers three different states for the population: susceptible (s), infected (i) and recovered or removed individuals (r) with the transitions as sketched in figure 2 . 
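before detailing the epidemic layer, a sketch of the network opinion dynamics just described is given below; the precise parameter values and the exact form of equations (1)-(3) are not fully legible in this extract, so the logistic local dynamic and the coefficient names used here are our own assumptions rather than the paper's specification.

import numpy as np
import networkx as nx

def run_opinion_model(graph, seeded, steps=2000, dt=0.01, eps=0.1, diff=0.5):
    # logistic-type node dynamics with out-degree-scaled diffusive coupling on a
    # directed follower graph; "seeded" accounts are held at the new opinion (1)
    nodes = list(graph.nodes())
    idx = {n: i for i, n in enumerate(nodes)}
    x = np.full(len(nodes), 0.1) + 0.01 * np.random.rand(len(nodes))
    for n in seeded:
        x[idx[n]] = 1.0
    a = nx.to_numpy_array(graph, nodelist=nodes)       # adjacency: i follows j
    k_out = np.maximum(a.sum(axis=1), 1)               # number of accounts each node follows
    for _ in range(steps):
        coupling = (a @ x - k_out * x) / k_out         # discrete diffusion term
        x += dt * (eps * x * (1.0 - x) + diff * coupling)
        for n in seeded:
            x[idx[n]] = 1.0
    return dict(zip(nodes, x))

the sir transition scheme of figure 2 is then layered on top of the opinions obtained this way, as described next.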
here represents the probability of infection and the probability of recovering. we assume that recovered individuals gain immunity and therefore cannot be infected again. we consider an extended model to account for the epidemic propagation where each node interacts with others in order to spread the virus. in this context we consider that each node belongs to a complex network whose topology describes the physical interactions between individuals. the meaning of node here is a single person or a set of individuals acting as a close group (i.e. families). the idea is that the infected nodes can spread the disease with a chance to each of its connections with susceptible individuals, thus becomes a control parameter of how many individuals an infected one can propagate the disease to at each time step. then, each infected individual has a chance of being recovered from the disease. a first order approach to a human mobility network is the watts-strogatz model [19] , given its ability to produce a clustered graph where nearest nodes have higher probability of being interconnected while keeping some chances of interacting with distant nodes (as in an erdös-renyi random graph [20] ). according to this model, we generate a graph of nodes, where each node is initially connected to its nearest neighbors in a ring topology and the connections are then randomly rewired with distant nodes with a probability . the closer this probability is to 1 the more resembling the graph is to a fully random network while for = 0 it becomes a purely diffusive network. if we relate this ring-shaped network with a spatial distribution of individuals, when is small the occurrence of random interactions with individuals far from our circle of neighbors is highly severed, mimicking a situation with strict mobility restrictions where we are only allowed to interact with the individuals from our neighborhood. this feature makes the watts-strogatz model an even more suitable choice for the purposes of our study since it allows us to impose further mobility restrictions to our individuals in a simple way. on the other hand, the effects of clustering in small-world networks with epidemic models are important and have been already studied [21] [22] [23] [24] . the network is initialized setting an initial number of nodes as infected while the rest are in the susceptible state and, then, the simulations starts. at each time step, the chance that each infected individual spreads the disease to each of its susceptible connections is evaluated by means of a monte carlo method [25] . then, the chance of each infected individual being recovered is evaluated at the end of the time step in the same manner. this process is repeated until the pool of infected individuals has decreased to zero or a stopping criterion is achieved. the following step in our modelling is to include the opinion model results from the previous section in the epidemic spread model just described. first, from the outcome of the opinion model , we build a probability density ( ̅) where ̅ = 1 − represents the disagreement with the externally given opinion. these opinion values are assigned to each of the nodes in the watts-strogatz network following the distribution ( ̅). next, we introduce a modified parameter, which varies depending on the opinion value of each node. 
it can be understood in terms of a weighted network modulated by the opinions, it is more likely that an infection occurs between two rogue individuals (higher value of ̅) rather than between two individuals who agree with the government confinement policies (̅ almost zero or very close to zero). we introduce, then, the weight ′ = ⋅ ̅ ⋅ ̅ , which accounts for the effective probability of infection between an infected node and a susceptible node . at each time step of the simulation, the infection chances are evaluated accordingly to the value ′ of the connection and the process is repeated until the pool of infected individuals has decreased to zero or the stopping criterion is achieved. in figure 3 , we exemplify this process through a network diagram, where white, black and grey nodes represent susceptible, infected and recovered individuals respectively. black connections account for possible infections with chance ′ . to account for further complexity, this approach could be extrapolated to more complex epidemic models already presented in the literature [4, 6, 26] . nevertheless, for the sake of illustration, this model still preserves the main features of an epidemic spread without adding the additional complexity to account for real situations such as the covid19 case. following the previous protocol, we run the opinion model considering the two social networks analyzed. figure 4 shows the distribution of the final states of the variable for the october'19 network (orange) and the april'20 network (green) when the new opinion is introduced in a 30% of the total population (r=30%). different percentages of the initial population r were considered but the results are equivalent (see figure s1 in the supplementary information). figure 4 clearly shows that the population on april'20 is more eager to follow the new opinion (political guidelines) comparing with the situation in october'19. in the pandemic scenario (network of april '20) it is noticeable that larger values of the opinion variable, , are achieved corresponding with the period of the quarantine. preferential states are also observed around = 0, = 0.5 and = 1. note that the network of april'20 allows to change opinions more easily than in the case of october'19. during the sanitary crisis in spain, the government imposed heavy restrictions on the mobility of the population. to better account for this situation, we rescaled the probability density of disagreement opinions ( ̅) to values between 0 and 0.3, leading to the probability densities of figure 5 . from here on, we shall refer to this maximum value of the rescaled probability density as the cutoff imposed to the probability density. note that this probability distribution is directly included into de mobility model as a probability to interact with other individuals, thus, this cutoff means that the government policy is enforced reducing up to a 70% of the interactions and the reminder 30% is controlled by the population decision to adhere to the official opinion. in figure 6 we summarized the main results obtained from the incorporation of the opinion model into the epidemiological one. 
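before turning to these scenario results, the sketch below pulls the pieces together: a sir process on a watts-strogatz graph in which the infection probability over an edge is weighted by the two endpoints' disagreement values, read here as β'ij = β · ōi · ōj with ō rescaled to a maximum cutoff (0.3 in the baseline runs); all numerical parameters are illustrative assumptions rather than the paper's calibration.

import numpy as np
import networkx as nx

def opinion_weighted_sir(n=2000, k=6, p_rewire=0.25, beta=0.3, gamma=0.1,
                         cutoff=0.3, n_seed=5, steps=300, rng=None):
    rng = rng or np.random.default_rng(0)
    g = nx.watts_strogatz_graph(n, k, p_rewire)
    o = cutoff * rng.random(n)                 # disagreement values rescaled to [0, cutoff]
    state = np.zeros(n, dtype=int)             # 0 = susceptible, 1 = infected, 2 = recovered
    state[rng.choice(n, n_seed, replace=False)] = 1
    infected_curve = []
    for _ in range(steps):
        new_inf, new_rec = [], []
        for i in np.flatnonzero(state == 1):
            for j in g.neighbors(i):
                if state[j] == 0 and rng.random() < beta * o[i] * o[j]:
                    new_inf.append(j)          # monte carlo draw with effective probability beta'
            if rng.random() < gamma:
                new_rec.append(i)
        state[new_inf] = 1
        state[new_rec] = 2
        infected_curve.append(int((state == 1).sum()))
        if infected_curve[-1] == 0:
            break
    return infected_curve

curve = opinion_weighted_sir()
print(max(curve), int(np.argmax(curve)))       # peak size and the step at which it occurs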
we established four different scenarios: for the first one we considered a theoretical situation where we imposed that around the 70% of the population will adopt social distancing measures, but leave the other 30% in a situation where they either have an opinion against the policies or they have to move around interacting with the rest of the network for any reason (this means, ̅ = 0.3 for all the nodes). in contrast to this situation we introduce the opinion distribution of the social networks of april'20 and october'19. finally, we consider another theoretical population where at least 90% of the population will adopt social distancing measures (note that in a real situation, around 10% of the population occupies essential jobs and, thus, are still exposed to the virus). however, for the latter the outbreak of the epidemic does not occur so there is no peak of infection. note that the first and the last ones are completely in-silico scenarios introduced for the sake of comparison. figure 6a shows the temporal evolution of the infected population in the first three of the above scenarios. the line in blue shows the results without including an opinion model and considering that a 70% of the population blindly follows the government mobility restrictions while the reminding 30% continue interacting as usual. orange line shows the evolution including the opinion model with the probability distribution derived as in october'19. the green line is the evolution of the infected population considering the opinion model derived from the situation in april'20. note that the opinion model stated that the population in april'20 was more eager to follow changes in the opinion than in october'19, and this is directly reflected in the curves in figure 6a . also note that as the population becomes more conscious and decides to adhere to the restriction-of-mobility policies, the maximum of the infection curve differs in time and its intensity is diminished. this figure clearly shows that the state of the opinion inferred from the social network analysis strongly influences the evolution of the epidemic. the results from the first theoretical case (blue curve) show clearly that the disease reaches practically all the rogue individuals (around the 30% of the total population that we set with the rescaling of the probability density), while the other two cases with real data show that further agreement with the given opinion results in flatter curves of infection spreading. we have analyzed both the total number of infected individuals on the peaks and its location in time of the simulation, but, since our aim is to highlight the incorporation of the opinion model we show in figures 6b and 6c the values of the maximum peak infection as well as the delay introduced in achieving this maximum scaled with the corresponding values of the first case (blue line). we see that the difference on the degree of adhesion of the social networks outcomes a further 12% reduction approx. on the number of infected individuals at the peak, and a further delay of around the 20% in the time at which this peak takes place. note that for the april'20 social network, a reduction of almost the 50% of individuals is obtained for the peak of infection, and a similar value is achieved for the time delay of the peak. this clearly reflects the fact that a higher degree of adhesion is important to flatten the infection curve. 
finally, in the latter theoretical scenario, where we impose a cutoff of ̅ = 0.1, the outbreak of the epidemic does not occur, and thus there is no peak of infection. this is represented in figure 6b and 6c as a dash-filled bar indicating the absence of the said peak. changing the condition on the cutoff imposed for the variable ̅ can be of interest to model milder or stronger confinement scenarios such as the different policies ruled in different countries. in figure 7 we show the infection peak statistics (maximum of the infection curve and time at maximum) for different values of the cutoffs and for both social opinion networks. in both cases, the values are scaled with those from the theoretical scenario with all individuals having their opinion at the cutoff value. both measurements (figures 7a and 7b) are inversely proportional to the value of the cutoff. this effect can be understood in terms of the obtained probability densities. for both networks (october'19 and april '20) we obtained that most of the nodes barely changed their opinion, and thus for increasing levels on the cutoff of ̅ these counts dominate on the infection processes so the difference between both networks is reduced. on the other hand, this highlights the importance of rogue individuals in situations with increasing levels of confinement policies since for highly contagious diseases each infected individual propagates the disease rapidly. each infected individual matter and the less connections he or she has the harder is for the virus to spread along the exposed individuals. note that for all the scenarios, the social network of april'20 represents the optimum situation in terms of infection peak reduction and its time delay. it is particularly interesting the case for the cutoff in ̅ = 0.2. all simulations run for this cutoff show an almost non-existent peak. this is represented on figure 7a with almost a reduction of the 100% of the infection peak (the maximum value found on the infection curve was small but not zero) and the value of the time delay (figure 7b) as discussed in the previous section, we are considering a watts-strogatz model for the mobility network. this type of network is characterized by a probability of rewiring (as introduced in the previous section) that stablishes the number of distant connections for each individual in the network. all previous results were obtained considering a probability of rewiring of 0.25. figure 8 shows the variation of the maximum for the infection curve and time for the maximum versus this parameter. the observed trend indicates that the higher the clustering (thus, the lower the probability of rewiring) the more difficult is for the disease to spread along the network. this result is supported by previous studies in the field, which show that clustering decreases the size of the epidemics and in cases of extremely high clustering, it can die out within the clusters of population [21, 24] . this can be understood in terms of the average shortest path of the network [12] , which is a measure of the network topology that tells the average minimum number of steps required to travel between any two nodes of the network. starting from the ring topology, where only the nearest neighbors are connected, the average shortest path between any two opposite nodes is dramatically reduced with the random rewirings. remember that these new links can be understood as short-cuts or long-distance connections within the network. 
since the infection process can only occur between active links between the nodes, it makes sense that the propagation is limited if less of these long-distance connections exist in the network. the average shortest path length decays extremely fast with increasing values of the random rewiring, and thus we see that the peak statistics are barely affected for random rewirings larger than the 25%. if one is interested on further control of the disease, the connections with distant parts of the network must be minimized to values smaller than this fraction. regarding the performance of both opinion biased epidemic cases, we found again a clear difference between the two of them. in the april'19 case, the outcome of the model present always a more favorable situation to control the expansion of the epidemic, stating the importance of the personal adherence to isolation policies in controlling the evolution of the epidemic. we have parametrized the social situation of the spanish society at two different times with the data collected from a social media based on microblogging (twitter.com). the topology of these networks combined with a simple opinion model provides us with an estimate of how likely this society is to follow new opinions and change their behavioral habits. the first analysis presented here shows that the social situation in october 2019 differs significantly from that of april 2020. in fact, we have found that the latter is more likely to accept opinions or directions and, thus, follow government policies such as social distancing or confining. the output of these opinion models was used to tune the mobility in an epidemic model aiming to highlight the effect that the social 'mood' has on the pandemic evolution. the histogram of opinions was directly translated into a probability density of people choosing to follow or not the directions, modifying their exposedness to being infected by the virus. although we exemplify the results with an over-simplified epidemic model (sir), the same protocol can be implemented in more complicated epidemic models. we show that the partial consensus of the social network, although non perfect, induces a significant impact on the infection curve, and that this impact is quantitatively stronger in the network of april 2020. our results are susceptible to be included in more sophisticated models used to study the evolution of the covid19. all epidemic models lack to include the accurate effect of the society and their opinions in the propagation of epidemics. we propose here a way to monitor, almost in real time, the mood of the society and, therefore, include it in a dynamic epidemic model that is biased by the population eagerness to follow the government policies. further analysis of the topology of the social network may also provide insights of how likely the network can be influenced and identify the critical nodes responsible for the collective behavior of the network. in order to check the statistical accuracy and relevance of our networks, we considered different scenarios with more or less subnets (each subnet corresponding with a single hashtag) and estimate the exponent of the scale-free-network fit. this result is illustrated in figure s1a for the october'19 case and in figure s1b for the april'20 case. note that as the number of subnets (hashtags) is increased, the exponent converges. for 1 subnet all the exponents were calculated and for n subnets just one combination is possible so that non deviation is shown. 
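as a rough illustration of how a scale-free exponent such as the ones quoted above can be estimated from a degree distribution, the snippet below fits a straight line to the log-log degree histogram; this simple regression is only a stand-in, since the fitting procedure actually used for figures s1a-b is not specified in this extract.

import numpy as np
import networkx as nx

def degree_exponent(graph):
    # crude estimate of gamma in p(k) ~ k^(-gamma) via a log-log linear fit
    degs = np.array([d for _, d in graph.out_degree() if d > 0])
    values, counts = np.unique(degs, return_counts=True)
    slope, _ = np.polyfit(np.log(values), np.log(counts / counts.sum()), 1)
    return -slope

print(round(degree_exponent(nx.scale_free_graph(5000)), 2))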
distribution of the final states of the variable for the october'19 network (orange) and the april'20 network (green) when the new opinion is introduced by three different percentages of the total population (r parameter) is shown in figure s2 . note that in all cases the results are qualitatively equivalent and, once included in the opinion model, the results are similar. figure s2 . distribution of the concentrations for the twitter network from october 2019 (orange) and april 2020 (green) for r=20% (a), r=30% (b) and r=40% (c) of the initial accounts in the state 1 with a 10% of noise ( =0.0001, =0.01, =0.0001, 0=0.01, =20000). figure s3 shows the evolution of the number of infected individuals with time for the epidemic model biased with the opinion model of april 2020. results for different values of the ̅ cutoff are shown. note how for ̅ = 0.2 the peak of infection vanishes, and the epidemic dies out due to its lack of ability to spread among the nodes. on the other hand, figure s4 shows for different values of the cutoff on ̅, the comparison between the three cases presented in the main text (see figure 6 ): the theoretical scenario where the opinion is fixed on the cutoff value for all the nodes, and the epidemic model biased with the opinions of october '19 and april '20 scenarios. see how the difference between the theoretical scenario and the opinion biased models diminishes with growing values of the cutoff value on ̅ finally, figure s5 shows the effect that higher values of the rewiring probability of the watt-strogatz model has in the time evolution of the infected individuals. as shown in the main text, lower values of the rewiring probability has an important impact on the peak of infection, while values above = 0.3 barely change the statistics on the said peak, or fall within the error of the measurements. sectoral effects of social distancing one world, one health: the novel coronavirus covid-19 epidemic the role of superspreaders in infectious disease mathematical modeling of covid-19 transmission dynamics with a case study of wuhan predictability: can the turning point and end of an expanding epidemic be precisely forecast? epidemic processes on complex networks nodexl: a free and open network overview, discovery and exploration addin for excel evolving centralities in temporal graphs: a twitter network analysis analyzing temporal dynamics in twitter profiles for personalized recommendations in the social web emerging topic detection on twitter based on temporal and social terms evaluation gephi: an open source software for exploring and manipulating networks resherches mathematiques sur la loi d'accroissement de la population the coupled logistic map: a simple model for the effects of spatial heterogeneity on population dynamics logistic map with memory from economic model turing patterns in network-organized activatorinhibitor systems mathematical epidemiology of infectious diseases: model building, analysis and interpretation collective dynamics of small-world networks on random graphs. publicationes mathematicae epidemics and percolation in small-world networks the effects of local spatial structure on epidemiological invasions properties of highly clustered networks critical behavior of propagation on small-world networks metropolis, monte carlo and the maniac infectious diseases in humans this research is supported by the spanish ministerio de economía y competitividad and european regional development fund, research grant no. 
cov20/00617 and rti2018-097063-b-i00 aei/feder, ue; by xunta de galicia, research grant no. 2018-pg082, and the cretus strategic partnership, agrup2015/02, supported by xunta de galicia. all these programs are co-funded by feder (ue). we also acknowledge support from the portuguese foundation for science and technology (fct) within the project n. 147. 2. opinion distributions depending on the initial number of nodes with different opinion. the list of hashtags used to construct both networks is in table 1 for the october'19 case (column on the left) and for the april'20 scenario (right column). all hashtags used were neutral in the sense of political bias or age meaning. april '20 #eleccionesgenerales28a #cuidaaquientecuida #eldebatedecisivolasexta #estevirusloparamosunidos #pactosarv #quedateconesp #rolandgarros #semanaencasayoigo #niunamenos #quedateencasa #selectividad2019 #superviviente2020 #anuncioeleccions28abril #autonomosabandonados #blindarelplaneta #renta2019 #diamundialdelabicicleta' #encasaconsalvame #emergenciaclimatica27s' #diamundialdelasalud #cuarentenaextendida #asinonuvigo #ahoratocalucharjuntos #house_party #encasaconsalvame apoyare_a_sanchez pleno_del_congreso viernes_de_dolores key: cord-164703-lwwd8q3c authors: noury, zahra; rezaei, mahdi title: deep-captcha: a deep learning based captcha solver for vulnerability assessment date: 2020-06-15 journal: nan doi: nan sha: doc_id: 164703 cord_uid: lwwd8q3c captcha is a human-centred test to distinguish a human operator from bots, attacking programs, or other computerised agents that tries to imitate human intelligence. in this research, we investigate a way to crack visual captcha tests by an automated deep learning based solution. the goal of this research is to investigate the weaknesses and vulnerabilities of the captcha generator systems; hence, developing more robust captchas, without taking the risks of manual try and fail efforts. we develop a convolutional neural network called deep-captcha to achieve this goal. the proposed platform is able to investigate both numerical and alphanumerical captchas. to train and develop an efficient model, we have generated a dataset of 500,000 captchas to train our model. in this paper, we present our customised deep neural network model, we review the research gaps, the existing challenges, and the solutions to cope with the issues. our network's cracking accuracy leads to a high rate of 98.94% and 98.31% for the numerical and the alpha-numerical test datasets, respectively. that means more works is required to develop robust captchas, to be non-crackable against automated artificial agents. as the outcome of this research, we identify some efficient techniques to improve the security of the captchas, based on the performance analysis conducted on the deep-captcha model. captcha, abbreviated for completely automated public turing test to tell computers and humans apart is a computer test for distinguishing between humans and robots. as a result, captcha could be used to prevent different types of cyber security treats, attacks, and penetrations towards the anonymity of web services, websites, login credentials, or even in semiautonomous vehicles [13] and driver assistance systems [27] when a real human needs to take over the control of a machine/system. 
in particular, these attacks often lead to situations when computer programs substitute humans, and it tries to automate services to send a considerable amount of unwanted emails, access databases, or influence the online pools or surveys [4] . one of the most common forms of cyber-attacks is the ddos [8] attack in which the target service is overloaded with unexpected traffic either to find the target credentials or to paralyse the system, temporarily. one of the classic yet very successful solutions is utilising a captcha system in the evolution of the cybersecurity systems. thus, the attacking machines can be distinguished, and the unusual traffics can be banned or ignored to prevent the damage. in general, the intuition behind the captcha is a task that can distinguish humans and machines by offering them problems that humans can quickly answer, but the machines may find them difficult, both due to computation resource requirements and the algorithm complexity [5] . captchas can be in form of numerical or alpha-numerical strings, voice, or image sets. figure 1 shows a few samples of the common alpha-numerical captchas and their types. one of the commonly used practices is using text-based captchas. an example of these types of questions can be seen in figure 2 , in which a sequence of random alphanumeric characters or digits or combinations of them are distorted and drawn in a noisy image. there are many techniques and fine-details to add efficient noise and distortions to the captchas to make them more complex. for instance [4] and [9] recommends several techniques to add various type of noise to improve the security of captchas schemes such as adding crossing lines over the letters in order to imply an anti-segmentation schema. although these lines should not be longer than the size of a letter; otherwise, they can be easily detected using a line detection algorithm. another example would be using different font types, size, and rotation at the character level. one of the recent methods in this regard can be found in [28] which is called visual cryptography. on the other hand, there are a few critical points to avoid while creating captchas. for example, overestimating the random noises; as nowadays days the computer vision-based algorithms are more accurate and cleverer in avoiding noise in contrast to humans. besides, it is better to avoid very similar characters such as the number '0' and the letter 'o', letter 'l' and 'i' which cannot be easily differentiated, both by the computer and a human. besides the text-based captchas, other types of captchas are getting popular recently. one example would be image-based captchas that include sample images of random objects such as street signs, vehicles, statues, or landscapes and asks the user to identify a particular object among the given images [22] . these types of captchas are especially tricky due to the context-dependent spirit. figure 3 shows a sample of this type of captchas. however, in this paper, we will focus on text-based captchas as they are more common in high traffic and dense networks and websites due to their lower computational cost. before going to the next section, we would like to mention another application of the captcha systems that need to be discussed, which is its application in ocr (optical character recognition) systems. although current ocr algorithms are very robust, they still have some weaknesses in recognising different hand-written scripts or corrupted texts, limiting the usage of these algorithms. 
utilising captchas proposes an excellent enhancement to tackle such problems, as well. since the researchers try to algorithmically solve captcha challenges this also helps to improve ocr algorithms [7] . besides, some other researchers, such as ahn et al. [6] , suggest a systematic way to employ this method. the proposed solution is called recaptcha, and it merely offers a webbased captcha system that uses the inserted text to finetune its ocr algorithms. the system consists of two parts: first, the preparation stage which utilises two ocr algorithms to transcribe the document independently. then the outputs are compared, and then the matched parts are marked as correctly solved; and finally, the users choose the mismatched words to create a captcha challenge dataset [14] . this research tries to solve the captcha recognition problem, to detect its common weaknesses and vulnerabilities, and to improve the technology of generating captchas, to ensure it will not lag behind the ever-increasing intelligence of bots and scams. the rest of the paper is organised as follows: in section 2., we review on the literature by discussing the latest related works in the field. then we introduce the details of the proposed method in section 3.. the experimental results will be provided in section 4., followed by the concluding remarks in section 5.. in this this section, we briefly explore some of the most important and the latest works done in this field. geetika garg and chris pollett [1] performed a trained python-based deep neural network to crack fix-lengthed captchas. the network consists of two convolutional maxpool layers, followed by a dense layer and a softmax output layer. the model is trained using sgd with nesterov momentum. also, they have tested their model using recurrent layers instead of simple dense layers. however, they proved that using dense layers has more accuracy on this problem. in another work done by sivakorn et al. [2] , they have created a web-browser-based system to solve image captchas. their system uses the google reverse image search (gris) and other open-source tools to annotate the images and then try to classify the annotation and find similar images, leading to an 83% success rate on similar image captchas. stark et al. [3] have also used a convolutional neural network to overcome this problem. however, they have used three convolutional layers followed by two dense layers and then the classifiers to solve six-digit captchas. besides, they have used a technique to reduce the size of the required training dataset. in researches done in [4] and [9] the authors suggest addition of different types of noise including crossing line noise or point-based scattered noise to improve the complexity and security of the captchas patterns. furthermore, in [11] , [12] , [18] , and [31] , also cnn based methods have been proposed to crack captcha images. [24] has used cnn via the style transfer method to achieve a better result. [29] has also used cnn with a small modification, in comparison with the densenet [32] structure instead of common cnns. also, [33] and [21] have researched chinese captchas and employed a cnn model to crack them. on the other hand, there are other approaches which do not use convolutional neural networks, such as [15] . they use classical image processing methods to solve captchas. as another example, [17] uses a sliding window approach to segment the characters and recognise them one by one. 
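as a generic illustration of the sliding-window idea just mentioned (and not the algorithm of [17]), the sketch below slides a fixed-width window over a binarised captcha, keeps the highest-ink non-overlapping windows, and hands each crop to a single-character classifier; the classify_crop function is a hypothetical placeholder.

```python
# generic sketch of sliding-window character segmentation: slide a
# fixed-width window over a binarised captcha, keep the windows with the
# most "ink" that do not overlap, and classify each crop separately.
# this is an illustration only, not the cited paper's algorithm.
import numpy as np

def segment_by_sliding_window(binary_img, n_chars=5, win_w=20, stride=2):
    """binary_img: 2-d numpy array with 1 = ink, 0 = background."""
    h, w = binary_img.shape
    scores = []
    for x in range(0, w - win_w + 1, stride):
        window = binary_img[:, x:x + win_w]
        scores.append((window.sum(), x))
    # greedily keep the n_chars highest-ink windows that do not overlap
    scores.sort(reverse=True)
    starts = []
    for _, x in scores:
        if all(abs(x - s) >= win_w for s in starts):
            starts.append(x)
        if len(starts) == n_chars:
            break
    starts.sort()
    return [binary_img[:, x:x + win_w] for x in starts]

def read_captcha(binary_img, classify_crop):
    # classify_crop is assumed (hypothetically) to map one crop to one character
    return "".join(classify_crop(c) for c in segment_by_sliding_window(binary_img))
```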
another fascinating related research field would be the adversarial captcha generation algorithm. osadchy et al. [16] add an adversarial noise to an original image to make the basic image classifiers misclassifying them, while the image still looks the same for humans. [25] also uses the same approach to create enhanced text-based images. similarly, [26] and [10] , use the generative models and generative adversarial networks from different point of views to train a better and more efficient models on the data. deep learning based methodologies are widely used in almost all aspects of our life, from surveillance systems to autonomous vehicles [23] , robotics, and even in the recent global challenge of the covid-19 pandemic [35] . to solve the captcha problem, we develop a deep neural network architecture named deep-captcha using customised convolutional layers to fit our requirements. below, we describe the detailed procedure of processing, recognition, and cracking the alphanumerical captcha images. the process includes input data pre-processing, encoding of the output, and the network structure itself. applying some pre-processing operations such as image size reduction, colour space conversion, and noise reduction filtering can have a tremendous overall increase on the network performance. the original size of the image data used in this research is 135 × 50 pixel which is too broad as there exist many blank areas in the captcha image as well as many codependant neighbouring pixels. our study shows by reducing the image size down to 67 × 25 pixel, we can achieve almost the same results without any noticeable decrease in the systems performance. this size reduction can help the training process to become faster since it reduces the data without having much reduction in the data entropy. colour space to gray-space conversion is another preprocessing method that we used to reduce the size of the data while maintaining the same level of detection accuracy. in this way, we could further reduce the amount of redundant data and ease the training and prediction process. converting from a three-channel rgb image to a grey-scale image does not affect the results, as the colour is not crucial on the textbased captcha systems. the last preprocessing technique that we consider is the application of a noise reduction algorithm. after a careful experimental analysis on the appropriate filtering approaches, we decided to implement the conventional median-filter to remove the noise of the input image. the algorithm eliminates the noise of the image by using the median value of the surrounding pixels values instead of the pixel itself. the algorithm is described in algorithm 1 in which we generate the resultimage from the input 'image' using a predefined window size. unlike the classification problems where we have a specific number of classes in the captcha recognition problems, the number of classes depends on the number of digits and the length of the character set in the designed captcha. this leads to exponential growth depending on the number of classes to be detected. hence, for a captcha problem with five numerical digits, we have around 100,000 different combinations. as a result, we are required to encode the output data to fit into a single neural network. the initial encoding we used in this research was to employ nb input = d × l neurons, where d is the length of the alphabet set, and l is the character set length of the captcha. 
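before moving on to the activation used on top of this encoding, the preprocessing steps and the d × l target encoding described above can be sketched as follows; the 3 × 3 median window and the helper names are choices made here, not taken from the paper.

```python
# sketch of the preprocessing and target encoding described above:
# grayscale conversion, 135x50 -> 67x25 resize, a median filter in the
# spirit of algorithm 1, and the d x l binary output encoding
# (d symbols per position, l positions). the window size is a free choice.
import numpy as np
from PIL import Image

ALPHABET = "0123456789"          # d = 10 for numerical captchas
D, L = len(ALPHABET), 5          # five-character captchas

def preprocess(path, size=(67, 25), window=3):
    img = np.asarray(Image.open(path).convert("L").resize(size), dtype=np.float32)
    pad = window // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            # replace each pixel by the median of its window, as in algorithm 1
            out[y, x] = np.median(padded[y:y + window, x:x + window])
    return out / 255.0           # scale intensities to [0, 1]

def encode_label(text):
    target = np.zeros((L, D), dtype=np.float32)
    for pos, ch in enumerate(text):
        target[pos, ALPHABET.index(ch)] = 1.0
    return target                # flatten with .reshape(-1) for a d*l vector

def decode_prediction(matrix):
    # matrix: shape (L, D) of scores, one row per character position
    return "".join(ALPHABET[i] for i in matrix.argmax(axis=1))
```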
the layer utilises the sigmoid activation function

$s(x) = \frac{1}{1 + e^{-x}}$,     (1)

where x is the input value and s(x) is the output of the sigmoid function. as x increases, s(x) converges to 1, and as x decreases, s(x) approaches 0. applying the sigmoid function adds a non-linearity to the neurons, which improves their learning potential and their ability to deal with non-linear inputs. these sets of neurons can be arranged so that the first set of d neurons represents the first character of the captcha, the second set of d neurons represents the second character, and so on. in other words, assuming d = 10, the 15th neuron indicates whether the second character of the captcha matches the fifth symbol of the alphabet. a visual representation can be seen in figure 4.a, where the method encodes three serial numerical digits representing 621 as the output. however, this approach proved unsatisfactory because of its inability to normalise the numerical values and the impossibility of using the softmax function as the output layer of the intended neural network. therefore, we employed l parallel softmax layers instead,

$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$,     (2)

where i is the corresponding class for which the softmax is being calculated, $z_i$ is the input value of that class, and k is the maximum number of classes. each softmax layer individually represents d neurons, as in figure 4.b, and these d neurons in turn represent the alphabet used to create the captchas (for example 0 to 9, or a to z). each of the l softmax units represents the location of a character in the captcha pattern (for example, locations 1 to 3). this technique allows us to normalise each softmax unit individually over its d neurons; in other words, each unit normalises its weights over the different symbols of the alphabet and hence performs better overall. although recurrent neural networks (rnns) are one option for predicting captcha characters, in this research we have focused on sequential models, as they are faster than rnns yet can achieve very accurate results if the model is well designed. the structure of our proposed network is depicted in figure 5. the network starts with a convolutional layer with 32 filters, the relu activation function, and 5 × 5 kernels. a 2 × 2 max-pooling layer follows this layer. then we have two further convolutional-maxpooling pairs with the same parameters except for the number of filters, which are set to 48 and 64, respectively. we note that all of the convolutional layers use "same" padding. after the convolutional layers, there is a 512-unit dense layer with the relu activation function and a 30% drop-out rate. finally, we have l separate softmax layers, where l is the number of expected characters in the captcha image. the loss function of the proposed network is the binary cross-entropy, as we need to compare these binary matrices all together,

$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\left[\, y_i \log p(x_i) + (1 - y_i)\log\bigl(1 - p(x_i)\bigr) \,\right]$,     (3)

where n is the number of samples and p is the predictor model. $x_i$ and $y_i$ represent the input data and the label of the i-th sample, respectively. since the label can only be zero or one, only one part of this equation is active for each sample. we also employed the adam optimiser, which is briefly described in equations 4 to 8, where $m_t$ and $v_t$ represent exponentially decaying averages of the past gradients and past squared gradients, respectively, $\beta_1$ and $\beta_2$ are configurable constants, $g_t$ is the gradient of the function being optimised, and t is the learning iteration:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$,     (4)

$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$.     (5)

in equations 6 and 7, the bias-corrected estimates $\hat{m}_t$ and $\hat{v}_t$ are calculated as follows:

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$,     (6)

$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$.     (7)

finally, using equation 8 and by updating $\theta_t$ in each iteration, the optimum of the function can be attained:

$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$,     (8)

where $\hat{m}_t$ and $\hat{v}_t$ are calculated via equations 6 and 7, and $\eta$, the step size (also known as the learning rate), is set to 0.0001 in our approach. the intuition behind using the adam optimiser is its capability of training the network in a reasonable time. this can be seen in figure 6a, in which the adam optimiser achieves the same results as stochastic gradient descent (sgd) but with much faster convergence. after several experiments, we trained the network for 50 epochs with a batch size of 128. as can be inferred from figure 6a, even after 30 epochs the network tends to an acceptable convergence; as a result, 50 epochs seem to be sufficient for the network to perform steadily. furthermore, figure 6e suggests the same conclusion based on the measured accuracy metrics. after developing the above-described model, we trained the network on 500,000 randomly generated captchas using the python imagecaptcha library [38]. see figure 7 for some of the randomly generated numerical captchas with a fixed length of five digits (fig. 7: samples of the python numerical image-captcha library used to train the deep-captcha). to be balanced, the dataset consists of ten randomly generated images from each permutation of a five-digit text. we tested the proposed model on another set of half a million captcha images as our test dataset. as shown in table i, the network reached an overall accuracy rate of 99.33% on the training set and 98.94% on the test dataset. we note that these accuracy metrics are calculated based on the number of correctly detected captchas as a whole (i.e. correct detection of all five individual digits in a given captcha); the accuracy for individual digits is even higher, as per table ii. we also produced a confusion matrix to visualise the outcome of this research better. figure 8 shows how the network performs on each digit regardless of the position of that digit in the captcha string. the network works extremely accurately on the digits, with less than 1% misclassification for each digit. by analysing the network performance and visually inspecting 100 misclassified samples, we identified some important results that can be taken into account to decrease the vulnerability of captcha generators. while an average human could solve the majority of the misclassified captchas, the following weaknesses caused failures by the deep-captcha solver:

• in 85% of the misclassified samples, the gray-level intensity of the generated captchas was considerably lower than the average intensity of the gaussian-distributed pepper noise in the captcha image.
• in 54% of the cases, the digits 3, 8, or 9 were the cause of the misclassification.
• in 81.8% of the cases, the misclassified digits were rotated by 10° or more.
• confusion between the digits 1 and 7 was another cause of failure, particularly in the case of more than 20° counter-clockwise rotation of the digit 7.

consequently, in order to cope with the existing weaknesses and vulnerabilities of captcha generators, we strongly suggest mandatory inclusion of one or some of the digits 3,
7, 8 and 9 (with/without counter-clockwise rotations) with a significantly higher rate of embedding in the generated captchas comparing to the other digits. this will make the captchas harder to distinguish for automated algorithms such as the deep-captcha, as they are more likely to be confused with other digits, while the human brain has no difficulties in identifying them. a similar investigation was conducted for the alphabetic part of the failed detections by the deep-captcha and the majority of the unsuccessful cases were tied to either too oriented characters or those with close contact to neighbouring characters. for instance, the letter "g" could be confused with "8" in certain angles, or a "w" could be misclassified as an "m" while contacting with an upright letter such as "t ". in general, the letters that can tie together with one/some of the letters: w, v, m, n can make a complex scenario for the deep-captcha. therefore we suggest more inclusion of these letters, as well as putting these letters in close proximity to others letter, may enhance the robustness of the captchas. our research also suggests brighter colour (i.e. lower grayscale intensity) alpha-numerical characters would also help to enhance the difficulty level of the captchas. in this section, we compare the performance of our proposed method with 10 other state-of-the-art techniques. the comparison results are illustrated in table iii followed by further discussions about specification of each method. as mentioned in earlier sections, our approach is based on convolutional neural network that has three pairs of convolutional-maxpool layers followed by a dense layer that is connected to a set of softmax layers. finally, the network is trained with adam optimiser. in this research we initially focused on optimising our network to solve numerical captchas; however, since many existing methods work on both numerical and alphanumerical captchas, we developed another network capable of solving both types. also, we trained the network on 700,000 alphanumerical captchas. for a better comparison and to have a more consistent approach, we only increased the number of neurons in each softmax units from 10 to 31 to cover all common latin characters and digits. the reason behind having 31 neurons is that we have used all latin alphabets and numbers except for i, l, 1, o, 0 due to their similarity to each other and existing difficulties for an average human to tell them apart. although we have used both upper and lower case of each letter to generate a captcha, we only designate a single neuron for each of these cases in order to simplicity. in order to compare our solution, first, we investigated the research done by wang et al. [29] which includes evaluations on the following approaches: densenet-121 and resnet-50 which are fine-tuned model of the original densenet and resnet networks to solve captchas as well as dfcr which is an optimised method based on the densenet network. the dfcr has claimed an accuracy of 99.96% which is the best accuracy benchmark among other methods. however, this model has only been trained on less than 10,000 samples and only on four-digit captcha images. although the quantitative comparison in table iii shows the [29] on top of our proposed method, the validity of the method can neither be verified on larger datasets, nor on complex alphanumerical captchas with more than half a million samples, as we conducted in our performance evaluations. 
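before going through the remaining methods in this comparison, a minimal sketch of the architecture summarised at the start of this section (three convolutional-maxpool pairs with 32, 48 and 64 filters and 5 × 5 kernels, "same" padding, a 512-unit relu dense layer with 30% drop-out, and l parallel softmax heads, trained with adam at η = 0.0001 and the binary cross-entropy loss quoted above) may help fix ideas; details not stated in the text, such as weight initialisation and the exact input shape ordering, are assumptions.

```python
# minimal sketch of a deep-captcha style architecture as summarised in
# the text; hyper-parameters not quoted in the paper are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

def build_model(height=25, width=67, n_positions=5, n_symbols=10):
    inputs = keras.Input(shape=(height, width, 1))
    x = inputs
    for filters in (32, 48, 64):                      # three conv-maxpool pairs
        x = layers.Conv2D(filters, (5, 5), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dropout(0.3)(x)                        # 30% drop-out rate
    # l parallel softmax heads, one per character position
    outputs = [layers.Dense(n_symbols, activation="softmax", name=f"char_{i}")(x)
               for i in range(n_positions)]
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss="binary_crossentropy",         # the loss stated in the paper
                  metrics=["accuracy"])
    return model

# usage sketch: y_train is a list of n_positions one-hot arrays, each (n, n_symbols)
# model = build_model(); model.fit(x_train, y_train, epochs=50, batch_size=128)
```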
the next comparing method is [36] which uses an svm based method and also implementation of the vgg-16 network to solve captcha problems. the critical point of this method is the usage of image preprocessing, image segmentation and one by one character recognition. these techniques have lead to 98.81% accuracy on four-digit alphanumerical captchas. the network has been trained on a dataset composed of around 10,000 images. similarly, tod-cnn [20] have utilised segmentation method to locate the characters in addition to using a cnn model which is trained on a 60,000 dataset. the method uses a tensorflow object detection (tod) technique to segment the image and characters. goodfellow et al. [14] have used distbelief implementation of cnns to recognise numbers more accurately. the dataset used in this research was the street view house numbers (svhn) which contains images taken from google street view. finally, the last discussed approach is [37] which compares vgg16, vgg cnn m 1024, and zf. although they have relatively low accuracy compared to other methods, they have employed r-cnn methods to recognise each character and locate its position at the same time. in conclusion, our methods seem to have relatively satisfactory results on both numerical and alphanumerical captchas. having a simple network architecture allows us to utilise this network for other purposes with more ease. besides, having an automated captcha generation technique allowed us to train our network with a better accuracy while maintaining the detection of more complex and more comprehensive captchas comparing to state-of-the-art. we designed, customised and tuned a cnn based deep neural network for numerical and alphanumerical based captcha detection to reveal the strengths and weaknesses of the common captcha generators. using a series of paralleled softmax layers played an important role in detection improvement. we achieved up to 98.94% accuracy in comparison to the previous 90.04% accuracy rate in the same network, only with sigmoid layer, as described in section 3.2. and table i . although the algorithm was very accurate in fairly random captchas, some particular scenarios made it extremely challenging for deep-captcha to crack them. we believe taking the addressed issues into account can help to create more reliable and robust captcha samples which makes it more complex and less likely to be cracked by bots or aibased cracking engines and algorithms. as a potential pathway for future works, we suggest solving the captchas with variable character length, not only limited to numerical characters but also applicable to combined challenging alpha-numerical characters as discussed in section 4.. we also recommend further research on the application of recurrent neural networks as well as the classical image processing methodologies [30] to extract and identify the captcha characters, individually. 
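the accuracy figures quoted throughout the comparisons above are whole-captcha accuracies, i.e. a prediction counts only if every character is correct; a small helper makes the difference from per-digit accuracy explicit (the array shapes used here are assumptions).

```python
# helper showing the difference between per-digit accuracy and the
# whole-captcha accuracy used in the comparisons above; shapes assumed:
# y_true, y_pred are (n_samples, n_positions, n_symbols) score/one-hot arrays.
import numpy as np

def accuracies(y_true, y_pred):
    true_idx = y_true.argmax(axis=-1)          # (n_samples, n_positions)
    pred_idx = y_pred.argmax(axis=-1)
    per_digit = (true_idx == pred_idx).mean()
    whole_captcha = (true_idx == pred_idx).all(axis=1).mean()
    return per_digit, whole_captcha

# example: per-digit accuracy is always >= whole-captcha accuracy
rng = np.random.default_rng(0)
y_true = np.eye(10)[rng.integers(0, 10, size=(1000, 5))]
y_pred = y_true.copy()
flip = rng.random((1000, 5)) < 0.01            # corrupt roughly 1% of digits
y_pred[flip] = np.roll(y_pred[flip], 1, axis=-1)
print(accuracies(y_true, y_pred))
```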
neural network captcha crackers i am robot:(deep) learning to break semantic image captchas captcha recognition with active deep learning recognition of captcha characters by supervised machine learning algorithms captcha: using hard ai problems for security recaptcha: human-based character recognition via web security measures designing a secure text-based captcha ddos attack evolution accurate, data-efficient, unconstrained text recognition with convolutional neural networks yet another text captcha solver: a generative adversarial network based approach breaking microsofts captcha breaking captchas with convolutional neural networks look at the driver, look at the road: no distraction! no accident! multi-digit number recognition from street view imagery using deep convolutional neural networks an optimized system to solve text-based captcha no bot expects the deepcaptcha! introducing immutable adversarial examples, with applications to captcha generation the end is nigh: generic solving of text-based captchas captcha breaking with deep learning a survey on breaking technique of text-based captcha a low-cost approach to crack python captchas using ai-based chosen-plaintext attack a security analysis of automated chinese turing tests im not a human: breaking the google recaptcha simultaneous analysis of driver behaviour and road condition for driver distraction detection captcha image generation using style transfer learning in deep neural network captcha image generation systems using generative adversarial networks a generative vision model that trains with high data efficiency and breaks text-based captchas toward next generation of driver assistance systems: a multimodal sensor-based platform applying visual cryptography to enhance text captchas captcha recognition based on deep convolutional neural network object detection, classification, and tracking captcha recognition with active deep learning densely connected convolutional networks an approach for chinese character captcha recognition using cnn image flip captcha zero-shot learning and its applications from autonomous vehicles to covid-19 diagnosis: a review research on optimization of captcha recognition algorithm based on svm captcha recognition based on faster r-cnn key: cord-269711-tw5armh8 authors: ma, junling; van den driessche, p.; willeboordse, frederick h. title: the importance of contact network topology for the success of vaccination strategies date: 2013-05-21 journal: journal of theoretical biology doi: 10.1016/j.jtbi.2013.01.006 sha: doc_id: 269711 cord_uid: tw5armh8 abstract the effects of a number of vaccination strategies on the spread of an sir type disease are numerically investigated for several common network topologies including random, scale-free, small world, and meta-random networks. these strategies, namely, prioritized, random, follow links and contact tracing, are compared across networks using extensive simulations with disease parameters relevant for viruses such as pandemic influenza h1n1/09. two scenarios for a network sir model are considered. first, a model with a given transmission rate is studied. second, a model with a given initial growth rate is considered, because the initial growth rate is commonly used to impute the transmission rate from incidence curves and to predict the course of an epidemic. since a vaccine may not be readily available for a new virus, the case of a delay in the start of vaccination is also considered in addition to the case of no delay. 
it is found that network topology can have a larger impact on the spread of the disease than the choice of vaccination strategy. simulations also show that the network structure has a large effect on both the course of an epidemic and the determination of the transmission rate from the initial growth rate. the effect of delay in the vaccination start time varies tremendously with network topology. results show that, without the knowledge of network topology, predictions on the peak and the final size of an epidemic cannot be made solely based on the initial exponential growth rate or transmission rate. this demonstrates the importance of understanding the topology of realistic contact networks when evaluating vaccination strategies. the importance of contact network topology for the success of vaccination strategies for many viral diseases, vaccination forms the cornerstone in managing their spread and the question naturally arises as to which vaccination strategy is, given practical constraints, the most effective in stopping the disease spread. for evaluating the effectiveness of a vaccination strategy, it is necessary to have as precise a model as possible for the disease dynamics. the widely studied key reference models for infectious disease epidemics are the homogeneous mixing models where any member of the population can infect or be infected by any other member of the population; see, for example, anderson and may (1991) and brauer (2008) . the advantage of a homogeneous mixing model is that it lends itself relatively well to analysis and therefore is a good starting point. due to the homogeneity assumption, these models predict that the fraction of the population that needs to be vaccinated to curtail an epidemic is equal to 1à1=r 0 , where r 0 is the basic reproduction number (the average number of secondary infections caused by a typical infectious individual in a fully susceptible population). however, the homogeneous mixing assumption poorly reflects the actual interactions within a population, since, for example, school children and office co-workers spend significant amounts of time in close proximity and therefore are much more likely to infect each other than an elderly person who mostly stays at home. consequently, efforts have been made to incorporate the network structure into models, where individuals are represented by nodes and contacts are presented by edges. in the context of the severe acute respiratory syndrome (sars), it was shown by meyers et al. (2005) that the incorporation of contact networks may yield different epidemic outcomes even for the same basic reproduction number r 0 . for pandemic influenza h1n1/09, pourbohloul et al. (2009) and davoudi et al. (2012) used network theory to obtain a real time estimate for r 0 . numerical simulations have shown that different networks can yield distinct disease spread patterns; see, for example, bansal et al. (2007) , miller et al. (2012) , and section 7.6 in keeling and rohani (2008) . to illustrate this difference for the networks and parameters we use, the effect of different networks on disease dynamics is shown in fig. 1 . descriptions of these networks are given in section 2 and appendix b. at the current stage, most theoretical network infectious disease models incorporate, from a real world perspective, idealized random network structures such as regular (all nodes have the same degree), erd + os-ré nyi or scale-free random networks where clustering and spatial structures are absent. 
for example, volz (2008) used a generating function formalism (an alternate derivation with a simpler system of equations was recently found by miller, 2011) , while we used the degree distribution in the effective degree model presented in lindquist et al. (2011) . in these models, the degree distribution is the key network characteristic for disease dynamics. from recent efforts (ma et al., 2013; volz et al., 2011; moreno et al., 2003; on incorporating degree correlation and clustering (such as households and offices) into epidemic models, it has been found that these may significantly affect the disease dynamics for networks with identical degree distributions. fig. 2 shows disease dynamics on networks with identical degree distribution and disease parameters, but with different network topologies. clearly, reliable predictions of the epidemic process that only use the degree distribution are not possible without knowledge of the network topology. such predictions need to be checked by considering other topological properties of the network. network models allow more precise modeling of control measures that depend on the contact structure of the population, such as priority based vaccination and contact tracing. for example, shaban et al. (2008) consider a random graph with a pre-specified degree distribution to investigate vaccination models using contact tracing. kiss et al. (2006) compared the efficacy of contact tracing on random and scale-free networks and found that for transmission rates greater than a certain threshold, the final epidemic size is smaller on a scale-free network than on a corresponding random network, while they considered the effects of degree correlations in kiss et al. (2008) . cohen et al. (2003) (see also madar et al., 2004) considered different vaccination strategies on scale-free networks and found that acquaintance immunization is remarkably effective. miller and hyman (2007) considered several vaccination strategies on a simulation of the population of portland oregon, usa, and found it to be most effective to vaccinate nodes with the most unvaccinated susceptible contacts, although they found that this strategy may not be practical because it requires considerable computational resources and information about the network. bansal et al. (2006) took a contact network using data from vancouver, bc, canada, considered two vaccination strategies, namely mortality-and morbidity-based, and investigated the detrimental effect of vaccination delays. and found that, on realistic contact networks, vaccination strategies based on detailed network topology information generally outperform random vaccination. however, in most cases, contact network topologies are not readily available. thus, how different network topologies affect various vaccination strategies remains of considerable interest. to address this question, we explore two scenarios to compare percentage reduction by vaccination on the final size of epidemics across various network topologies. first, various network topologies are considered with the disease parameters constant, assuming that these have been independently estimated. second, different network topologies are used to fit to the observed incidence curve (number of new infections in each day), so that their disease parameters are different yet they all line up to the same initial exponential growth phase of the epidemic. 
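both scenarios rest on stochastic sir simulations over contact graphs; as a stripped-down illustration of the core loop (without the vaccination, refusal and delayed-immunity mechanics specified in appendix a), the sketch below uses the headline parameters quoted in section 3 and the figure captions: per-link transmission 0.06 per day, recovery rate 0.2 per day and 100 initially infectious nodes; the smaller population and the choice of graph generator are only for speed and are interchangeable.

```python
# stripped-down discrete-time sir simulation on a contact network with the
# paper's headline parameters (per-link transmission 0.06/day, recovery
# 0.2/day, 100 initially infectious); vaccination, refusal and delayed
# immunity from appendix a are deliberately omitted in this sketch.
import random
import networkx as nx

def run_sir(graph, beta=0.06, gamma=0.2, i0=100, days=200, seed=0):
    rng = random.Random(seed)
    status = {v: "S" for v in graph}                     # S, I or R
    for v in rng.sample(list(graph), i0):
        status[v] = "I"
    incidence = []
    for _ in range(days):
        new_infections, recoveries = [], []
        for v in graph:
            if status[v] != "I":
                continue
            for u in graph.neighbors(v):
                if status[u] == "S" and rng.random() < beta:
                    new_infections.append(u)
            if rng.random() < gamma:
                recoveries.append(v)
        for u in new_infections:
            status[u] = "I"
        for v in recoveries:
            status[v] = "R"
        incidence.append(len(set(new_infections)))       # new cases this day
    return incidence

if __name__ == "__main__":
    g = nx.gnm_random_graph(20000, 50000, seed=1)        # mean degree ~5
    daily = run_sir(g)
    print("peak day:", daily.index(max(daily)), "new infections:", sum(daily))
```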
vaccines are likely lacking at the outbreak of an emerging infectious disease (as seen in the 2009 h1n1 pandemic, conway et al., 2011) , and thus can only be given after the disease is already widespread. we investigate numerically whether network topologies affect the effectiveness of vaccination strategies started with a delay after the disease is widespread; for example, a 40 day delay as in the second wave of the 2009 influenza pandemic in british columbia, canada (office of the provincial health officer, 2010). details of our numerical simulations are given in appendix a. this paper is structured as follows. in section 2, a brief overview of the networks and vaccination strategies (more details are provided in appendices b and c) is given. in section 3, we investigate the scenario where the transmission rate is fixed, while in section 4 we investigate the scenario where the growth rate of the incidence curve is fixed. to this end, we compute the incidence curves and reductions in final sizes (total number of infections during the course of the epidemic) due to vaccination. for the homogeneous mixing model, these scenarios are identical (ma and earn, 2006) , but as will be shown, when taking topology into account, they are completely different. we end with conclusions in section 5. . on all networks, the average degree is 5, the population size is 200,000, the transmission rate is 0.06, the recovery rate is 0.2, and the initial number of infectious individuals is set to 100. both graphs represent the same data but the left graph has a semi-log scale (highlighting the growth phase) while the right graph has a linear scale (highlighting the peak). (b)) on networks with identical disease parameters and degree distribution (as shown in (a)). the network topologies are the random, meta-random, and near neighbor networks. see appendix b for details of the constructions of these networks. detailed network topologies for human populations are far from known. however, this detailed knowledge may not be required when the main objective is to assert the impact that topology has on the spread of a disease and on the effects of vaccination. it may be sufficient to consider a number of representative network topologies that, at least to some extent, can be found in the actual population. here, we consider the four topologies listed in table 1 , which we now briefly describe. in the random network, nodes are connected with equal probability yielding a poisson degree distribution. in a scale-free network, small number of nodes have a very large number of links and large number of nodes have a small number of links such that the degree distribution follows a power law. small world (sw) networks are constructed by adding links between randomly chosen nodes on networks in which nodes are connected to the nearest neighbors. the last network considered is what we term a meta-random network where random networks of various sizes are connected with a small number of interlinks. all networks are undirected with no self loops or multiple links. the histograms of the networks are shown in table 2 , and the details of their construction are given in appendix b. the vaccination strategies considered are summarized in table 3 . in the random strategy, an eligible node is randomly chosen and vaccinated. 
in the prioritized strategy, nodes with the highest degrees are vaccinated first, while in the follow links strategy, inspired by notions from social networks, a randomly chosen susceptible node is vaccinated and then all its neighbors and then its neighbor's neighbors and so on. finally, in contact tracing, the neighbors of infectious nodes are vaccinated. for all the strategies, vaccination is voluntary and quantity limited. that is, only susceptibles who do not refuse vaccination are vaccinated and each day only a certain number of doses is available. in the case of (relatively) new viral diseases, the supply of vaccines will almost certainly be constrained, as was the case for the pandemic influenza h1n1/09 virus. also in the case of mass vaccinations, there will be resource limitations with regard to how many doses can be administered per day. the report (office of the provincial health officer, 2010) states that the vaccination program was prioritized and it took 3 weeks before the general population had access to vaccination. thus we assume that a vaccination program can be completed in 4-6 weeks or about 40 days, this means that for a population of 200,000, a maximum of 5000 doses a day can be used. for each strategy for each time unit, first a group of eligible nodes is identified and then up to the maximum number of doses is dispensed among the eligible nodes according to the strategy chosen. more details of the vaccination strategies and their motivations are given in appendix c. to study the effect of delayed availability of vaccines during an emerging infectious disease, we compare the effect of vaccination programs starting on the first day of the epidemic with those vaccination programs starting on different days. these range from 5 to 150 days after the start of the epidemic, with an emphasis on a 40 day delay that occurred in british columbia, canada, during the influenza h1n1/2009 pandemic. when a node is vaccinated, the vaccination is considered to be ineffective in 30% of the cases (bansal et al., 2006) . in such cases, the vaccine provides no immunity at all. for the 70% of the nodes for which the vaccine will be effective, a two week span to reach full immunity is assumed (clark et al., 2009) . during the two weeks, we assume that the immunity increases linearly starting with 0 at the time of vaccination reaching 100% after 14 days. the effect of vaccination strategies has been studied (see, for example, conway et al., 2011) using disease parameter values estimated in the literature. however, network topologies were not the focus of these studies. in section 3, the effect of vaccination strategies on various network topologies is compared with a fixed per link transmission rate. the per link transmission rate b is difficult to obtain directly and is usually derived as a secondary quantity. to determine b, we pick the basic reproduction number r 0 ¼ 1:5 and the recovery rate g ¼ 0:2, which are close to that of the influenza a h1n1/09 virus; see, for example, pourbohloul et al. (2009 ), tuite et al. (2010 . in the case of the homogeneous mixing sir model, the basic reproduction number is given by r 0 ¼ t=g, where t is the per-node transmission rate. our table 1 illustration of the different types of networks used in this paper. scale-free small world meta-random table 2 degree histograms of the networks in table 1 with 200,000 nodes. scale-free small world meta-random parameter values yield t ¼ 0:3. for networks, t ¼ b/ks. 
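a quick numerical check of this step, using only the quantities defined in the surrounding text (r0 = 1.5, recovery rate 0.2, average degree 5, as also stated in the next sentence and table 4):

```python
# numerical check of the transmission-rate derivation in the text:
# t = r0 * gamma for homogeneous mixing, beta = t / <k> for networks,
# and the corresponding initial growth rate lambda = t - gamma.
r0, gamma, mean_degree = 1.5, 0.2, 5
t = r0 * gamma                    # per-node transmission rate, 0.3
beta = t / mean_degree            # per-link transmission rate, 0.06
growth_rate = t - gamma           # homogeneous-mixing growth rate, 0.1
print(t, beta, growth_rate)
```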
with the assumption that the average degree /ks ¼ 5, the above gives the per-link transmission rate b ¼ 0:06. the key parameters are summarized in table 4 . in section 3, we use this transmission rate to compare the incidence curves for the networks in table 1 with the vaccination strategies in table 3 . some of the most readily available data in an epidemic are the number of reported new cases per day. these cases generally display exponential growth in the initial phase of an epidemic and a suitable model therefore needs to match this initial growth pattern. the exponential growth rates are commonly used to estimate disease parameters (chowell et al., 2007; lipsitch et al., 2003) . in section 4, we consider the effects of various network topologies on the effectiveness of vaccination strategies for epidemics with a fixed exponential growth rate. the basic reproduction number r 0 ¼ 1:5 and the recovery rate g ¼ 0:2 yield an exponential growth rate l ¼ tàg ¼ 0:1 for the homogeneous mixing sir model. we tune the transmission rate for each network topology to give this initial growth rate. in this section, the effectiveness of vaccination strategies on various network topologies is investigated for a given set of parameters, which are identical for all the simulations. the values of the disease parameters are chosen based on what is known from influenza h1n1/09. qualitatively, these chosen parameters should provide substantial insight into the effects topology has on the spread of a disease. unless indicated otherwise the parameter values listed in table 4 are used. the effects of the vaccination strategies summarized in table 3 when applied without delay are shown in fig. 3 . for reference, fig. 1 shows the incidence curves with no vaccination. since the disease dies out in the small world network (see fig. 1 ), vaccination is not needed in this network for the parameter values taken. especially in the cases of the random and meta-random networks, the effects of vaccination are drastic while for the scale-free network they are still considerable. what is particularly notable is that when comparing the various outcomes, topology has as great if not a greater impact on the epidemic than the vaccination strategy. besides the incidence curves, the final sizes of epidemics and the effect vaccination has on these are also of great importance. table 5 shows the final sizes and the reductions in the final sizes for the various networks on which the disease can survive (for the chosen parameter values) with vaccination strategies for the cases where there is no delay in the vaccination. fig. 4 and table 6 show the incidence curves and the reductions in final sizes for the same parameters as used in fig. 3 and table 5 but with a delay of 40 days in the vaccination. as can be expected for the given parameters, a delay has the biggest effect for the scale-free network. in that case, the epidemic is already past its peak and vaccinations only have a minor effect. for the random and meta-random networks, the table 3 illustration of vaccination strategies. susceptible nodes are depicted by triangles, infectious nodes by squares, and the vaccinated nodes by circles. the average degree in these illustrations has been reduced to aid clarity. the starting point for contact tracing is labeled as a while the starting point for the follow links strategy is labeled as b. the number of doses dispensed in this illustration is 3. 
random follow links contact tracing table 3 for the network topologies in table 1 given a fixed transmission rate b. there is no delay in the vaccination and parameters are equal to those used in fig. 1 . to further investigate the effects of delay in the case of random vaccination, we compute reductions in final sizes for delays of 5, 10, 15,y,150 days, in random, scale-free, and meta-random networks. fig. 5 shows that, not surprisingly, these reductions diminish with longer delays. however, the reductions are strongly network dependent. on a scale-free network, the reduction becomes negligible as the delay approaches the epidemic peak time, while on random and meta-random networks, the reduction is about 40% with the delay at the epidemic peak time. this section clearly shows that given a certain transmission rate b, the effectiveness of a vaccination strategy is impossible to predict without having reliable data on the network topology of the population. next, we consider the case where instead of the transmission rate, the initial growth rate is given. we line up incidence curves on various network topologies to a growth rate l predicted by a homogeneous mixing sir model with the basic reproduction number r 0 ¼ 1:5 and recovery rate g ¼ 0:2 (in this case with exponential, l ¼ ðr 0 à1þg ¼ 0:1). table 7 summarizes the transmission rates that yield this exponential growth rate on the corresponding network topologies. the initial number of infectious individuals for models on each network topology needs to be adjusted as well so that the curves line up along the homogeneous mixing sir incidence curve for 25 days. as can be seen from the table, the variations in the parameters are indeed very large, with the transmission rate for the small world network being nearly 8 times the value of the transmission rate for the scale-free network. the incidence curves corresponding to the parameters in table 7 are shown in fig. 6 . as can clearly be seen, for these parameters, the curves overlap very well for the first 25 days, thus showing indeed the desired identical initial growth rates. however, it is also clear that the curves diverge strongly later on, with the epidemic on the small world network being the most severe. these results show that the spread of an epidemic cannot be predicted on the basis of having a good estimate of the growth rate alone. in addition, comparing figs. 1 and 6, a higher transmission rate yields a much larger final size and a longer epidemic on the meta-random network. the effects of the various vaccination strategies for the case of a given growth rate are shown in fig. 7 . given the large differences in the transmission rates, it may be expected that the final sizes show significant differences as well. this is indeed the case as can be seen in table 8 , which shows the percentage reduction in final sizes for the various vaccination strategies. with no vaccination, the final size of the small world network is more than 3 times that of the scale-free network, but for all except the follow links vaccination strategy the percentage reduction on the small world network is greater. the effects of a 40-day delay in the start of the vaccination are shown in fig. 8 and table 9 . besides the delay, all the parameters are identical to those in fig. 7 and table 8 . the delay has the largest effect on the final sizes of the small world network, increasing it by a factor of 20-30 except in the follow links case. 
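the fixed growth rate scenario above presumes that λ can be read off the early part of an incidence curve; as a brief aside before the remaining delay results, the usual log-linear fit over the line-up window (25 days in this paper) can be sketched as follows; the synthetic curve is only a check of the homogeneous-mixing relation λ = t − γ quoted above, not data from the simulations.

```python
# sketch of estimating the initial exponential growth rate from a daily
# incidence curve by a log-linear fit over the early phase (the paper
# lines curves up over the first 25 days), plus a check of the
# homogeneous-mixing relation lambda = t - gamma = (r0 - 1) * gamma.
import numpy as np

def initial_growth_rate(incidence, window=25):
    days = np.arange(len(incidence))[:window]
    counts = np.asarray(incidence, dtype=float)[:window]
    mask = counts > 0                        # log only defined for positive counts
    slope, _ = np.polyfit(days[mask], np.log(counts[mask]), 1)
    return slope

gamma, r0 = 0.2, 1.5
lam = (r0 - 1) * gamma                       # 0.1 for the parameters in the text
synthetic = 100 * np.exp(lam * np.arange(60))          # idealised incidence curve
print(round(initial_growth_rate(synthetic), 3))        # ~0.1
```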
on a scale-free network, the delay renders all vaccination strategies nearly ineffective. these results also confirm the importance of network topology in disease spread even when the incidence curves have identical initial growth. the initial stages of an epidemic are insufficient to estimate the effectiveness of a vaccination strategy on reducing the peak or final size of an epidemic. the relative importance of network topology on the predictability of incidence curves was investigated. this was done by considering whether the effectiveness of several vaccination strategies is impacted by topology, and whether the growth in the daily incidences has a network topology independent relation with the disease transmission rate. it was found that without a fairly detailed knowledge of the network topology, initial data cannot predict epidemic progression. this is so for both a given transmission rate b and a given growth rate l. for a fixed transmission rate and thus a fixed per link transmission probability, given that a disease spreads on a network with a fixed average degree, the disease spreads fastest on scale-free networks because high degree nodes have a very high probability to be infected as soon as the epidemic progresses. in turn, once a high degree node is infected, on average it passes on the infection to a large number of neighbors. the random and meta-random networks show identical initial growth rates because they have the same local network topology. on different table 1 without vaccination for the case where the initial growth rate is given. the transmission rates and initial number of infections for the various network topologies are given in table 7 , while the remaining parameters are the same as in fig. 1 meta-random network fig. 7 . the effects of the vaccination strategies for different topologies when the initial growth rate is given. the transmission rates b are as indicated in table 7 , while the remaining parameters are identical to those in fig. 6 . network topologies, diseases respond differently to parameter changes. for example, on the random network, a higher transmission rate yields a much shorter epidemic, whereas on the metarandom network, it yields a longer one with a more drastic increase in final size. these differences are caused by the spatial structures in the meta-random network. considering that a metarandom network is a random network of random networks, it is likely that the meta-random network represents a general population better than a random network. for a fixed exponential growth rate, the transmission rate needed on the scale-free network to yield the given initial growth rate is the smallest, being about half that of the random and the meta-random networks. hence, the per-link transmission probability is the lowest on the scale-free network, which in turn yields a small epidemic final size. for different network topologies, we quantified the effect of delay in the start of vaccination. we found that the effectiveness of vaccination strategies decreases with delay with a rate strongly dependent on network topology. this emphasizes the importance of the knowledge of the topology, in order to formulate a practical vaccination schedule. with respect to policy, the results presented seem to warrant a significant effort to obtain a better understanding of how the members of a population are actually linked together in a social network. consequently, policy advice based on the rough estimates of the network structure should be viewed with caution. 
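the appendices that follow spell out the exact network constructions used in the simulations; as a rough, simplified stand-in (explicitly not the paper's algorithms), standard networkx generators produce networks with broadly similar qualitative topology and an average degree close to 5; the meta-random construction below is a loose approximation of the cluster-plus-interlinks recipe in appendix b.

```python
# rough stand-ins for the four topologies in table 1 using standard
# networkx generators; these approximate, but are not identical to, the
# custom constructions in appendix b, and average degrees are only
# approximately 5. n is kept small for illustration.
import random
import networkx as nx

def build_networks(n=10000, mean_degree=5, seed=0):
    rng = random.Random(seed)
    nets = {}
    nets["random"] = nx.gnm_random_graph(n, n * mean_degree // 2, seed=seed)
    nets["scale_free"] = nx.barabasi_albert_graph(n, mean_degree // 2, seed=seed)
    # watts-strogatz: nearest-neighbour ring with a small rewiring probability
    nets["small_world"] = nx.watts_strogatz_graph(n, mean_degree, 0.05, seed=seed)
    # simplified meta-random network: 100 random clusters plus sparse interlinks
    meta = nx.Graph()
    offset, cluster_size = 0, n // 100
    for _ in range(100):
        sub = nx.gnm_random_graph(cluster_size, cluster_size * mean_degree // 2,
                                  seed=rng.randint(0, 10**9))
        meta.update(nx.relabel_nodes(sub, {u: u + offset for u in sub}))
        offset += cluster_size
    for u in list(meta.nodes):
        if rng.random() < 0.01:
            v = rng.randrange(n)
            if v != u:
                meta.add_edge(u, v)
    nets["meta_random"] = meta
    return nets

if __name__ == "__main__":
    for name, g in build_networks(n=2000).items():
        print(name, "mean degree:", 2 * g.number_of_edges() / g.number_of_nodes())
```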
this work is partially supported by nserc discovery grants (jm, pvdd) and mprime (pvdd). we thank the anonymous reviewers for their constructive comments. the nodes in the network are labeled by their infectious status, i.e. susceptible, infectious, vaccinated, immune, refusing vaccination (but susceptible), and vaccinated but susceptible (the vaccine is not working), respectively. the stochastic simulation is initialized by first labeling all the nodes as susceptible and then randomly labeling i 0 nodes as infectious. then, before the simulation starts, 50% of susceptible nodes are labeled as refusing vaccination but susceptible. during the simulation, when a node is vaccinated, the vaccine has a probability of 30% to be ineffective. if it is not effective, the node remains fully susceptible, but will not be vaccinated again. if it is effective, then the immunity is built up linearly over a certain period of time, taken as 2 weeks. we assume that infected persons generally recover in about 5 days, giving a recovery rate g ¼ 0:2. the initial number of infectious individuals i 0 is set to 100 unless otherwise stated, to reduce the number of runs where the disease dies out due to statistical fluctuations. all simulation results presented in sections 4 and 5 are averages of 100 runs, each with a new randomly generated network of the chosen topology. the parameters in the simulations are shown in table 4 . the population size n was chosen to be sufficiently large to be representative of a medium size town and set to n ¼ 200,000, while the degree average is taken as /ks ¼ 5 with a maximum degree m¼ 100 (having a maximum degree only affects the scalefree network since the probability of a node having degree m is practically zero for the other network types). when considering a large group of people, a good first approximation is that the links between these people are random. although it is clear that this cannot accurately represent the population since it lacks, for example, clustering and spatial aggregation (found in such common contexts as schools and work places), it may be possible that if the population is big enough, most if not all nonrandom effects average out. furthermore, random networks lend themselves relatively well to analysis so that a number of interesting (and testable) properties can be derived. as is usually the case, the random network employed here originates from the concepts first presented rigorously by erd + os and ré nyi (1959). our random networks are generated as follows: (1) we begin by creating n unlinked nodes. (2) in order to avoid orphaned nodes, without loss of generality, first every node is linked to another uniformly randomly chosen node that is not a neighbor. (3) two nodes that are not neighbors and not already linked are uniformly randomly selected. if the degree d of both the nodes is less than the maximum degree m, a link is established. if one of the nodes has maximum degree m, a new pair of nodes is uniformly randomly selected. (4) step 3 is repeated n â /ksàn times. when considering certain activities in a population, such as the publishing of scientific work or sexual contact, it has been found that the links are often well described by a scale-free network structure where the relationship between the degree and the number of nodes that have this degree follows a negative power law; see, for example, the review paper by albert and barabá si (2002) . scale-free networks can easily be constructed with the help of a preferential attachment. 
that is to say, the network is built up step by step and new nodes attach to existing nodes with a probability that is proportional to the degree of the existing nodes. our network is constructed with the help of preferential attachment, but two modifications are made in order to render the scale-free network more comparable with the other networks investigated here. first, the maximum degree is limited to m not by restricting the degree from the outset but by first creating a scale-free network and then pruning all the nodes with a degree larger than m. second, the number of links attached to each new node is either two or three dependent on a certain probability that is set such that after pruning the average degree is very close to that of the random network (i.e. /ks ¼ 5). our scale-free network is generated as follows: (1) start with three fully connected nodes and set the total number of links l¼3. (2) create a new node. with a probability of 0.3, add 2 links. otherwise add 3 links. for each of these additional links to be added find a node to link to as outlined in step 3. (3) loop through the list of nodes and create a link with probability d=ð2lþ, where d is the degree of the currently considered target node. (4) increase l by 2 or 3 depending on the choice in step 2. (5) repeat nà3 times steps 2 and 3. (6) prune nodes with a degree 4 m. small world networks are characterized by the combination of a relatively large number of local links with a small number of non-local links. consequently, there is in principle a very large number of possible small world networks. one of the simplest ways to create a small world network is to first place nodes sequentially on a circle and couple them to their neighbors, similar to the way many coupled map lattices are constructed (willeboordse, 2006) , and to then create some random short cuts. this is basically also the way the small world network used here is generated. the only modification is that the coupling range (i.e. the number of neighbors linked to) is randomly varied between 2 and 3 in order to obtain an average degree equal to that of the random network (i.e. /ks ¼ 5). we also use periodic boundary conditions, which as such is not necessary for a small world network but is commonly done. the motivation for studying small world networks is that small groups of people in a population are often (almost) fully linked (such as family members or co-workers) with some connections to other groups of people. our small world network is generated as follows: (1) create n new unlinked nodes with index i ¼ 1 . . . n. (2) with a probability of 0.55, link to neighboring and second neighboring nodes (i.e. create links i2ià1, i2iþ 1, i2ià2, i2iþ 2). otherwise, also link up to the third neighboring nodes (i.e. create links i2ià1, i2i þ1, i2ià2, i2i þ2, i2ià3, i2i þ3). periodic boundary conditions are used (i.e. the left nearest neighbor of node 1 is node n while the right nearest neighbor of node n is node 1). (3) create the 'large world' network by repeating step 2 for each node. (4) with a probability of 0.05 add a link to a uniformly randomly chosen node excluding self and nodes already linked to. (5) create the small world network by carrying out step 4 for each node. in the random network, the probability for an arbitrary node to be linked to any other arbitrary node is constant and there is no clear notion of locality. 
in the small world network on the other hand, tightly integrated local connections are supplemented by links to other parts of the network. to model a situation in between where randomly linked local populations (such as the populations of villages in a region) are randomly linked to each other (for example, some members of the population of one village are linked to some members of some other villages), we consider a meta-random network. when increasing the number of shortcuts, a meta-random network transitions to a random network. it can be argued that among the networks investigated here, a meta-random network is the most representative of the population in a state, province or country. our meta-random network is generated as follows: (1) create n new unlinked nodes with index i ¼ 1 . . . n. (2) group the nodes into 100 randomly sized clusters with a minimum size of 20 nodes (the minimum size was chosen such that it is larger than /ks, which equals five throughout, to exclude fully linked graphs). this is done by randomly choosing 99 values in the range from 1 to n to serve as cluster boundaries with the restriction that a cluster cannot be smaller than the minimum size. (3) for each cluster, create an erd + os-ré nyi type random network. (4) for each node, with a probability 0.01, create a link to a uniformly randomly chosen node of a uniformly randomly chosen cluster excluding its own cluster. the network described in this subsection is a near neighbor network and therefore mostly local. nevertheless, there are some shortcuts but shortcuts to very distant parts of the network are not very likely. it could therefore be called a medium world network (situated between small and large world networks). the key feature of this network is that despite being mostly local its degree distribution is identical to that of the random network. our near neighbor network is generated as follows: (1) create n new unlinked nodes with index i ¼ 1 . . . n. (2) for each node, set a target degree by randomly choosing a degree with a probability equal to that for the degree distribution of the random network. (3) if the node has reached its target degree, continue with the next node. if not continue with step 4. (4) with a probability of 0.5, create a link to a node with a smaller index, otherwise create a link to a node with a larger index (using periodic boundary conditions). (5) starting at the nearest neighbor by index and continuing by decreasing (smaller indices) or increasing (larger indices) the index one by one while skipping nodes already linked to, search for the nearest node that has not reached its target degree yet and create a link with this node. (6) create the network by repeating steps 3-5 for each node. for all the strategies, vaccination is voluntary and quantity limited. that is to say only susceptibles who do not refuse vaccination are vaccinated and each day only a certain number of doses is available. for each strategy for each time unit, first a group of eligible nodes is identified and then up to the maximum number of doses is dispensed among the eligible nodes according to the strategy chosen. in this strategy, nodes with the highest degrees are vaccinated first. the motivation for this strategy is that high degree nodes on average can be assumed to transmit a disease more often than low degree nodes. numerically, the prioritized vaccination strategy is implemented as follows: (1) for each time unit, start at the highest degree (i.e. 
(1) for each time unit, start at the highest degree (i.e. consider nodes with degree d = m) and repeat the steps below until either the per-time-step dose limit or the total number of available doses is reached.
(2) count the number of susceptible nodes with degree d.
(3) if the number of susceptible nodes with degree d is zero, set d = d − 1 and return to step 2.
(4) if the number of susceptible nodes with degree d is smaller than or equal to the number of available doses, vaccinate all of these nodes, then set d = d − 1 and continue with step 2. otherwise continue with step 5.
(5) if the number of susceptible nodes with degree d is greater than the number of currently available doses, randomly choose nodes with degree d to vaccinate until the available doses are used up.
(6) when all the doses are used up, end the vaccination for the current time unit and continue when the next time unit arrives.
(a code sketch of this degree-prioritized allocation is given after the strategy descriptions below.) in practice, prioritizing on the basis of certain target groups, such as health care workers or people at high risk of complications, can be difficult. prioritizing on the basis of the number of links is even more difficult: how would such individuals be identified? one of the easiest vaccination strategies to implement is random vaccination. numerically, the random vaccination strategy is implemented as follows:
(1) for each time unit, count the total number of susceptible nodes.
(2) if the total number of susceptible nodes is smaller than or equal to the number of doses per unit time, vaccinate all the susceptible nodes. otherwise do step 3.
(3) if the total number of susceptible nodes is larger than the number of doses per unit time, randomly vaccinate susceptible nodes until all the available doses are used up.
one way to reduce the spread of a disease is to split the population into many isolated groups. this could be done by vaccinating nodes with links to different groups. however, given the network types studied here, breaking links between groups is not really feasible since, besides the random cluster network, there is no clear group structure in the other networks. another approach is the follow links strategy, inspired by notions from social networks, where an attempt is made to split the population by vaccinating the neighbors, the neighbors' neighbors, and so on, of a randomly chosen susceptible node. numerically, the follow links strategy is implemented as follows:
(1) count the total number of susceptible nodes.
(2) if the total number of susceptible nodes is smaller than or equal to the number of doses per unit time, vaccinate all the susceptible nodes.
(3) if the total number of susceptible nodes is greater than the number of available doses per unit time, first randomly choose a susceptible node, label it as the current node, and vaccinate it.
(4) vaccinate all the susceptible neighbors of the current node.
(5) randomly choose one of the neighbors of the current node.
(6) set the current node to the node chosen in step 5.
(7) continue with steps 4-6 until all the doses are used up or no available susceptible neighbor can be found.
(8) if no available susceptible neighbor can be found in step 7, randomly choose a susceptible node from the population and continue with step 4.
contact tracing was successfully used in combating the sars virus. in that case, everyone who had been in contact with an infectious individual was isolated to prevent a further spread of the disease. de facto, this kind of isolation boils down to removing links, rendering the infectious node degree 0, a scenario not considered here.
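a minimal python sketch of one time unit of the degree-prioritized allocation described above (the random and follow-links strategies can be written analogously). the `status` dictionary, the label "V" for vaccinated nodes, and the function name are our conventions, not the paper's.

```python
import random

def prioritized_vaccination_step(g, status, doses, max_degree):
    """one time unit of the degree-prioritized strategy: vaccinate susceptible
    nodes from the highest degree downwards until the doses run out."""
    d = max_degree
    while doses > 0 and d >= 0:
        # susceptible nodes of the currently considered degree
        candidates = [v for v in g.nodes
                      if status[v] == "S" and g.degree(v) == d]
        if len(candidates) <= doses:
            chosen = candidates                        # vaccinate all of them
        else:
            chosen = random.sample(candidates, doses)  # use up the remaining doses
        for v in chosen:
            status[v] = "V"                            # mark as vaccinated
        doses -= len(chosen)
        d -= 1                                         # move to the next lower degree
    return doses  # leftover doses (nonzero only if susceptibles ran out)
```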
here contact tracing tries to isolate an infectious node by vaccinating all its susceptible neighbors. numerically, the contact tracing strategy is implemented as follows: (1) count the total number of susceptible nodes. (2) if the total number of susceptible nodes is smaller than or equal to the number of doses per unit time, vaccinate all the susceptible nodes. (3) count only those susceptible nodes that have an infectious neighbor. (4) if the number of susceptible nodes neighboring an infectious node is smaller than or equal to the number of doses per unit time, vaccinate all these nodes. (5) if the number of susceptible nodes neighboring an infectious node is greater than the number of available doses repeat step 6 until all the doses are used up. (6) randomly choose an infectious node that has susceptible neighbors and vaccinate its neighbors until all the doses are used up. statistical mechanics of complex networks infectious diseases of humans a comparative analysis of influenza vaccination programs when individual behaviour matters: homogeneous and network models in epidemiology compartmental models in epidemiology comparative estimation of the reproduction number for pandemic influenza from daily case notification data trial of 2009 influenza a (h1n1) monovalent mf59-adjuvanted vaccine efficient immunization strategies for computer networks and populations vaccination against 2009 pandemic h1n1 in a population dynamical model of vancouver, canada: timing is everything early real-time estimation of the basic reproduction number of emerging infectious diseases. phys. rev. x 2, 031005. erd + os modeling infectious diseases in humans and animals infectious disease control using contact tracing in random and scale-free networks the effect of network mixing patterns on epidemic dynamics and the efficacy of disease contact tracing effective degree network disease models transmission dynamics and control of severe acute respiratory syndrome generality of the final size formula for an epidemic of a newly invading infectious disease effective degree household network disease models immunization and epidemic dynamics in complex networks network theory and sars: predicting outbreak diversity a note on a paper by erik volz: sir dynamics in random networks effective vaccination strategies for realistic social networks edge based compartmental modelling for infectious disease spread epidemic incidence in correlated complex networks office of the provincial health officer, 2010. b.c.s response to the h1n1 pandemic initial human transmission dynamics of the pandemic (h1n1) 2009 virus in north america dynamics and control of diseases in networks with community structure a high-resolution human contact network for infectious disease transmission networks, epidemics and vaccination through contact tracing estimated epidemiologic parameters and morbidity associated with pandemic h1n1 influenza sir dynamics in random networks with heterogeneous connectivity effects of heterogeneous and clustered contact patterns on infectious disease dynamics dynamical advantages of scale-free networks key: cord-127900-78x19fw4 authors: leung, abby; ding, xiaoye; huang, shenyang; rabbany, reihaneh title: contact graph epidemic modelling of covid-19 for transmission and intervention strategies date: 2020-10-06 journal: nan doi: nan sha: doc_id: 127900 cord_uid: 78x19fw4 the coronavirus disease 2019 (covid-19) pandemic has quickly become a global public health crisis unseen in recent years. 
it is known that the structure of the human contact network plays an important role in the spread of transmissible diseases. in this work, we study a structure aware model of covid-19 cgem. this model becomes similar to the classical compartment-based models in epidemiology if we assume the contact network is a erdos-renyi (er) graph, i.e. everyone comes into contact with everyone else with the same probability. in contrast, cgem is more expressive and allows for plugging in the actual contact networks, or more realistic proxies for it. moreover, cgem enables more precise modelling of enforcing and releasing different non-pharmaceutical intervention (npi) strategies. through a set of extensive experiments, we demonstrate significant differences between the epidemic curves when assuming different underlying structures. more specifically we demonstrate that the compartment-based models are overestimating the spread of the infection by a factor of 3, and under some realistic assumptions on the compliance factor, underestimating the effectiveness of some of npis, mischaracterizing others (e.g. predicting a later peak), and underestimating the scale of the second peak after reopening. epidemic modelling of covid-19 has been used to inform public health officials across the globe and the subsequent decisions have significantly affected every aspect of our lives, from financial burdens of closing down businesses and the overall economical crisis, to long term affect of delayed education, and adverse effects of confinement on mental health. given the huge and long-term impact of these models on almost everyone in the world, it is crucial to design models that are as realistic as possible to correctly assess the cost benefits of different intervention strategies. yet, current models used in practice have many known issues. in particular, the commonly-used compartment based models from classical epidemiology do not consider the structure of the real world contact networks. it has been shown previously that contact network structure changes the course of an infection spread significantly (keeling 2005; bansal, grenfell, and meyers 2007) . in this paper, we demonstrate the structural effect of different underlying contact networks in covid-19 modelling. standard compartment models assume an underlying er contact network, whereas real networks have a non-random structure as seen in montreal wifi example. in each network, two infected patients with 5 and 29 edges are selected randomly and the networks in comparison have the same number of nodes and edges. in wifi network, infected patients are highly likely to spread their infection in their local communities while in er graph they have a wide-spread reach. non-pharmaceutical interventions (npis) played a significant role in limiting the spread of covid-19. understanding effectiveness of npis is crucial for more informed policy making at public agencies (see the timeline of npis applied in canada in table 2 ). however, the commonly used compartment based models are not expressive enough to directly study different npis. for example, ogden et al. (2020) described the predictive modelling efforts for covid-19 within the public health agency of canada. to study the impact of different npis, they used an agent-based model in addition to a separate deterministic compartment model. one significant disadvantage of the compartment model is its inability to realistically model the closure of public places such as schools and universities. 
this is due to the fact that compartment models assume that each individual has the same probability to be in contact with every other individual in the population which is rarely true in reality. only by incorporating real world contact networks into compartment models, one can disconnect network hubs to realistically simulate the effect of closure. therefore, ogden et al. (2020) need to rely on a separate stochastic agent-based model to model the closure of public places. in contrast, our proposed cgem is able to directly model all npis used in practice realistically. in this work, we propose to incorporate structural information of contact network between individuals and show the effects of npis applied on different categories of contact networks. in this way, we can 1) more realistically model various npis, 2) avoid the imposed homogeneous mixing assumption from compartment models and utilize different networks for different population demographics. first, we perform simulations on various synthetic and real world networks to compare the impact of the contact network structure on the spread of disease. second, we demonstrate that the degree of effectiveness of npis can vary drastically depending on the underlying structure of the contact network. we focus on the effects of 4 widely adopted npis: 1) quarantining infected and exposed individuals, 2) social distancing, 3) closing down of non-essential work places and schools, and 4) the use of face masks. lastly, we simulate the effect of re-opening strategies and show that the outcome will depend again on the assumed underlying structure of the contact networks. to design a realistic model of the spread of the pandemic, we also used a wifi hotspot network from montreal to simulate real world contact networks. given our data is from montreal, we focus on studying montreal timeline but the basic principles are valid generally and cgem is designed to be used with any realistic contact network. we believe that cgem can improve our understanding on the current covid-19 pandemic and be informative for public agencies on future npi decisions. summary of contributions: • we show that structure of the contact networks significantly changes the epidemic curves and the current compartment based models are subject to overestimating the scale of the spread • we demonstrate the degree of effectiveness of different npis depends on the assumed underlying structure of the contact networks • we simulate the effect of re-opening strategies and show that the outcome will depend again on the assumed underlying structure of the contact networks reproducibility: code for the model and synthetic network generation are in supplementary material. the real-world data can be accessed through the original source. different approaches have accounted for network structures in epidemiological modelling. degree block approximation (barabási et al. 2016 ) considers the degree distribution of the network by grouping nodes with the same degree into the same block and assuming that they have the same behavior. percolation theory methods (newman 2002) can approximate the final size of the epidemic for networks with specified degree distributions. recently, sambaturu et al. (2020) (vogel 2020; lawson et al. 2020) design effective vaccination strategies based on real and diverse contact networks. 
various modifications have been made to the compartment differential equations to account for the network effect (aparicio and pascual 2007; keeling 2005; bansal, grenfell, and meyers 2007). simulation-based approaches are often used when the underlying networks are complex and mathematically intractable. grefenstette et al. (2013) employed an agent-based model to simulate the dynamics of the seir model with a census-based synthetic population; the contact networks are implied by the behavior patterns of the agents. chen et al. (2020) adopted the independent cascade (ic) model (saito, nakano, and kimura 2008) to simulate the disease propagation and used a facebook network as a proxy for the contact network. social networks, however, are not always a good approximation for physical contact networks. in our study, we attempt to better ground the simulations by inferring the contact networks from wifi hub connection records.
table 2: cgem can realistically model all npis used in practice while existing models miss one or more npis.
tuite, fisman, and greer (2020) used a compartment model to project outcomes such as the prevalence of hospital admissions, icu use, and death. they assumed the effect of physical-distancing measures was to reduce the number of contacts per day across the entire population. in addition, enhanced testing and contact tracing were assumed to move individuals with non-severe symptoms from the infectious to isolated compartments. in this work, we also examine the effect of closure of public places, which is difficult to simulate in a realistic manner with standard compartment models. ogden et al. (2020) described the predictive modelling efforts for covid-19 within the public health agency of canada. they estimated that more than 70% of the canadian population may be infected by covid-19 if no intervention is taken. they proposed an agent-based model and a deterministic compartment model. in the compartment model, similar to tuite, fisman, and greer (2020), the effects of physical distancing are modelled by reducing daily per capita contact rates. the agent model is used to separately simulate the effects of closing schools, workplaces and other public places. in this work, we compare the effects of all npis used in practice through a unified model and show how different contact networks change the outcome of npis. in addition, ferguson et al. (2020) employed an individual-based simulation model to evaluate the impact of npis, such as quarantine, social distancing and school closure. the number of deaths and icu bed demand are used as proxies to compare the effectiveness of npis. in comparison, our model can directly utilize contact networks, and we also model the impact of wearing masks. block et al. (2020) proposed three selective social distancing strategies based on the observation that epidemic dynamics depend on the network structure. the strategies aim to increase network clustering and eliminate shortcuts, and are shown to be more effective than naive social distancing. reich, shalev, and kalvari (2020) proposed a selective social distancing strategy which lowers the mean degree of the network by limiting super-spreaders. the authors also compared the impact of various npis, including testing, contact tracing, quarantine and social distancing. neural network based approaches (soures et al. 2020; dandekar and barbastathis 2020) are also proposed to estimate the effectiveness of quarantine and forecast the spread of the disease.
in a classic seir model, referred to as base seir, the dynamics of the system at each time step can be described by the standard seir equations (aron and schwartz 1984):
dS/dt = −β·S·I/N,  dE/dt = β·S·I/N − σ·E,  dI/dt = σ·E − γ·I,  dR/dt = γ·I,
where an individual can be in one of 4 states at any given time step t: (s) susceptible, (e) exposed, (i) infected (and able to infect susceptible nodes), and (r) recovered. β, σ, γ are the transition rates from s to e, e to i, and i to r, respectively. similarly, in cgem, an individual can be either (s) susceptible, (e) exposed, (i) infected or (r) recovered. we do not consider reinfection, but extensions are straightforward. unlike the equation-based seir model, which assumes homogeneous mixing, cgem takes into account the contact patterns between individuals by simulating the spread of a disease over a contact network. each individual becomes a node in the network and the edges represent the connections between people. algorithm 1 shows the pseudo code for cgem. given a contact network, we assume that a node comes into contact with all its neighbours at each time step. more specifically, at each time step, the susceptible neighbours of infected individuals enter the exposed state with a transmission probability φ; exposed nodes become infected with probability σ, and infected nodes recover with probability γ. (a code sketch of this simulation loop is given below.) by using standard synthetic graph generators (barabási et al. 2016), the parameters of the synthetic graph generation can be adjusted to produce graphs of the same size, thus facilitating a fair comparison between different structures. we discuss details in the following sections.
inferring transmission rate: by definition, β represents the likelihood that a disease is transmitted from an infected to a susceptible individual in a unit of time. barabási et al. (2016) assume that on average each node comes into contact with ⟨k⟩ neighbors, so the relationship between β and the transmission rate φ can be expressed as
β = ⟨k⟩ · φ, (1)
where ⟨k⟩ is the average degree of the nodes. in the case of a regular random network, all nodes have the same degree, i.e. k = ⟨k⟩, and equation 1 reduces to
β = k · φ. (2)
since the homogeneous mixing assumption made by the standard seir model can be well simulated by running cgem over a regular random network, we propose to bridge the two models with the following procedure: 1. fit the classic seir model to real data to estimate β. 2. run cgem over regular random networks with different values of k, with φ derived from equation 2. 3. choose k = k* which produces the best fit to the predictions of the classic seir model. the regular random network with average degree k* is then the contact network the classic seir model is approximating, and φ* = β/k* is the implied transmission rate. we use this transmission rate for the other contact networks studied, so that the dynamics of the disease (transmissibility) is fixed and only the structure of the contact graph changes.
tuning synthetic network generators: as a proxy for actual contact networks, which are often not available, we can pair cgem with synthetic networks that have more realistic properties, comparable to real world networks, e.g. a heavy-tail degree distribution, a small average shortest path, etc. to adjust the parameters of these generators, we can reframe the problem as: given transmission rate φ* and population size n, are there other networks which can produce the same infection curve? for this, we can carry out similar procedures as above.
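a minimal python sketch of the cgem propagation rules described earlier in this section (s → e via infected neighbours with probability φ, e → i with probability σ, i → r with probability γ), assuming networkx; the function name and the `status` dictionary are ours, not the paper's. to use it, initialise `status` with a few "E"/"I" seeds and everyone else "S", then iterate until no "E" or "I" nodes remain.

```python
import random
import networkx as nx

def cgem_step(g, status, phi, sigma, gamma):
    """one cgem time step over contact network g."""
    newly_exposed, newly_infected, newly_recovered = [], [], []
    for v in g.nodes:
        if status[v] == "I":
            for u in g.neighbors(v):
                # susceptible neighbours of an infected node become exposed
                if status[u] == "S" and random.random() < phi:
                    newly_exposed.append(u)
            if random.random() < gamma:
                newly_recovered.append(v)          # infected -> recovered
        elif status[v] == "E" and random.random() < sigma:
            newly_infected.append(v)               # exposed -> infected
    # apply all transitions after the sweep so updates do not cascade within a step
    for u in newly_exposed:
        status[u] = "E"
    for v in newly_infected:
        status[v] = "I"
    for v in newly_recovered:
        status[v] = "R"
    return status
```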
for example, we can run cgem with transmission rate φ* over scale-free networks generated with different values of m_ba, where m_ba is the number of edges a new node can form in the barabasi-albert algorithm (barabási et al. 2016). the m_ba which produces the best fit to the infection curve gives us a synthetic contact network that is realistic in terms of the number of edges compared to the real contact network. here we explain how the different npis can be modelled directly in cgem as changes in the underlying structure (a code sketch of these graph operations follows this description).
quarantine: how can we model the quarantining and self-isolation of exposed and infected individuals? exposed individuals have come into close contact with an infected person and are considered to be at high risk of contracting the disease. in an ideal world, most, if not all, infected individuals would be easily identifiable and quarantined. in reality, however, over 40% of infected cases are asymptomatic (he et al. 2020), and not all are identified immediately or at all, so they can go on to infect others unintentionally. to account for this in our model, we apply quarantining by removing all edges from a subset of exposed and infected nodes.
social distancing: social distancing reduces opportunities for close contact between individuals by limiting contacts to those from the same household and staying at least 6 feet apart from others when out in public. in cgem, a percentage of edges from each node is removed to simulate the effects of social distancing to different extents.
wearing masks: masks are shown to be effective in reducing the transmission rate of covid-19, with a relative risk (rr) of 0.608 (ollila et al. 2020). we simulate this by assigning a mask-wearing state to each node and varying the transmissibility, φ, based on whether the 2 nodes in contact are wearing masks or not. we define the new transmission rate with this npi, φ_mask, as follows:
φ_mask = m2 · φ, if both nodes wear masks; m1 · φ, if 1 node wears a mask; m0 · φ, otherwise.
closure (removing hubs): places of mass gathering (e.g. schools and workplaces) put large numbers of people in close proximity. if infected individuals are present in these locations, they can have a large number of contacts and very quickly infect many others. in a network, nodes with a high number of connections, or degree, are known as hubs. by removing the top degree hubs, we simulate the effects of cancelling mass gatherings and closing down schools and non-essential workplaces. in cgem, we remove all edges from r% of the top degree nodes to simulate the closure of schools and non-essential workplaces. however, some hubs, such as (workers in) grocery stores and some government agencies, must remain open, so we assign each hub a successful removal rate of p_success to control this effect.
compliance: given that the npis are complied with by the majority but not all of the individuals, we randomly assign a fixed percentage of the nodes as non-compliers. we set this to 26% in all the simulations, based on a recent survey (bricker 2020).
due to the economic and psychological impacts of a complete lockdown on society, it is critical to know how safe it is to resume commercial and social activities once the pandemic has stabilized. therefore, we also investigate the impact of relaxing each npi and the risk of a second wave of infection. more specifically, we simulate a complete reversal of the npis by adding back the edges that were removed when the npi was first applied, returning the underlying structure to its original form.
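a minimal python sketch of two of the npis above expressed as graph operations (hub closure with a success probability, and social distancing as fractional edge removal), assuming networkx; function names, the default percentages, and the returned edge lists (kept so that reopening can add the edges back, as described above) are our conventions, not the paper's.

```python
import random
import networkx as nx

def close_hubs(g, top_fraction=0.10, p_success=0.8, seed=None):
    """closure npi: drop all edges of the top-degree nodes (hubs),
    each hub being successfully closed with probability p_success."""
    rng = random.Random(seed)
    n_hubs = max(1, int(top_fraction * g.number_of_nodes()))
    hubs = sorted(g.nodes, key=g.degree, reverse=True)[:n_hubs]
    removed = []
    for h in hubs:
        if rng.random() < p_success:               # some hubs must stay open
            removed.extend((h, u) for u in list(g.neighbors(h)))
            g.remove_edges_from(list(g.edges(h)))
    return removed                                  # keep the edges for reopening

def social_distancing(g, fraction=0.3, seed=None):
    """social distancing npi: remove a given fraction of each node's edges."""
    rng = random.Random(seed)
    removed = []
    for v in list(g.nodes):
        edges = list(g.edges(v))
        for e in rng.sample(edges, int(fraction * len(edges))):
            if g.has_edge(*e):
                g.remove_edge(*e)
                removed.append(e)
    return removed
```

reopening an npi then amounts to `g.add_edges_from(removed)`, which restores the underlying structure to its original form.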
we compare the spread of covid-19 with synthetic and real world networks. these networks include 3 synthetic networks, (1) the regular random network, where all nodes have the same degree, (2) the erdős-reńyi random network, where the degree distribution is poisson distributed, (3) the barabasi albert network, where the degree distributions follows a power law. additionally, we analyzed 4 real world network, the usc35 network from the face-book100 dataset (traud, mucha, and porter 2012) , consisting of facebook friendship relationship links between students and staffs at the university of southern california in september 2005, and 3 snapshots of a real world wifi hotspot network from montreal , a network often used as a proxy for human contact network while studying disease transmission yang et al. 2020 ). in the montreal wifi network, edges are formed between nodes (mobile phones) that are connected to the same public wifi hub at the same time. as shown in table 3 , each of the 7 networks consist of 17,800 nodes, consistent with 1/100th of the population of the city of montreal, and have between 110,000 to 220,000 edges, with the exception of the usc network. due to the aggregated nature of the usc dataset, edge sampling is enforced during the contact phase in order to obtain reasonable disease spread. the synthetic networks are in general more closely connected than the montreal wifi networks, despite having similar number of nodes and edges. only the largest connected component is considered in all networks. the structure of the contact network plays an important role in the spread of a disease (bansal, grenfell, and meyers 2007) . it dictates how likely susceptible nodes will come into contact with infected ones and therefore it is crucial to evaluate how the disease will spread on each network with the same initial parameters. here, the classic seir model is fitted against the infection rates from the first of the 100th case in montreal to april 4 to obtain β, which is before any npi is applied. with eq. 2, the transmission rate, φ, is estimated to be 0.0371 and is used across all networks. in all experiments, we also seed the population with the same initial number of 3 exposed nodes and 1 infected node. the parameters used to generate synthetic networks are obtained following the procedures described in the previous session. all results are averaged across 10 runs. the grey shaded region shows the 95% confidence interval of each curve. as shown in figure 2 , the er network fits the base seir model almost perfectly-compare green 'er' and black 'base' curves. observation 1 cgem closely approximates the base seir model when the contact network is assumed to be erdős-reńyi graph. all networks drastically overestimates the spread of covid-19 when compared with real world data. this can be expected to some degree as in this experiment we are projecting the curves assuming no npi is in effect which is not what happened in reality (see 'real' orange curve). however, we observe that all 3 synthetic networks, including the er model exceedingly overshoot, showing almost the entire population getting infected, whereas the real-world wifi networks predict a 3x lower peak. observation 2 assuming an erdős-reńyi graph as the contact network overestimates the impact of covid-19 by more than a factor of 3 when compared with more realistic structures. in order to limit the effects of the pandemic, the federal and provincial governments introduced a number of measures to reduce the spread of covid-19. 
we simulate the effects of 4 different non-pharmaceutical interventions, or npis, at different strengths to determine their effectiveness. these include, (1) quarantining exposed and infected individuals, (2) social distancing between nodes, (3) removing hubs, and (4) the use of face masks. quarantine we apply quarantining into our model on march 23. where both quebec and canadian government have asked those who returned from foreign travels or experienced flu-like symptoms to self isolate. we remove all edges from 50, 75, and 95% of exposed and infected nodes to simulate various strengths of quarantining. figure 8 displays the effect of quarantining on different graph structures. quarantining infected and exposed nodes both reduces and delays the peak of all infection curve. however, the peak is not delayed as much in the wifi graphs as the er graph predicts, which is important information in planning for the healthcare system. out of all tested npis, applying quarantine has the most profound reduction on all infections curves. observation 3 quarantining delays the peak of infection on the er graph whereas the peak on the real world graphs are lowered but not delayed significantly. social distancing reduces the number of close contacts. different degrees of 10%, 30%, and 50% of edges from each node is removed to simulate this. figure 9 shows the effects of social distancing on the infection curves of each network structures. it is effective in reducing the peak of the pandemic on all networks but again delays the peaks only on synthetic networks. similar to observation 3, we have: observation 4 social distancing delays the peak of infection on the er graph whereas the peak on the real world graphs are lowered but not delayed significantly. removing hubs we remove all edges from 1% of top degree nodes to simulate the closure of schools and 5 and 10% of top degree nodes to simulate the closure of non-essential workplaces. these npis are applied on march 23 respectively, coinciding with the dates of school and non-essential business closure in quebec. p success is set to 0.8 unless otherwise stated. figure 10 shows the effects of removing hubs. this npi is very effective on the ba network and all 3 montreal wifi networks since these networks have a power law degree distribution and hubs are present. however, it is not very effective on the regular and er random networks. observation 5 the er graph significantly underestimates the effect of removing hubs. removing hubs is most effective on networks with a power law degree distribution since hubs act as super spreaders and removing them effectively contains the virus. however, no hubs are present in the er and regular random network, and thus removing hubs reduces to removing random nodes. luckily, real world contact networks have power law degree distributions, making a hubs removal an effective strategy in practice. wearing masks we set m 2 = 0.6, m 1 = 0.8 and m 0 = 1, and use the following transmission rate, φ mask in cgem: if both nodes wearing masks 0.8 · φ, if 1 node wearing masks 1 · φ, otherwise wearing masks is only able to flatten the infection curve on the synthetic networks but does not reduce the final epidemic attack rate, the total size of population infected, as shown in figure 11 . however, in the real world wifi networks, wearing masks is able to both flatten the curve and also significantly reduce the final epidemic attack rate. 
observation 6 the er graph significantly underestimates the effect of wearing masks in terms of the total decrease in the final attack rate.
figure 6: difference between the cumulative curves from wearing masks and not wearing masks. the cumulative curves represent the total impact, and the difference shows how much of a drop in the final attack rate is estimated with the npi enforced.
we experiment with reopening for all the npis, but for brevity we only report the results for allowing hubs back, which corresponds to the current reopening of schools and public places. the results for the other npis are available in the extended results. for removing hubs, we apply reopening on july 18 (denoted by the second vertical line in figure 7), after many non-essential businesses and workplaces were allowed to open in quebec. because the synthetic networks estimate that most of the population would be infected before the hubs are reopened, we calibrate the number of infected and recovered individuals at the point of reopening to align with statistics available in the real world data. therefore the simulation continues after reopening with all the models having the same number of susceptible individuals; otherwise, in the er graph, everyone is infected at that point. we can see in figure 7 that the er and regular random networks significantly underestimate the extent of second wave infections. the ba and wifi networks all show second wave infections with a higher peak than the initial one, prompting more caution when considering reopening businesses and schools.
observation 7 the er graph significantly underestimates the second peak after reopening public places, i.e. allowing back hubs.
in this paper, we propose to model covid-19 on contact networks (cgem) and show that such modelling, when compared to traditional compartment based models, gives significantly different epidemic curves. moreover, cgem subsumes the traditional models while providing more expressive power to model the npis. we hope that cgem can be used to achieve more informed policy making when studying reopening strategies for covid-19.
url https building epidemiological models from r 0: an implicit treatment of transmission in networks seasonality and period-doubling bifurcations in an epidemic model when individual behaviour matters: homogeneous and network models in epidemiology network science social networkbased distancing strategies to flatten the covid-19 curve in a post-lockdown world one quarter 26 percent of canadians admit they're not practicing physical distancing as a time-dependent sir model for covid-19 with undetectable infected persons neural network aided quarantine control model estimation of global covid-19 spread impact of non-pharmaceutical interventions (npis) to reduce covid19 mortality and healthcare demand fred (a framework for reconstructing epidemic dynamics): an open-source software system for modeling infectious diseases and control strategies using census-based populations temporal dynamics in viral shedding and transmissibility of covid-19 epidemic wave dynamics attributable to urban community structure: a theoretical characterization of disease transmission in a large network données covid-19 au québec the implications of network structure for epidemic dynamics covid-19: recovery and re-opening tracker crawdad dataset ilesansfil/wifidog situation of the coronavirus covid-19 in montreal spread of epidemic disease on networks predictive modelling of covid-19 in canada face masks prevent transmission of respiratory diseases: a meta-analysis of randomized controlled trials modeling covid-19 on a network: super-spreaders, testing and containment. medrxiv prediction of information diffusion probabilities for independent cascade model designing effective and practical interventions to contain epidemics sir-net: understanding social distancing measures with hybrid neural network model for covid-19 infectious spread social structure of facebook networks mathematical modelling of covid-19 transmission and mitigation strategies in the population of ontario covid-19: a timeline of canada's first-wave response targeted pandemic containment through identifying local contact network bottlenecks montreal wifi network 3 snapshots of the montreal wifi network are used in this paper with the following time periods: 2004-08-27 to 2006-11-30, 2007-07-01 to 2008-02-26, and 2009-12-02 to 2010-03-08 . each entry in the dataset consists of a unique connection id, a user id, node id (wifi hub), timestamp in, and timestamp out. nodes in the network are the users in each connection. an edge forms between users who have connected to the same wifi hub at the same time. connections are sampled with the aforementioned timestamp in dates to obtain ∼ 17800 nodes. since there are many disconnected nodes in the wifi networks, only the giant connected component is used.synthetic networks we compared cgem with the wifi networks with 3 synthetic network models, the regular, er, and ba networks. in each of these models, we set the number of nodes to be 17,800 and fit respective parameters to best match the infection curve of the base model and the number of edges in the wifi networks. table 5 all the experiments have been performed on a stock laptop. the following assumptions are made in cgem:1. individuals who recover from covid-19 cannot be infected again 2. symptomatic and asymptomatic individuals have the same transmission rate and they quarantine with the same probability 3. a certain percentage of the population do not compile with npis regardless of their connection. 
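the appendix above describes how contact edges are formed from the montreal wifi logs (two users connected to the same hub at the same time). a minimal python sketch of that construction, assuming networkx; the tuple layout of a log entry (user id, hub id, timestamp in, timestamp out) and the function name are assumptions for illustration, not the dataset's actual schema.

```python
from collections import defaultdict
from itertools import combinations
import networkx as nx

def contact_graph_from_wifi(log):
    """build a contact network from wifi connection records: link two users
    if they were connected to the same hub during overlapping time windows."""
    by_hub = defaultdict(list)
    for user, hub, t_in, t_out in log:
        by_hub[hub].append((user, t_in, t_out))
    g = nx.Graph()
    for hub, sessions in by_hub.items():
        # pairwise overlap check within one hub (quadratic per hub)
        for (u, a_in, a_out), (v, b_in, b_out) in combinations(sessions, 2):
            if u != v and a_in < b_out and b_in < a_out:
                g.add_edge(u, v)
    return g
```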
quarantine figure 8 shows the results of quarantining on all graph structures. quarantining infected and exposed nodes both reduces and delays the peak of all infection curve. however, the peak is not delayed as much in the wifi graphs when compared to the regular and er graphs.social distancing figure 9 shows the results of applying social distancing on all networks. like quarantining, this is effective in reducing the peaks of the infection curve on all networks, but the delay of peaks is only apparent on the synthetic networks.removing hubs figure 10 shows the results of apply school and business closure on all networks. the er and regular random networks significantly underestimates the effect of removing hubs.wearing masks figure 11 shows the results of wearing masks and without on each network. figure 12 shows the infection curves of all the networks with all npis applied. on march 23, 50% social distancing and 50% quaranine is applied, and 10% of hubs are removed with a success rate of 0.8. wearing mask is applied on april 6. the wifi networks more closely resemble the shape of the real infection curve. table 2 key: cord-034824-eelqmzdx authors: guo, chungu; yang, liangwei; chen, xiao; chen, duanbing; gao, hui; ma, jing title: influential nodes identification in complex networks via information entropy date: 2020-02-21 journal: entropy (basel) doi: 10.3390/e22020242 sha: doc_id: 34824 cord_uid: eelqmzdx identifying a set of influential nodes is an important topic in complex networks which plays a crucial role in many applications, such as market advertising, rumor controlling, and predicting valuable scientific publications. in regard to this, researchers have developed algorithms from simple degree methods to all kinds of sophisticated approaches. however, a more robust and practical algorithm is required for the task. in this paper, we propose the enrenew algorithm aimed to identify a set of influential nodes via information entropy. firstly, the information entropy of each node is calculated as initial spreading ability. then, select the node with the largest information entropy and renovate its l-length reachable nodes’ spreading ability by an attenuation factor, repeat this process until specific number of influential nodes are selected. compared with the best state-of-the-art benchmark methods, the performance of proposed algorithm improved by 21.1%, 7.0%, 30.0%, 5.0%, 2.5%, and 9.0% in final affected scale on cenew, email, hamster, router, condmat, and amazon network, respectively, under the susceptible-infected-recovered (sir) simulation model. the proposed algorithm measures the importance of nodes based on information entropy and selects a group of important nodes through dynamic update strategy. the impressive results on the sir simulation model shed light on new method of node mining in complex networks for information spreading and epidemic prevention. complex networks are common in real life and can be used to represent complex systems in many fields. for example, collaboration networks [1] are used to cover the scientific collaborations between authors, email networks [2] denote the email communications between users, protein-dna networks [3] help people gain a deep insight on biochemical reaction, railway networks [4] reveal the structure of railway via complex network methods, social networks show interactions between people [5, 6] , and international trade network [7] reflects the products trade between countries. 
a deep understanding and controlling of different complex networks is of great significance in information spreading and network connectivity. on one hand, by using the influential nodes, we can make successful advertisements for products [8] , discover drug target candidates, assist information weighted networks [54] and social networks [55] . however, the node set built by simply assembling the nodes and sorting them employed by the aforementioned methods may not be comparable to an elaborately selected set of nodes due to the rich club phenomenon [56] , namely, important nodes tend to overlap with each other. thus, lots of methods aim to directly select a set of nodes are proposed. kempe et al. defined the problem of identifying a set of influential spreaders in complex networks as influence maximization problem [57] , and they used hill-climbing based greedy algorithm that is within 63% of optimal in several models. greedy method [58] is usually taken as the approximate solution of influence maximization problem, but it is not efficient for its high computational cost. chen et al. [58] proposed newgreedy and mixedgreedy method. borgatti [59] specified mining influential spreaders in social networks by two classes: kpp-pos and kpp-neg, based on which he calculated the importance of nodes. narayanam et al. [60] proposed spin algorithm based on shapley value to deal with information diffusion problem in social networks. although the above greedy based methods can achieve relatively better result, they would cost lots of time for monte carlo simulation. so more heuristic algorithms were proposed. chen et al. put forward simple and efficient degreediscount algorithm [58] in which if one node is selected, its neighbors' degree would be discounted. zhang et al. proposed voterank [61] which selects the influential node set via a voting strategy. zhao et al. [62] introduced coloring technology into complex networks to seperate independent node sets, and selected nodes from different node sets, ensuring selected nodes are not closely connected. hu et al. [63] and guo et al. [64] further considered the distance between independent sets and achieved a better performance. bao et al. [65] sought to find dispersive distributed spreaders by a heuristic clustering algorithm. zhou [66] proposed an algorithm to find a set of influential nodes via message passing theory. ji el al. [67] considered percolation in the network to obtain a set of distributed and coordinated spreaders. researchers also seek to maximize the influence by studying communities [68] [69] [70] [71] [72] [73] . zhang [74] seperated graph nodes into communities by using k-medoid method before selecting nodes. gong et al. [75] divided graph into communities of different sizes, and selected nodes by using degree centrality and other indicators. chen et al. [76] detected communities by using shrink and kcut algorithm. later they selected nodes from different communities as candidate nodes, and used cdh method to find final k influential nodes. recently, some novel methods based on node dynamics have been proposed which rank nodes to select influential spreaders [77, 78] .şirag erkol et al. made a systematic comparison between methods focused on influence maximization problem [79] . they classify multiple algorithms to three classes, and made a detailed explanation and comparison between methods. more algorithms in this domain are described and classified clearly by lü et al. in their review paper [80] . 
most of the non-greedy strategy methods suffer from the possibility that some spreaders are so close that their influence may overlap. degreediscount and voterank use an iterative selection strategy: after a node is selected, they weaken its neighbors' influence to cope with the rich club phenomenon. however, these two algorithms only coarsely incorporate nodes' local information. besides, they do not further make use of the differences between nodes when weakening nodes' influence. in this paper, we propose a new heuristic algorithm named enrenew, based on node entropy, to select a set of influential nodes. enrenew also uses an iterative selection strategy. it initially calculates the influence of each node by its information entropy (further explained in section 2.2), and then repeatedly selects the node with the largest information entropy and renovates the information entropy of its l-length reachable nodes by an attenuation factor, until a specific number of nodes has been selected. experiments show that the proposed method yields the largest final affected scale on 6 real networks in the susceptible-infected-recovered (sir) simulation model compared with state-of-the-art benchmark methods. the results reveal that enrenew could be a promising tool for related work. besides, to make the algorithm practically more useful, we provide enrenew's source code and all the experiment details at https://github.com/yangliangwei/influential-nodes-identification-in-complex-networksvia-information-entropy, and researchers can download it freely for their convenience. the rest of the paper is organized as follows: the identification method is presented in section 2, experiment results are analyzed and discussed in section 3, and conclusions and future research topics are given in section 4.
the best way to measure the influence of a set of nodes in complex networks is through a propagation dynamic process on real life network data. the susceptible-infected-removed model (sir model) was initially used to simulate the dynamics of disease spreading [23]. it was later widely used to analyze similar spreading processes, such as rumors [81] and populations [82]. in this paper, the sir model is adopted to objectively evaluate the spreading ability of the nodes selected by the algorithms. each node in the sir model can be in one of three states, namely, susceptible (s), infected (i), and recovered (r). at first, the initially selected nodes are set to the infected state and all other nodes in the network to the susceptible state. in each propagation iteration, each infected node randomly chooses one of its direct neighbors and infects it with probability µ. in the meantime, each infected node recovers with probability β and cannot be infected again. in this study, λ = µ/β is defined as the infection rate, which is crucial to the spreading speed in the sir model. the network reaches a steady state with no infection after enough propagation iterations. to enable information to spread widely in networks, we set µ = 1.5µ_c, where µ_c = ⟨k⟩/(⟨k²⟩ − ⟨k⟩) [83] is the spreading threshold of the sir model and ⟨k⟩ is the average degree of the network. when µ is smaller than µ_c, spreading in the sir model can only affect a small range or even cannot spread at all. when it is much larger than µ_c, nearly all methods can affect the whole network, which would be meaningless for comparison. thus, we select µ around µ_c in the experiments (a code sketch of this simulation is given below). during the sir propagation described above, enough information can be obtained to evaluate the impact of the initially selected nodes in the network; the metrics derived from the procedure are explained in section 2.4.
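a minimal python sketch of the threshold µ_c and of the sir process described above (each infected node picks one random neighbour to infect with probability µ and recovers with probability β), assuming networkx; the function names and the returned final affected scale are our conventions. note that, following the text, β here denotes the recovery probability.

```python
import random
import networkx as nx

def sir_threshold(g):
    """epidemic threshold used above: mu_c = <k> / (<k^2> - <k>)."""
    degrees = [d for _, d in g.degree()]
    k1 = sum(degrees) / len(degrees)
    k2 = sum(d * d for d in degrees) / len(degrees)
    return k1 / (k2 - k1)

def sir_spread(g, seeds, mu, beta, max_steps=10_000):
    """run the contact-based sir process and return the final affected scale."""
    status = {v: "S" for v in g.nodes}
    for s in seeds:
        status[s] = "I"
    for _ in range(max_steps):
        infected = [v for v in g.nodes if status[v] == "I"]
        if not infected:
            break                                   # steady state reached
        for v in infected:
            neighbours = list(g.neighbors(v))
            if neighbours:
                u = random.choice(neighbours)       # one randomly chosen neighbour
                if status[u] == "S" and random.random() < mu:
                    status[u] = "I"
            if random.random() < beta:
                status[v] = "R"                     # recovered, cannot be reinfected
        # nodes infected in this step only start spreading from the next step
    affected = sum(1 for v in g.nodes if status[v] in ("I", "R"))
    return affected / g.number_of_nodes()
```

for example, `mu = 1.5 * sir_threshold(g)` reproduces the setting µ = 1.5µ_c used in the experiments.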
during the sir propagation mentioned above, enough information can be obtained to evaluate the impact of initial selected nodes in the network and the metrics derived from the procedure is explained in section 2.4. the influential nodes selecting algorithm proposed in this paper is named enrenew, deduced from the concept of the algorithm. enrenew introduces entropy and renews the nodes' entropy through an iterative selection process. enrenew is inspired by voterank algorithm proposed by zhang et al. [61] , where the influential nodes are selected in an iterative voting procedure. voterank assigns each node with voting ability and scores. initially, each node's voting ability to its neighbors is 1. after a node is selected, the direct neighbors' voting ability will be decreased by 1 k , where k = 2 * m n is the average degree of the network. voterank roughly assigns all nodes in graph with the same voting ability and attenuation factor, which ignores node's local information. to overcome this shortcoming, we propose a heuristic algorithm named enrenew and described as follows. in information theory, information quantity measures the information brought about by a specific event and information entropy is the expectation of the information quantity. these two concepts are introduced into complex network in reference [44] [45] [46] to calculate the importance of node. information entropy of any node v can be calculated by: where p uv = d u ∑ l∈γv d l , ∑ l∈γ v p lv = 1, γ v indicates node v's direct neighbors, and d u is the degree of node u. h uv is the spreading ability provided from u to v. e v is node v's information entropy indicating its initial importance which would be renewed as described in algorithm 1. a detailed calculating of node entropy is shown in figure 1 . it shows how the red node's (node 1) entropy is calculated in detail. node 1 has four neighbors from node 2 to node 5. node 1's information entropy is then calculated by simply selecting the nodes with a measure of degree as initial spreaders might not achieve good results. because most real networks have obvious clumping phenomenon, that is, high-impact nodes in the network are often connected closely in a same community. information cannot be copiously disseminated to the whole network. to manage this situation, after each high impact node is selected, we renovate the information entropy of all nodes in its local scope and then select the node with the highest information entropy, the process of which is shown in algorithm 1. e k = − k · 1 k · log 1 k and k is the average degree of the network. 1 2 l−1 is the attenuation factor, the farther the node is from node v, the smaller impact on the node will be. e k can be seen as the information entropy of any node in k -regular graph if k is an integer. from algorithm 1, we can see that after a new node is selected, the renew of its l-length reachable nodes' information entropy is related with h and e k , which reflects local structure information and global network information, respectively. compared with voterank, enrenew replaces voting ability by h value between connected nodes. it induces more local information than directly set voting ability as 1 in voterank. at the same time, enrenew uses h e k as the attenuate factor instead of 1 k in voterank, retaining global information. computational complexity (usually time complexity) is used to describe the relationship between the input of different scales and the running time of the algorithm. 
computational complexity (usually time complexity) describes the relationship between inputs of different scales and the running time of an algorithm. generally, brute force can solve most problems accurately, but it cannot be applied in most scenarios because of its intolerable time complexity. time complexity is therefore an important indicator of an algorithm's practicality. through the analysis below, the algorithm is shown to be able to identify influential nodes in large-scale networks in limited time. the computational complexity of enrenew can be analyzed in three parts: initialization, selection and renewing. n, m and r represent the number of nodes, edges and initial infected nodes, respectively. at the start, enrenew takes o(n · ⟨k⟩) = o(m) to calculate the information entropy. node selection picks the node with the largest information entropy and requires o(n), which can be decreased further to o(log n) if the entropies are stored in an efficient data structure such as a red-black tree. renewing the l-length reachable nodes' information entropy needs o(⟨k⟩^l) = o(m^l / n^l). as suggested in section 3.3, l = 2 yields impressive results with o(m²/n²). since the selection and renewing parts need to be performed r times to obtain enough spreaders, the final computational complexity is o(m + n) + o(r log n) + o(r⟨k⟩²) = o(m + n + r log n + r·m²/n²). in particular, when the network is sparse and r ≪ n, the complexity decreases to o(n).
the algorithm's performance is measured by the selected nodes' properties, including their spreading ability and their location property. spreading ability can be measured by the infected scale at time t, f(t), and the final infected scale, f(t_c), which are obtained from the sir simulation and widely used to measure the spreading ability of nodes [61,84-88]. l_s is obtained from the selected nodes' location property by measuring their dispersion [61]. the infected scale f(t) demonstrates the influence scale at time t and is defined by
f(t) = (n_i(t) + n_r(t)) / n,
where n_i(t) and n_r(t) are the numbers of infected and recovered nodes at time t, respectively. at the same time step t, a larger f(t) indicates that more nodes have been infected by the initial influential nodes, while a shorter time t indicates that the initial influential nodes spread faster in the network. f(t_c) is the final affected scale when the spreading reaches its stable state; it reflects the final spreading ability of the initial spreaders, and the larger the value, the stronger the spreading capacity of the initial nodes. f(t_c) is defined by
f(t_c) = (n_i(t_c) + n_r(t_c)) / n = n_r(t_c) / n,
where t_c is the time when the sir propagation procedure reaches its stable state (so that n_i(t_c) = 0). l_s is the average shortest path length of the initial infection set s. usually, with a larger l_s, the initial spreaders are more dispersed and can influence a larger range. it is defined by
l_s = (1 / (|s| · (|s| − 1))) · ∑_{u,v∈s, u≠v} l_{u,v},
where l_{u,v} denotes the length of the shortest path from node u to v. if u and v are disconnected, the shortest path length is replaced by d_gc + 1, where d_gc is the largest diameter of the connected components. (a code sketch of l_s is given below.)
an example network, shown in figure 2, is used to illustrate the rationality of the nodes the proposed algorithm chooses. the first three nodes selected by enrenew are distributed across three communities, while those selected by the other algorithms are not. we further run the sir simulation on the example network with enrenew and the other five benchmark methods. the detailed result is shown in table 1 for an in-depth discussion; the result is obtained by averaging over 1000 experiments.
figure 2: this network consists of three communities at different scales. the first nine nodes selected by enrenew are marked red. the network typically shows the rich club phenomenon, that is, nodes with large degree tend to be connected together.
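a minimal python sketch of the l_s measure defined above, assuming networkx; the function name is ours. computing d_gc via per-component diameters can be slow on large graphs.

```python
import itertools
import networkx as nx

def average_seed_distance(g, seeds):
    """L_s of a seed set: mean shortest-path length over all pairs of seeds,
    with d_gc + 1 used for disconnected pairs, as described above."""
    # largest diameter among the connected components (penalty base)
    d_gc = max(nx.diameter(g.subgraph(c)) for c in nx.connected_components(g))
    total, pairs = 0.0, 0
    for u, v in itertools.combinations(seeds, 2):
        try:
            d = nx.shortest_path_length(g, u, v)
        except nx.NetworkXNoPath:
            d = d_gc + 1
        total += d
        pairs += 1
    return total / pairs if pairs else 0.0
```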
table 1 shows the experiment results when choosing 9 nodes as the initial spreading set. the greedy method is usually used as the upper bound, but it is not efficient in large networks due to its high time complexity. enrenew and pagerank distribute 4 nodes in community 1, 3 nodes in community 2, and 1 node in community 3; the distribution matches the sizes of the communities. however, the nodes selected by the other algorithms, except for the greedy method, tend to cluster in community 1. this induces spreading within a high-density area, which is not efficient for reaching the entire network. enrenew and pagerank can adaptively allocate a reasonable number of nodes based on the size of each community, just as the greedy method does. the nodes selected by enrenew have the second largest average distance among the non-greedy methods (only pagerank's is larger), which indicates that enrenew tends to distribute nodes sparsely in the graph. this aptly alleviates the adverse effect on spreading caused by the rich club phenomenon. although enrenew's average distance is smaller than pagerank's, it has a higher final infected scale f(t_c). the result for pagerank also indicates that merely selecting nodes widely spread across the network may not lead to a larger influence range. enrenew performs the closest to greedy with a low computational cost, which shows the proposed algorithm's effectiveness in maximizing influence with a limited number of nodes.
note (table 2): n and m are the total numbers of nodes and edges, respectively; ⟨k⟩ = 2m/n stands for the average node degree; k_max = max_{v∈V} d_v is the maximum degree in the network; and the average clustering coefficient c measures the degree of aggregation in the network, c = (1/n) ∑_{i=1}^{n} 2·I_i / (|Γ_i| · (|Γ_i| − 1)), where I_i denotes the number of edges between the direct neighbors of node i.
table 2 describes six different networks varying from small to large scale, which are used to evaluate the performance of the methods. cenew [89] is a list of edges of the metabolic network of c. elegans. email [90] is an email user communication network. hamster [91] is a network reflecting friendship and family links between users of the website http://www.hamsterster.com, where nodes are the web users and edges the relationships between two of them. the router network [92] reflects the internet topology at the router level. condmat (condensed matter physics) [93] is a collaboration network of authors of scientific papers from the arxiv; it shows the author collaboration in papers submitted to condensed matter physics. a node in the network represents an author, and an edge between two nodes shows that the two authors have collaboratively published papers. in the amazon network [94], each node represents a product, and an edge between two nodes indicates that the two products were frequently purchased together.
we first conduct experiments on the parameter l, which is the influence range used when renewing the information entropy. if l = 1, only the importance of the selected node's direct neighbors is renewed; if l = 2, the importance of 2-length reachable nodes is renewed, and so forth. the results with the parameter l varying from 1 to 4 are shown in figure 3. it can be seen from figure 3 that, when l = 2, the method gets the best performance in four of the six networks. in the email network, although the results for l = 3 and l = 4 are slightly better compared with the case of l = 2, the running time increases sharply.
besides, the three degrees of influence (tdi) theory [95] also states that an individual's social influence extends only within a relatively small range. based on our experiments, we set the influence range parameter l to 2 in the following experiments. for a given ratio of initial infected nodes p, a larger final affected scale f(t_c) indicates a more suitable choice of the parameter l. the best value of l differs between networks; in real-life applications, l can be treated as a tuning parameter. many factors affect the final propagation scale in networks. a good influential-node mining algorithm should prove its robustness across networks varying in structure, number of nodes, initial infection set size, infection probability, and recovery probability. to evaluate the performance of enrenew, the voterank, adaptive degree, k-shell, pagerank, and h-index algorithms are selected as benchmark methods for comparison. furthermore, the greedy method is usually taken as the upper bound for the influence maximization problem, but it is not practical on large networks due to its high time complexity; we therefore add the greedy method as an upper bound only on the two small networks (cenew and email). the final affected scale f(t_c) of each method for different initial infected set sizes is shown in figure 4. it can be seen that enrenew achieves impressive results on the six networks. on the small networks cenew and email, enrenew clearly outperforms the other benchmark methods, and it nearly reaches the upper bound on the email network. in the hamster network, it achieves an f(t_c) of 0.22 with an initial infected ratio of only 0.03, a large improvement over all the other methods. in the condmat network, the number of affected nodes is nearly 20 times larger than the number of initial ones. in the large amazon network, 11 nodes are affected on average for each selected initial infected node. however, the algorithm performs unsatisfactorily on the router network; none of the methods yields good results there because of the high sparsity of the network, in which information can hardly spread out from a small number of initial spreaders. comparing the six methods in figure 4, enrenew surpasses all the other methods on five networks for nearly all values of p, from small to large. this result shows that enrenew retains its superiority over the other methods as the size of the initial infected set varies. it is worth noticing that enrenew performs about the same as the other methods when p is small, but its advantage grows as the initial infected ratio p rises. this supports the rationale of the importance-renewing process: the renewing step influences more nodes when p is larger, and the growing margin over the other methods indicates that the renewing process redistributes node importance in a reasonable way. a time-step experiment is conducted to assess the propagation speed for a fixed number of initial infected nodes. the results of f(t) as a function of the time step t are shown in figure 5. from the experiment, it can be seen that with the same number of initial infected nodes, enrenew always reaches a higher peak than the benchmark methods, which indicates a larger final infection scale. in the steady stage, enrenew surpasses the best benchmark method by 21.1%, 7.0%, 30.0%, 5.0%, 2.5% and 9.0% in final affected scale on the cenew, email, hamster, router, condmat, and amazon networks, respectively.
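the benchmark comparison above relies on enrenew's select-and-renew loop. the exact information-entropy and renewal formulas are given earlier in the paper and are not reproduced here, so the sketch below uses a placeholder entropy and a placeholder discount purely to illustrate the control flow (select the highest-entropy node, renew the entropies of its l-step neighbours, repeat r times); it is not the authors' implementation.

```python
import math
import networkx as nx

def select_spreaders(G, r, l=2, discount=0.5):
    # placeholder entropy: h_i = -sum_{j in n(i)} p_j * log(p_j), p_j = k_j / sum of neighbour degrees
    H = {}
    for i in G:
        ks = [G.degree(j) for j in G[i]]
        s = float(sum(ks)) or 1.0
        H[i] = -sum((k / s) * math.log(k / s) for k in ks if k > 0)
    seeds = []
    for _ in range(r):
        v = max((i for i in G if i not in seeds), key=H.get)   # selection step
        seeds.append(v)
        # renewing step: down-weight the entropy of nodes within l hops of the chosen spreader
        for u, dist in nx.single_source_shortest_path_length(G, v, cutoff=l).items():
            if u != v:
                H[u] = max(0.0, H[u] - discount ** dist * H[v])
    return seeds

if __name__ == "__main__":
    G = nx.les_miserables_graph()   # stand-in network
    print(select_spreaders(G, r=5, l=2))
```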
in terms of propagation speed, enrenew reaches its peak at about the 300th time step in cenew, the 200th in email, the 400th in hamster, the 50th in router, the 400th in condmat, and the 150th in amazon. enrenew always takes less time to influence the same number of nodes compared with the other benchmark methods. from figure 5, it can also be seen that k-shell performs worst from the early stage onward in all the networks: nodes with high core values tend to cluster together, which makes information hard to disseminate. especially in the amazon network, after 100 time steps all the other methods reach an f(t) of 0.0028, which is more than twice that of k-shell. in contrast to k-shell, enrenew spreads the fastest from the early stage to the steady stage. this shows that the proposed method not only achieves a larger final infection scale but also has a faster propagation rate. in real-life situations, the infection rate λ varies greatly and has a large influence on the propagation process; different values of λ represent viruses or information with different spreading abilities. the results for different λ and methods are shown in figure 6. from the experiments, it can be observed that in most cases enrenew surpasses all other algorithms with λ varying from 0.5 to 2.0 on all networks. in addition, the results on cenew and email show that enrenew nearly reaches the upper bound. this indicates that enrenew has a stronger generalization ability compared with the other methods, and enrenew shows its superiority especially in strong-spreading experiments, when λ is large. generally speaking, if the selected nodes are widely spread in the network, they tend to have a more extensive influence on information spreading in the entire network. l_s is used to measure the dispersion of the initial infected nodes selected by each algorithm. figure 7 shows the l_s of nodes selected by the different algorithms on the six networks. it can be seen that, except for the amazon network, enrenew always has the largest l_s, indicating that the selected nodes are widely spread. especially in cenew, enrenew performs far beyond all the other methods, as its l_s is nearly as large as the upper bound. with regard to the large-scale amazon network, the network contains many small cliques and k-shell selects dispersed cliques, which gives k-shell the largest l_s; however, the other experimental results of k-shell show poor performance. this further confirms that enrenew does not naively distribute the selected nodes widely across the network, but rather selects them based on the potential propagation ability of each node. (figure 5 caption: this experiment compares the different methods with regard to spreading speed; each subfigure shows the results on one network. the ratio of initial infected nodes is 3% for cenew, email, hamster and router, 0.3% for condmat, and 0.03% for amazon. the results are obtained by averaging over 100 independent runs with spread rate λ = 1.5 in sir. for the same spreading time t, a larger f(t) indicates a larger influence scale in the network, which reveals a faster spreading speed. it can be seen from the figures that enrenew spreads apparently faster than the other benchmark methods on all networks; on the small networks cenew and email, enrenew's spreading speed is close to the upper bound.)
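the curves f(t) and the final scale f(t_c) discussed above come from sir simulations; a minimal discrete-time sketch is shown below, assuming synchronous updates, an infection probability lam per infected-neighbour contact, and a recovery probability mu per step. the parameter choices and the barabási–albert stand-in network are illustrative and may differ from the paper's settings (where λ is reported as a relative spread rate).

```python
import random
import networkx as nx

def sir_infected_scale(G, seeds, lam=0.1, mu=1.0, max_steps=10000, rng=None):
    """return the curve f(t) = (n_i(t) + n_r(t)) / n; the last value is f(t_c)."""
    rng = rng or random.Random(0)
    infected, recovered = set(seeds), set()
    n = G.number_of_nodes()
    curve = []
    for _ in range(max_steps):
        newly_infected = set()
        for i in infected:
            for j in G[i]:
                if j not in infected and j not in recovered and rng.random() < lam:
                    newly_infected.add(j)
        newly_recovered = {i for i in infected if rng.random() < mu}
        infected = (infected - newly_recovered) | newly_infected
        recovered |= newly_recovered
        curve.append((len(infected) + len(recovered)) / n)
        if not infected:   # stable state reached
            break
    return curve

if __name__ == "__main__":
    G = nx.barabasi_albert_graph(1000, 3, seed=1)   # stand-in network
    finals = [sir_infected_scale(G, [0, 1, 2], rng=random.Random(k))[-1] for k in range(100)]
    print(sum(finals) / len(finals))                # averaged final affected scale f(t_c)
```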
(figure 6 caption: this experiment tests the algorithms' effectiveness under different spreading conditions; each subfigure shows the results on one network. the ratio of initial infected nodes is 3% for cenew, email, hamster and router, 0.3% for condmat, and 0.03% for amazon. the results are obtained by averaging over 100 independent runs. different infection rates λ in sir imitate different spreading conditions. enrenew obtains a larger final affected scale f(t_c) for the different values of λ than all the other benchmark methods, which indicates that the proposed algorithm generalizes better to different spreading conditions.) (figure 7 caption: this experiment analyzes the average shortest path length l_s of the nodes selected by the different algorithms; each subfigure shows the results on one network. p is the ratio of initial infected nodes. generally speaking, a larger l_s indicates that the selected nodes are more sparsely distributed in the network. it can be seen that the nodes selected by enrenew have the clearly largest l_s on five networks, which shows that enrenew tends to select sparsely distributed nodes.) the influential node identification problem has been widely studied by scientists from computer science through to many other disciplines [96-100]. various algorithms have been proposed to solve particular problems in this field. in this study, we proposed a new method named enrenew by introducing entropy into complex networks, and the sir model was adopted to evaluate the algorithms. experimental results on six real networks, varying from small to large in size, show that enrenew is superior to state-of-the-art benchmark methods in most cases. in addition, with its low computational complexity, the presented algorithm can be applied to large-scale networks. the enrenew method proposed in this paper can also be applied in rumor control, advertisement targeting, and many other related areas. however, many challenges remain for influential node identification from different perspectives. with respect to network size, efficiently mining influential spreaders in large-scale networks remains a challenging problem. in time-varying networks, the topology changes constantly, which makes it difficult to identify influential spreaders because they may shift with the changing topology. multilayer networks contain information from different dimensions, with interactions between layers, and have attracted much research interest [101-103]; to identify influential nodes in multilayer networks, methods are needed that better combine the information from the different layers and the relations between them. the scientific collaboration networks in university management in brazil arenas, a.
self-similar community structure in a network of human interactions insights into protein-dna interactions through structure network analysis statistical analysis of the indian railway network: a complex network approach social network analysis network analysis in the social sciences prediction in complex systems: the case of the international trade network the dynamics of viral marketing extracting influential nodes on a social network for information diffusion structure and dynamics of molecular networks: a novel paradigm of drug discovery: a comprehensive review efficient immunization strategies for computer networks and populations a study of epidemic spreading and rumor spreading over complex networks epidemic processes in complex networks unification of theoretical approaches for epidemic spreading on complex networks epidemic spreading in time-varying community networks suppression of epidemic spreading in complex networks by local information based behavioral responses efficient allocation of heterogeneous response times in information spreading process absence of influential spreaders in rumor dynamics a model of spreading of sudden events on social networks daniel bernoulli?s epidemiological model revisited herd immunity: history, theory, practice epidemic disease in england: the evidence of variability and of persistency of type infectious diseases of humans: dynamics and control thermodynamic efficiency of contagions: a statistical mechanical analysis of the sis epidemic model a rumor spreading model based on information entropy an algorithmic information calculus for causal discovery and reprogramming systems the hidden geometry of complex, network-driven contagion phenomena extending centrality the h-index of a network node and its relation to degree and coreness identifying influential nodes in complex networks identifying influential nodes in large-scale directed networks: the role of clustering collective dynamics of ?small-world?networks identification of influential spreaders in complex networks ranking spreaders by decomposing complex networks eccentricity and centrality in networks the centrality index of a graph a set of measures of centrality based on betweenness a new status index derived from sociometric analysis mutual enhancement: toward an understanding of the collective preference for shared information factoring and weighting approaches to status scores and clique identification dynamical systems to define centrality in social networks the anatomy of a large-scale hypertextual web search engine leaders in social networks, the delicious case using mapping entropy to identify node centrality in complex networks path diversity improves the identification of influential spreaders how to identify the most powerful node in complex networks? 
a novel entropy centrality approach a novel entropy-based centrality approach for identifying vital nodes in weighted networks node importance ranking of complex networks with entropy variation key node ranking in complex networks: a novel entropy and mutual information-based approach a new method to identify influential nodes based on relative entropy influential nodes ranking in complex networks: an entropy-based approach discovering important nodes through graph entropy the case of enron email database identifying node importance based on information entropy in complex networks ranking influential nodes in complex networks with structural holes ranking influential nodes in social networks based on node position and neighborhood detecting rich-club ordering in complex networks maximizing the spread of influence through a social network efficient influence maximization in social networks identifying sets of key players in a social network a shapley value-based approach to discover influential nodes in social networks identifying a set of influential spreaders in complex networks identifying effective multiple spreaders by coloring complex networks effects of the distance among multiple spreaders on the spreading identifying multiple influential spreaders in term of the distance-based coloring identifying multiple influential spreaders by a heuristic clustering algorithm spin glass approach to the feedback vertex set problem effective spreading from multiple leaders identified by percolation in the susceptible-infected-recovered (sir) model finding influential communities in massive networks community-based influence maximization in social networks under a competitive linear threshold model a community-based algorithm for influence blocking maximization in social networks detecting community structure in complex networks via node similarity community structure detection based on the neighbor node degree information community-based greedy algorithm for mining top-k influential nodes in mobile social networks identifying influential nodes in complex networks with community structure an efficient memetic algorithm for influence maximization in social networks efficient algorithms for influence maximization in social networks local structure can identify and quantify influential global spreaders in large scale social networks identifying influential spreaders in complex networks by propagation probability dynamics systematic comparison between methods for the detection of influential spreaders in complex networks vital nodes identification in complex networks sir rumor spreading model in the new media age stochastic sir epidemics in a population with households and schools thresholds for epidemic spreading in networks a novel top-k strategy for influence maximization in complex networks with community structure identifying influential spreaders in complex networks based on kshell hybrid method identifying key nodes based on improved structural holes in complex networks ranking nodes in complex networks based on local structure and improving closeness centrality an efficient algorithm for mining a set of influential spreaders in complex networks the large-scale organization of metabolic networks the koblenz network collection the network data repository with interactive graph analytics and visualization measuring isp topologies with rocketfuel graph evolution: densification and shrinking diameters defining and evaluating network communities based on ground-truth the spread of obesity in a large 
social network over 32 years identifying the influential nodes via eigen-centrality from the differences and similarities of structure tracking influential individuals in dynamic networks evaluating influential nodes in social networks by local centrality with a coefficient a survey on topological properties, network models and analytical measures in detecting influential nodes in online social networks identifying influential spreaders in noisy networks spreading processes in multilayer networks identifying the influential spreaders in multilayer interactions of online social networks identifying influential spreaders in complex multilayer networks: a centrality perspective we would also thank dennis nii ayeh mensah for helping us revising english of this paper. the authors declare no conflict of interest. key: cord-148358-q30zlgwy authors: pang, raymond ka-kay; granados, oscar; chhajer, harsh; legara, erika fille title: an analysis of network filtering methods to sovereign bond yields during covid-19 date: 2020-09-28 journal: nan doi: nan sha: doc_id: 148358 cord_uid: q30zlgwy in this work, we investigate the impact of the covid-19 pandemic on sovereign bond yields amongst european countries. we consider the temporal changes from financial correlations using network filtering methods. these methods consider a subset of links within the correlation matrix, which gives rise to a network structure. we use sovereign bond yield data from 17 european countries between the 2010 and 2020 period as an indicator of the economic health of countries. we find that the average correlation between sovereign bonds within the covid-19 period decreases, from the peak observed in the 2019-2020 period, where this trend is also reflected in all network filtering methods. we also find variations between the movements of different network filtering methods under various network measures. the novel coronavirus disease 2019 (covid-19) epidemic caused by sars-cov-2 began in china in december 2019 and rapidly spread around the world. the confirmed cases increased in different cities of china, japan, and south korea in a few days of early january 2020, but spread globally with new cases in iran, spain, and italy within the middle of february. we focus on sovereign bonds during the covid-19 period to highlight the extent to which the pandemic has influenced the financial markets. in the last few years, bond yields across the euro-zone were decreasing under a range of european central bank (ecb) interventions, and overall remained stable compared with the german bund, a benchmark used for european sovereign bonds. these movements were disrupted during the covid-19 pandemic, which has affected the future trajectory of bond yields from highly impacted countries, e.g., spain and italy. however, in the last months, the european central banks intervened in financial and monetary markets to consolidate stability through an adequate supply of liquidity countering the possible margin calls and the risks of different markets and payment systems. these interventions played a specific role in sovereign bonds because, on the one side, supported the stability of financial markets and, on the other side, supported the governments' financial stability and developed a global reference interest rate scheme. understanding how correlations now differ and similarities observed in previous financial events are important in dealing with the future economic effects of covid19. 
we consider an analysis of sovereign bonds by using network filtering methods, which is part of a growing literature within the area of econophysics [29, 44, 30, 28, 17] . the advantages in using filtering methods is the extraction of a network type structure from the financial correlations between sovereign bonds, which allows the properties of centrality and clustering to be considered. in consequence, the correlation-based networks and hierarchical clustering methodologies allow us to understand the nature of financial markets and some features of sovereign bonds. it is not clear which approach should be used in analyzing sovereign bond yields, and so within this paper, we implement various filtering methods to the sovereign bond yield data and compare the resulting structure of different networks. our analysis shows that over the last decade, the mean correlation peaks in october 2019 and then decreases during the 2020 period, when covid-19 is most active in europe. these dynamics are reflected across all network filtering methods and represent the wide impact of covid-19 towards the spectrum of correlations, compared to previous financial events. we consider the network centrality of sovereign bonds within the covid-19 period, which remains consistent with previous years. these trends are distinctive between filtering methods and stem from the nature of correlations towards economic factors e.g., positive correlations show a stable trend in the individual centrality, compared with the volatile trends for negative correlations, where central nodes within these networks are less integrated in the euro-area. although there is a change in the magnitude of correlations, the overall structure relative to the central node is maintained within the covid-19 period. previous studies have used different methods to analyze historic correlations as random matrix theory to identify the distribution of eigenvalues concerning financial correlations [27, 39, 23] , the approaches from information theory in exploring the uncertainty within the financial system [20, 12] , multilayer network methods [1, 7, 46, 24, 18, 40] , and filtering methods. several authors have used network filtering methods to explain financial structures [31, 37] , hierarchy and networks in financial markets [50] , relations between financial markets and real economy [34] , volatility [51] , interest rates [33] , stock markets [21, 52, 53, 2] , future markets [8] or topological dynamics [45] to list a few. also, the comparison of filtering methods to market data has been used for financial instruments. birch, et al [10] consider a comparison of filtering methods of the dax30 stocks. musmeci, et al [35] propose a multiplex visual network approach and consider data of multiple stock indexes. kukreti, et al [26] use the s&p500 market data and incorporate entropy measures with a range of network filtering methods. aste, et al [5] apply a comparison of network filtering methods on the us equity market data and assess the dynamics using network measures. in order to evaluate the european sovereign bonds, based on filtering methods, this work is organized as follows. in section 2, we describe the network filtering methods and present the data sets with some preliminary empirical analyses. we apply in section 3 the filtering methods to sovereign bond yields and analyze the trend of financial correlations over the last decade and consider aspects of the network topology. 
we construct plots in section 4 representing the covid-19 period for all methods and analyze the clustering between countries. in section 5, we discuss the results and future directions. we introduce a range of network filtering methods and consider a framework as in [31] for sovereign bond yields. we define n ∈ ℕ to be the number of sovereign bonds and y_i(t) to be the bond yield of the ith sovereign bond at time t, where i ∈ {1, ..., n}. the correlation coefficients r_ij(t) ∈ [−1, 1] are defined using the pearson correlation as r_ij(t) = (<y_i y_j> − <y_i><y_j>) / sqrt((<y_i^2> − <y_i>^2)(<y_j^2> − <y_j>^2)), with <·> denoting the average of the yield values over the window. the notion of distance d_ij ∈ [0, 2] is based on the entries r_ij of the correlation matrix r ∈ [−1, 1]^(n×n), with d_ij = sqrt(2(1 − r_ij)). a distance of d_ij = 0 represents perfectly positive correlation and d_ij = 2 represents perfectly negative correlation. the network filtering methods are then applied to the distance matrix d ∈ [0, 2]^(n×n), where a subset of links (or edges) is chosen under each filtering method. the set of edges at time t is indicated by {(i, j) ∈ e(t) : nodes i and j are connected}, defined for each filtering method. we define the time frames of financial correlations by the set of observations x, with n columns and t rows. from the set of observations x, we consider windows of length t = 120, which is equal to six months of data values. we then displace windows by δ = 10 data points, which is equal to two weeks of data values, and discard previous observations until all data points are used. by displacing the data in this way, we can examine the time series trend between successive windows x. we verify the statistical reliability of correlations by using a non-parametric bootstrapping approach as in efron [15], which is used in tumminello, et al [48, 49]. we randomly choose rows equal in number to the window length t, allowing repeated rows to be chosen. we compute the correlation matrix for this window x*_m and repeat the procedure until m samples are generated, where m is chosen as 10,000. the error between data points described in efron [15] is equal to (1 − ρ^2)/t, so that highly positively and negatively correlated values ρ have the smallest errors. the minimum spanning tree (mst) method is a widely known approach which has been used for currency markets [22], stock markets [42, 43], and sovereign bond yields [13]. the mst from table 1 considers the smallest edges and prioritizes connections of high correlation to form a connected and undirected tree network. this approach can be constructed with a greedy-type algorithm, e.g., kruskal's or prim's algorithm, and satisfies the subdominant ultrametric distance property, i.e., d_ij ≤ max{d_ik, d_kj} for all i, j, k ∈ {1, ..., n}. a maximum spanning tree (mast) constructs a connected and undirected tree network with n − 1 edges that maximizes the total edge weight. analyses involving the mast have been used as comparisons to results obtained with mst approaches [14, 19]. an mast approach is informative for connections of perfect anti-correlation between nodes, which are not observed within the mst. a network formed from asset graphs (ag) considers positive correlations between nodes above a given threshold. within the mst, some links of positive correlation are not considered in order to satisfy the properties of the tree network. the n − 1 highest correlations are all considered in an ag, allowing for the formation of cliques not observed within an mst or mast network.
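a minimal sketch of the pipeline just described, assuming the bond yields are held in a pandas dataframe with one column per country and one row per trading day; the function name and the toy random window are illustrative only, not part of the paper.

```python
import numpy as np
import pandas as pd
import networkx as nx

def mst_from_yields(yields: pd.DataFrame) -> nx.Graph:
    R = yields.corr(method="pearson")          # correlation matrix r_ij
    D = np.sqrt(2.0 * (1.0 - R))               # distance matrix d_ij in [0, 2]
    G = nx.Graph()
    cols = list(D.columns)
    for a in range(len(cols)):
        for b in range(a + 1, len(cols)):
            G.add_edge(cols[a], cols[b], weight=float(D.iloc[a, b]))
    return nx.minimum_spanning_tree(G, weight="weight")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    window = pd.DataFrame(rng.normal(size=(120, 5)), columns=list("abcde"))   # stand-in window
    print(sorted(mst_from_yields(window).edges(data="weight")))
```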
the use of the ag has been considered in onnela, et al [38], which identifies clustering within stock market data. as the method only considers n − 1 links, some nodes within the ag may not be connected for the given threshold, and therefore the connection of unconnected nodes is unknown relative to the connected components. table 1 summarizes the filtering methods: the minimum spanning tree (mst), with n − 1 edges [25], is a connected and undirected network for n nodes which minimizes the total edge weight; the maximum spanning tree (mast), with n − 1 edges [41], is a connected and undirected network for n nodes which maximizes the total edge weight; the asset graph (ag), with n − 1 edges [36], is formed by choosing the smallest n − 1 edges from the distance matrix; and the triangulated maximal filtering graph (tmfg), with 3(n − 2) edges [32], is a planar filtered graph under an assigned objective function. the triangulated maximal filtering graph (tmfg) constructs a network of 3(n − 2) fixed edges for n nodes, similar to the planar maximal filtered graph (pmfg) [47], which has been used to analyze us stock trends [35]. the algorithm initially chooses a clique of 4 nodes; edges are then added sequentially, in order to optimize the objective function, e.g., the total edge weight of the network, until all nodes are connected. this approach is non-greedy in choosing edges and incorporates the formation of cliques within the network structure. a tmfg is also an approximate solution to the weighted planar maximal graph problem, and is computationally faster than the pmfg. the resulting network includes more information about the correlation matrix compared with spanning tree approaches, while still maintaining a level of sparsity between nodes. european sovereign debt has evolved over the last ten years, with several episodes affecting the convergence between bond yields. after the 2008 crisis, european countries experienced a period of financial stress starting in 2010 that affected bond yields, as investors saw an excessive amount of sovereign debt and demanded higher interest rates amid low economic growth and high fiscal deficits. during 2010-2012, several european countries suffered downgrades of their bond ratings to junk status, which affected investor trust and fuelled fears of sovereign risk contagion, resulting in some cases in a differential of over 1,000 basis points in several sovereign bonds. after the introduction of austerity measures in the giips countries, the bond markets returned to normality in 2015. the 2012 european debt crisis particularly revealed spillover effects between different sovereign bonds, which have been studied using various time series models, e.g., var [11, 4] and garch [6]. the results showed that portugal, greece, and ireland have a greater domestic effect, that italy and spain contributed to the spillover effects to other european bond markets, and that a core group of abfn countries (austria, belgium, france, and the netherlands) had a lower contribution to the spillover effects, with some of the least impacted countries residing outside the euro zone. during the sovereign debt crisis, public indebtedness increased after greece had to correct falsified public finance data, and other countries created schemes to solve their public finance problems, especially bank bailouts. in consequence, the average debt-to-gdp ratio across the euro-zone countries rose from 72% in 2006 to 119.5% in 2014, alongside an increase in sovereign credit risk [3, 9].
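the mast and ag filters of table 1 can be built from the same distance matrix; the sketch below assumes a full weighted graph constructed as in the previous sketch and uses networkx's maximum spanning tree, while the asset graph simply keeps the n − 1 smallest-distance edges (so some nodes may remain isolated, as noted above). this is an illustrative sketch, not the authors' code; tmfg construction is more involved and is not shown.

```python
import numpy as np
import pandas as pd
import networkx as nx

def full_distance_graph(D: pd.DataFrame) -> nx.Graph:
    G = nx.Graph()
    cols = list(D.columns)
    for a in range(len(cols)):
        for b in range(a + 1, len(cols)):
            G.add_edge(cols[a], cols[b], weight=float(D.iloc[a, b]))
    return G

def mast(G: nx.Graph) -> nx.Graph:
    # maximum spanning tree: n - 1 edges maximising the total edge weight
    return nx.maximum_spanning_tree(G, weight="weight")

def asset_graph(G: nx.Graph) -> nx.Graph:
    # keep only the n - 1 smallest-distance edges; isolated nodes are allowed
    n = G.number_of_nodes()
    smallest = sorted(G.edges(data=True), key=lambda e: e[2]["weight"])[: n - 1]
    A = nx.Graph()
    A.add_nodes_from(G.nodes())
    A.add_edges_from(smallest)
    return A

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    window = pd.DataFrame(rng.normal(size=(120, 6)), columns=list("abcdef"))  # stand-in window
    D = np.sqrt(2.0 * (1.0 - window.corr()))
    G = full_distance_graph(D)
    print(mast(G).number_of_edges(), asset_graph(G).number_of_edges())
```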
after the fiscal compact treaty went into effect at the start of 2013, which required fiscal principles to be embedded in the national legislation of each country that signed the treaty, sovereign bond yields started a correction, although some investors and institutions . four of the listed countries are part of the g7 and g20 economic groups (germany, france, italy, and the uk). we consider sovereign bond yields with a 10-year maturity between january 2010 and june 2020. the data are taken from the financial news platform 1 . in total, there are 2,491 data values for each country, with an average of 240 data points per year. table 2 provides summary statistics of the 10y bond yield data. the results show greek yields to have the highest values across all statistical measures compared with the other countries' yields, particularly within the 2010-2011 period (maximum yield of 39.9). in contrast, swiss bond yields exhibit the smallest mean and variance, with a higher than average positive skewness compared with other sovereign bonds. under the jb test for normality of the data distributions, all bond yield series have negligible p-values, indicating non-gaussian distributions. the left-skewed yield distributions (except for iceland), which reflect an average decrease in yield values each year, are more pronounced for the giips countries than for the uk, france, and germany, whose yield trends are flatter. we compute the correlation matrix for each window x, with a displacement of δ between windows, and consider the mean and variance of the correlation matrix. we define the mean correlation r(t), given the correlations r_ij(t) for the n sovereign bonds, as r(t) = (2 / (n(n − 1))) * sum over i < j of r_ij(t). from figure 1, we find that the mean correlation r(t) is highest at 0.95 in october 2019. this suggests that the covid-19 impact was a continuation of the decrease in mean correlation from that peak, which persisted throughout the strict lockdown measures introduced by the majority of european countries in february-march 2020. decreases in the mean correlation are also observed in the 2012 period, during the european debt crisis, in which several european countries received eu-imf bailouts to cope with government debt, and in 2016, under a combination of political events within the uk and the increased debt accumulation by italian banks. the variance u(t) follows a trend similar to the mean correlation, with the smallest variance of 0.002 in october 2019. within 2020, the variance between sovereign bonds increases, reflecting the differences between the correlations of low- and high-yield bonds. we consider the normalized network length l(t), which was introduced in onnela, et al [36] as the normalized tree length; we refer to it as the normalized network length because the measure is also applied to the non-tree ag and tmfg networks. the network length is the mean link weight over the subset of links e(t) present in the filtered network on the distance matrix at time t, l(t) = (1 / |e(t)|) * sum over (i,j) ∈ e(t) of d_ij(t). the plots in figure 2 represent the mean and variance of the network length. as each filtering method considers a subset of weighted links, the normalized length l(t) is monotonic across all methods and decreases with an increased proportion of positively correlated links within the network. we highlight the movements in the normalized network length during the covid-19 period, which are reflected across all filtering methods. a similar movement is observed within 2016, but only for a subset of correlations, in which the network length of the mast and tmfg increases compared with the mst and ag.
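the window-level summaries defined above (the mean correlation r(t), its variance u(t), and the normalized network length l(t) of a filtered edge set e(t)) reduce to a few lines; the sketch below is illustrative and assumes the correlation matrix is a pandas dataframe and the filtered network a weighted networkx graph. applied to each rolling window, e.g. normalised_network_length(mst_from_yields(window)) with the earlier sketch, it reproduces a curve like l(t).

```python
import numpy as np
import pandas as pd
import networkx as nx

def mean_and_variance_of_correlations(R: pd.DataFrame):
    iu = np.triu_indices(len(R), k=1)          # distinct pairs i < j
    vals = R.values[iu]
    return float(vals.mean()), float(vals.var())

def normalised_network_length(filtered: nx.Graph) -> float:
    # l(t): the mean link weight d_ij over the edge set e(t) of the filtered network
    weights = [d["weight"] for _, _, d in filtered.edges(data=True)]
    return float(np.mean(weights))
```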
the relative difference between the normalized network lengths is least evident in periods of low variance; this is observed in the 2019-2020 period, where the difference between all methods decreases. we find that the variance is highest for the tmfg and lowest for the ag approach. the increased inclusion of links with a higher reliability error in the tmfg increases the variance, particularly within the 2014-2017 period. the variance of the mst is on average higher than that of the mast, but when only the highest correlated links are considered, as in the ag, the variance decreases. we define the degree centrality c(t) at time t as the degree of the node of maximum degree; this measure counts the number of direct links of the most connected node. the mean occupation layer η(t) (mol), introduced in onnela, et al [36], is a measure of the centrality of the network relative to the central node υ(t). we define lev_i(t) as the level of node i, which is the distance of the node relative to υ(t), where the central node and nodes unconnected to the central node have a level value of 0; the mol is then η(t) = (1/n) * sum over i of lev_i(t). we use the betweenness centrality to define the central node υ(t) for the mol. introduced in freeman [16], the betweenness b(t) considers the number of shortest paths σ_ij(k) between i and j which pass through node k, relative to the total number of shortest paths σ_ij between i and j, i.e., b_k(t) = sum over i ≠ j of σ_ij(k) / σ_ij. within the mst, the degree centrality ranges between 3 and 5 for euro-zone countries. the trend within the mst remains stable, where the central node under degree centrality is associated with multiple sovereign bonds, e.g., the netherlands 19%, portugal 10%, and belgium 9% across all periods. the mast has the highest variation, with a centralized network structure in some periods, e.g., c(t) of 16, forming a star-shaped network structure; this is usually associated with greece, iceland, and hungary, which are identified as the central node 55% of the time. the degree centrality is on average naturally highest for the tmfg, owing to its higher network density, where the central nodes are identified as the hungarian and romanian sovereign bonds, similar to the mast. the ag identifies the netherlands and belgium under degree centrality, with higher proportions of 25% and 13% compared with the mst. within figure 3, the mol is on average smallest for the ag because of the 0 level values of unconnected nodes; an unconnected node is present in 94% of the considered windows. we find that the nodes within the tmfg are closest to each other, where the central node is directly or indirectly connected to all nodes, with an average path length of 1.1 across all periods. between the mst and mast, the mol is higher within the mast, where nodes within the network have a higher degree centrality. we analyze the temporal changes of sovereign bond yields between october 2019 and june 2020. the associated link weights in each filtering method for window x are the proportions with which each link appears, under the statistical reliability procedure, across all m randomly sampled windows x*_m. under the mst, austria has the highest degree centrality of 4. the network also exhibits clusters between southern european countries connected by spain, and between the uk and polish and german sovereign bond yields. within the network, there is a connection between all abfn countries, but countries within this group also provide the connecting component to the giips countries, where belgium is connected with the spanish and irish sovereign bonds.
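a sketch of the centrality summaries described above: the maximum degree c(t), the central node chosen by betweenness, and the mean occupation layer η(t), in which the central node and nodes unreachable from it contribute level 0. the function names are illustrative; only the betweenness-based choice of the central node follows the text.

```python
import networkx as nx

def max_degree(G):
    return max(d for _, d in G.degree())

def central_node_by_betweenness(G):
    bc = nx.betweenness_centrality(G)
    return max(bc, key=bc.get)

def mean_occupation_layer(G):
    centre = central_node_by_betweenness(G)
    levels = nx.single_source_shortest_path_length(G, centre)   # distance to the central node
    # nodes unreachable from the central node are absent from `levels` and count as level 0
    return sum(levels.get(v, 0) for v in G) / G.number_of_nodes()
```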
the uk and eastern european countries remain on the periphery, with abfn countries occupying the core of the network structure. for the mast in figure 4 , there exists a high degree centrality for polish sovereign bonds between western european countries e.g., france and netherlands. this contrasts to the observed regional hub structure within the mst, with the existence of several sovereign bonds with high degree centrality in the network. the uk remains within the periphery of the mast structure when considering anti-correlations, and shows uk bond yields fluctuate less with movements of other european bonds compared with previous years. this is also observed for sovereign bonds for other countries with non-euro currencies such as czech republic, hungary, and iceland. we find nodes within the tmfg to have the highest degree in iceland at 13 and poland at 10. although the mst is embedded within the tmfg network structure, a high resemblance is observed to links from the mast, where 69% of links which are present within the mast are common in both networks. there is also the associated degree centrality of the mast, which is observed within the tmfg connected nodes. under the tmfg, nodes have a higher degree connectivity when considering an increased number of links, this is the case for the uk, which has 9 links compared with other spanning tree approaches. the ag exhibits three connected components between western european countries, southern european countries and the uk with eastern european countries. these unconnected nodes within the ag are associated with non-euro adopting countries, with the remaining countries connected in an individual component. by solely considering the most positive correlations, we observe the formation of 3-cliques between countries, which is prevalent within the western european group of 6 nodes. the average statistical reliability is highest at 0.92 within the mast and ag, 0.89 for the mst and 0.82 for the tmfg. under the tmfg, the increased inclusion of links with a lower magnitude in correlations decreases the reliability in link values. other filtering approaches which consider a smaller subset can still result in low reliability values between some nodes e.g. austria and romania at 0.51 in the mst, germany and netherlands at 0.47 in ag. under various constraints, we observe a commonality between sovereign bonds across network filtering methods. we find for tree networks, that euro-area countries have a high degree centrality and countries with non-euro currencies e.g. czech republic and the uk are predominately located within the periphery of the network. this is further observed within the ag, where cliques are formed between giips and abfn countries, which is distinctive during the covid-19 period compared with previous years. the anti-correlations within the mast inform the trends of the negative correlations between eastern european countries and other european countries. by considering the tmfg with an increased number of links for positive correlations, we find similarities with the mast degree centrality. as a response to the covid-19 pandemic, most countries implemented various socio-economic policies and business restrictions almost simultaneously. an immediate consequence was an increase in yield rates for these nations. the resulting upward co-movement and upward movements in other yield rates explain the decrease in the mean correlation in bond dynamics, coinciding with the pandemic outbreak. 
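the statistical reliability quoted above can be estimated by bootstrapping rows of the window, rebuilding the filtered network on each resample, and recording how often each link of the original network reappears; the sketch below is an illustrative implementation of that idea (filter_fn can be, for example, an mst-building function like the one sketched earlier), not the authors' code.

```python
import numpy as np
import pandas as pd

def edge_reliability(window: pd.DataFrame, filter_fn, n_boot=10000, seed=0):
    rng = np.random.default_rng(seed)
    base = filter_fn(window)                          # network built from the original window
    counts = {frozenset(e): 0 for e in base.edges()}
    for _ in range(n_boot):
        rows = rng.integers(0, len(window), size=len(window))   # resample rows with replacement
        boot_net = filter_fn(window.iloc[rows])
        for e in boot_net.edges():
            key = frozenset(e)
            if key in counts:
                counts[key] += 1
    # fraction of bootstrap replicates in which each original link reappears
    return {tuple(sorted(k)): c / n_boot for k, c in counts.items()}
```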
thus, understanding the dynamics of financial instruments in the euro area is important to assess the increased economic strain from events seen in the last decade. in this paper, we consider the movements of european sovereign bond yields for network filtering methods, where we particularly focus on the covid-19 period. we find that the impact of covid-banks starts to drop off, the market dynamics could adjust to economic performance and not its financial performance. in other words, the resulting dynamics could explain an increase in mean correlation in bond dynamics coinciding with the economic dynamics after the pandemic and the increment in yield rates. although we consider the sovereign bond yields with a 10y maturity as a benchmark, this research can be extended to sovereign bonds with different maturities (e.g., short term 1y, 2y or 5y, and long term 20y or 30y) because these bonds could reveal interesting effects and confirm that sovereign bonds are a good indicator to identify the economic impact of covid-19. as each sovereign bond has different yield and volatility trends, we considered using the zero-coupon curve to evaluate the full extent of covid-19 on sovereign bonds. multiplex interbank networks and systemic importance: an application to european data clustering stock markets for balanced portfolio construction the dynamics of spillover effects during the european sovereign debt turmoil sovereign bond yield spillovers in the euro zone during the financial and debt crisis correlation structure and dynamics in volatile markets spillover effects on government bond yields in euro zone. does full financial integration exist in european government bond markets? interbank markets and multiplex networks: centrality measures and statistical null models multi-scale correlations in different futures markets the geography of the great rebalancing in euro area bond markets during the sovereign debt crisis analysis of correlation based networks representing dax 30 stock price returns measuring bilateral spillover and testing contagion on sovereign bond markets in europe the entropy as a tool for analysing statistical dependences in financial time series sovereign debt crisis in the european union: a minimum spanning tree approach spanning trees and the eurozone crisis bootstrap methods: another look at the jackknife a set of measures of centrality based on betweenness comovements in government bond markets: a minimum spanning tree analysis using multiplex networks for banking systems dynamics modelling maximal spanning trees, asset graphs and random matrix denoising in the analysis of dynamics of financial networks multifractal diffusion entropy analysis on stock volatility in financial markets dynamic correlation network analysis of financial asset returns with network clustering currency crises and the evolution of foreign exchange market: evidence from minimum spanning tree correlation of financial markets in times of crisis multi-layered interbank model for assessing systemic risk on the shortest spanning subtree of a graph and the traveling salesman problem a perspective on correlation-based financial networks and entropy measures random matrix theory and financial correlations extracting the sovereigns' cds market hierarchy: a correlation-filtering approach portfolio optimization based on network topology complex networks and minimal spanning trees in international trade network hierarchical structure in financial markets network filtering for big data: triangulated maximally 
filtered graph interest rates hierarchical structure relation between financial market structure and the real economy: comparison between clustering methods the multiplex dependency structure of financial markets dynamic asset trees and black monday asset trees and asset graphs in financial markets clustering and information in correlation based financial networks random matrix approach to cross correlations in financial data the multi-layer network nature of systemic risk and its implications for the costs of financial crises universal and nonuniversal allometric scaling behaviors in the visibility graphs of world stock market indices pruning a minimum spanning tree on stock market dynamics through ultrametricity of minimum spanning tree causality networks of financial assets complexities in financial network topological dynamics: modeling of emerging and developed stock markets cross-border interbank networks, banking risk and contagion a tool for filtering information in complex systems spanning trees and bootstrap reliability estimation in correlation-based networks hierarchically nested factor model from multivariate data correlation, hierarchies, and networks in financial markets a cluster driven log-volatility factor model: a deepening on the source of the volatility clustering multiscale correlation networks analysis of the us stock market: a wavelet analysis network formation in a multi-asset artificial stock market key: cord-007708-hr4smx24 authors: van kampen, antoine h. c.; moerland, perry d. title: taking bioinformatics to systems medicine date: 2015-08-13 journal: systems medicine doi: 10.1007/978-1-4939-3283-2_2 sha: doc_id: 7708 cord_uid: hr4smx24 systems medicine promotes a range of approaches and strategies to study human health and disease at a systems level with the aim of improving the overall well-being of (healthy) individuals, and preventing, diagnosing, or curing disease. in this chapter we discuss how bioinformatics critically contributes to systems medicine. first, we explain the role of bioinformatics in the management and analysis of data. in particular we show the importance of publicly available biological and clinical repositories to support systems medicine studies. second, we discuss how the integration and analysis of multiple types of omics data through integrative bioinformatics may facilitate the determination of more predictive and robust disease signatures, lead to a better understanding of (patho)physiological molecular mechanisms, and facilitate personalized medicine. third, we focus on network analysis and discuss how gene networks can be constructed from omics data and how these networks can be decomposed into smaller modules. we discuss how the resulting modules can be used to generate experimentally testable hypotheses, provide insight into disease mechanisms, and lead to predictive models. throughout, we provide several examples demonstrating how bioinformatics contributes to systems medicine and discuss future challenges in bioinformatics that need to be addressed to enable the advancement of systems medicine. systems medicine fi nds its roots in systems biology, the scientifi c discipline that aims at a systems-level understanding of, for example, biological networks, cells, organs, organisms, and populations. it generally involves a combination of wet-lab experiments and computational (bioinformatics) approaches. 
systems medicine extends systems biology by focusing on the application of systems-based approaches to clinically relevant applications in order to improve patient health or the overall well-being of (healthy) individuals [ 1 ] . systems medicine is expected to change health care practice in the coming years. it will contribute to new therapeutics through the identifi cation of novel disease genes that provide drug candidates less likely to fail in clinical studies [ 2 , 3 ] . it is also expected to contribute to fundamental insights into networks perturbed by disease, improved prediction of disease progression, stratifi cation of disease subtypes, personalized treatment selection, and prevention of disease. to enable systems medicine it is necessary to characterize the patient at various levels and, consequently, to collect, integrate, and analyze various types of data including not only clinical (phenotype) and molecular data, but also information about cells (e.g., disease-related alterations in organelle morphology), organs (e.g., lung impedance when studying respiratory disorders such as asthma or chronic obstructive pulmonary disease), and even social networks. the full realization of systems medicine therefore requires the integration and analysis of environmental, genetic, physiological, and molecular factors at different temporal and spatial scales, which currently is very challenging. it will require large efforts from various research communities to overcome current experimental, computational, and information management related barriers. in this chapter we show how bioinformatics is an essential part of systems medicine and discuss some of the future challenges that need to be solved. to understand the contribution of bioinformatics to systems medicine, it is helpful to consider the traditional role of bioinformatics in biomedical research, which involves basic and applied (translational) research to augment our understanding of (molecular) processes in health and disease. the term "bioinformatics" was fi rst coined by the dutch theoretical biologist paulien hogeweg in 1970 to refer to the study of information processes in biotic systems [ 4 ] . soon, the fi eld of bioinformatics expanded and bioinformatics efforts accelerated and matured as the fi rst (whole) genome and protein sequences became available. the signifi cance of bioinformatics further increased with the development of highthroughput experimental technologies that allowed wet-lab researchers to perform large-scale measurements. these include determining whole-genome sequences (and gene variants) and genome-wide gene expression with next-generation sequencing technologies (ngs; see table 1 for abbreviations and web links) [ 5 ] , measuring gene expression with dna microarrays [ 6 ] , identifying and quantifying proteins and metabolites with nmr or (lc/ gc-) ms [ 7 ] , measuring epigenetic changes such as methylation and histone modifi cations [ 8 ] , and so on. these, "omics" technologies, are capable of measuring the many molecular building blocks that determine our (patho)physiology. genome-wide measurements have not only signifi cantly advanced our fundamental understanding of the molecular biology of health and disease but table 1 abbreviations and websites have also contributed to new (commercial) diagnostic and prognostic tests [ 9 , 10 ] and the selection and development of (personalized) treatment [ 11 ] . 
nowadays, bioinformatics is therefore defi ned as "advancing the scientifi c understanding of living systems through computation" (iscb), or more inclusively as "conceptualizing biology in terms of molecules and applying 'informatics techniques' (derived from disciplines such as applied mathematics, computer science and statistics) to understand and organize the information associated with these molecules, on a large scale" [ 12 ] . it is worth noting that solely measuring many molecular components of a biological system does not necessarily result in a deeper understanding of such a system. understanding biological function does indeed require detailed insight into the precise function of these components but, more importantly, it requires a thorough understanding of their static, temporal, and spatial interactions. these interaction networks underlie all (patho)physiological processes, and elucidation of these networks is a major task for bioinformatics and systems medicine . the developments in experimental technologies have led to challenges that require additional expertise and new skills for biomedical researchers: • information management. modern biomedical research projects typically produce large and complex omics data sets , sometimes in the order of hundreds of gigabytes to terabytes of which a large part has become available through public databases [ 13 , 14 ] sometimes even prior to publication (e.g., gtex, icgc, tcga). this not only contributes to knowledge dissemination but also facilitates reanalysis and metaanalysis of data, evaluation of hypotheses that were not considered by the original research group, and development and evaluation of new bioinformatics methods. the use of existing data can in some cases even make new (expensive) experiments superfl uous. alternatively, one can integrate publicly available data with data generated in-house for more comprehensive analyses, or to validate results [ 15 ] . in addition, the obligation of making raw data available may prevent fraud and selective reporting. the management (transfer, storage, annotation, and integration) of data and associated meta-data is one of the main and increasing challenges in bioinformatics that needs attention to safeguard the progression of systems medicine. • data analysis and interpretation . bioinformatics data analysis and interpretation of omics data have become increasingly complex, not only due to the vast volumes and complexity of the data but also as a result of more challenging research questions. bioinformatics covers many types of analyses including nucleotide and protein sequence analysis, elucidation of tertiary protein structures, quality control, pre-processing and statistical analysis of omics data, determination of genotypephenotype relationships, biomarker identifi cation, evolutionary analysis, analysis of gene regulation, reconstruction of biological networks, text mining of literature and electronic patient records, and analysis of imaging data. in addition, bioinformatics has developed approaches to improve experimental design of omics experiments to ensure that the maximum amount of information can be extracted from the data. many of the methods developed in these areas are of direct relevance for systems medicine as exemplifi ed in this chapter. clearly, new experimental technologies have to a large extent turned biomedical research in a data-and compute-intensive endeavor. 
it has been argued that production of omics data has nowadays become the "easy" part of biomedical research, whereas the real challenges currently comprise information management and bioinformatics analysis. consequently, next to the wet-lab, the computer has become one of the main tools of the biomedical researcher . bioinformatics enables and advances the management and analysis of large omics-based datasets, thereby directly and indirectly contributing to systems medicine in several ways ( fig. 1 3. quality control and pre-processing of omics data. preprocessing typically involves data cleaning (e.g., removal of failed assays) and other steps to obtain quantitative measurements that can be used in downstream data analysis. 4. (statistical) data analysis methods of large and complex omicsbased datasets. this includes methods for the integrative analysis of multiple omics data types (subheading 5 ), and for the elucidation and analysis of biological networks (top-down systems medicine; subheading 6 ). systems medicine comprises top-down and bottom-up approaches. the former represents a specifi c branch of bioinformatics, which distinguishes itself from bottom-up approaches in several ways [ 3 , 19 , 20 ] . top-down approaches use omics data to obtain a holistic view of the components of a biological system and, in general, aim to construct system-wide static functional or physical interaction networks such as gene co-expression networks and protein-protein interaction networks. in contrast, bottom-up approaches aim to develop detailed mechanistic and quantitative mathematical models for sub-systems. these models describe the dynamic and nonlinear behavior of interactions between known components to understand and predict their behavior upon perturbation. however, in contrast to omics-based top-down approaches, these mechanistic models require information about chemical/physical parameters and reaction stoichiometry, which may not be available and require further (experimental) efforts. both the top-down and bottom-up approaches result in testable hypotheses and new wet-lab or in silico experiments that may lead to clinically relevant fi ndings. biomedical research and, consequently, systems medicine are increasingly confronted with the management of continuously growing volumes of molecular and clinical data, results of data analyses and in silico experiments, and mathematical models. due fig. 1 the contribution of bioinformatics ( dark grey boxes ) to systems medicine ( black box ). (omics) experiments, patients, and public repositories provide a wide range of data that is used in bioinformatics and systems medicine studies to policies of scientifi c journals and funding agencies, omics data is often made available to the research community via public databases. in addition, a wide range of databases have been developed, of which more than 1550 are currently listed in the molecular biology database collection [ 14 ] providing a rich source of biomedical information. biological repositories do not merely archive data and models but also serve a range of purposes in systems medicine as illustrated below from a few selected examples. 
the main repositories are hosted and maintained by the major bioinformatics institutes including ebi, ncbi, and sib that make a major part of the raw experimental omics data available through a number of primary databases including genbank [ 21 ] , geo [ 22 ] , pride [ 23 ] , and metabolights [ 24 ] for sequence, gene expression, ms-based proteomics, and ms-based metabolomics data, respectively. in addition, many secondary databases provide information derived from the processing of primary data, for example pathway databases (e.g., reactome [ 25 ] , kegg [ 26 ] ), protein sequence databases (e.g., uniprotkb [ 27 ] ), and many others. pathway databases provide an important resource to construct mathematical models used to study and further refi ne biological systems [ 28 , 29 ] . other efforts focus on establishing repositories integrating information from multiple public databases. the integration of pathway databases [ 30 -32 ] , and genome browsers that integrate genetic, omics, and other data with whole-genome sequences [ 33 , 34 ] are two examples of this. joint initiatives of the bioinformatics and systems biology communities resulted in repositories such as biomodels, which contains mathematical models of biochemical and cellular systems [ 35 ] , recon 2 that provides a communitydriven, consensus " metabolic reconstruction " of human metabolism suitable for computational modelling [ 36 ] , and seek, which provides a platform designed for the management and exchange of systems biology data and models [ 37 ] . another example of a database that may prove to be of value for systems medicine studies is malacards , an integrated and annotated compendium of about 17,000 human diseases [ 38 ] . malacards integrates 44 disease sources into disease cards and establishes gene-disease associations through integration with the well-known genecards databases [ 39 , 40 ] . integration with genecards and cross-references within malacards enables the construction of networks of related diseases revealing previously unknown interconnections among diseases, which may be used to identify drugs for off-label use. another class of repositories are (expert-curated) knowledge bases containing domain knowledge and data, which aim to provide a single point of entry for a specifi c domain. contents of these knowledge bases are often based on information extracted (either manually or by text mining) from literature or provided by domain experts [ 41 -43 ] . finally, databases are used routinely in the analysis, interpretation, and validation of experimental data. for example, the gene ontology (go) provides a controlled vocabulary of terms for describing gene products, and is often used in gene set analysis to evaluate expression patterns of groups of genes instead of those of individual genes [ 44 ] and has, for example, been applied to investigate hiv-related cognitive disorders [ 45 ] and polycystic kidney disease [ 46 ] . several repositories such as mir2disease [ 47 ] , peroxisomedb [ 41 ] , and mouse genome informatics (mgi) [ 43 ] include associations between genes and disorders, but only provide very limited phenotypic information. phenotype databases are of particular interest to systems medicine. one well-known phenotype repository is the omim database, which primarily describes single-gene (mendelian) disorders [ 48 ] . 
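the gene set analysis mentioned above, in which expression patterns of groups of genes (for example go categories) are evaluated instead of individual genes, often reduces to an over-representation test. the sketch below shows a minimal one-sided hypergeometric test; the gene identifiers, the go-like category, and the list of differentially expressed genes are all invented for illustration and do not come from any of the cited studies.

```python
from scipy.stats import hypergeom

def gene_set_enrichment(study_genes, gene_set, background):
    """one-sided hypergeometric test for over-representation of a gene set
    (e.g., a go category) in a list of differentially expressed genes."""
    study = set(study_genes) & set(background)
    members = set(gene_set) & set(background)
    overlap = len(study & members)
    M, n, N = len(background), len(members), len(study)
    # p(x >= overlap) under sampling without replacement
    return overlap, hypergeom.sf(overlap - 1, M, n, N)

background = [f"gene{i}" for i in range(2000)]
go_term = background[:50]                        # hypothetical go category
hits = background[:20] + background[1000:1030]   # hypothetical de gene list
k, p = gene_set_enrichment(hits, go_term, background)
print(f"overlap = {k}, hypergeometric p = {p:.2e}")
```

dedicated tools add multiple-testing correction across thousands of categories and handle redundancy between overlapping gene sets, but the core calculation is the one shown here.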
clinvar is another example and provides an archive of reports and evidence of the relationships among medically important human variations found in patient samples and phenotypes [ 49 ] . clinvar complements dbsnp (for singlenucleotide polymorphisms) [ 50 ] and dbvar (for structural variations) [ 51 ] , which both provide only minimal phenotypic information. the integration of these phenotype repositories with genetic and other molecular information will be a major aim for bioinformatics in the coming decade enabling, for example, the identifi cation of comorbidities, determination of associations between gene (mutations) and disease, and improvement of disease classifi cations [ 52 ] . it will also advance the defi nition of the "human phenome," i.e., the set of phenotypes resulting from genetic variation in the human genome. to increase the quality and (clinical) utility of the phenotype and variant databases as an essential step towards reducing the burden of human genetic disease, the human variome project coordinates efforts in standardization, system development, and (training) infrastructure for the worldwide collection and sharing of genetic variations that affect human health [ 53 , 54 ] . to implement and advance systems medicine to the benefi t of patients' health, it is crucial to integrate and analyze molecular data together with de-identifi ed individual-level clinical data complementing general phenotype descriptions. patient clinical data refers to a wide variety of data including basic patient information (e.g., age, sex, ethnicity), outcomes of physical examinations, patient history, medical diagnoses, treatments, laboratory tests, pathology reports, medical images, and other clinical outcomes. inclusion of clinical data allows the stratifi cation of patient groups into more homogeneous clinical subgroups. availability of clinical data will increase the power of downstream data analysis and modeling to elucidate molecular mechanisms, and to identify molecular biomarkers that predict disease onset or progression, or which guide treatment selection. in biomedical studies clinical information is generally used as part of patient and sample selection, but some omics studies also use clinical data as part of the bioinformatics analysis (e.g., [ 9 , 55 ] ). however, in general, clinical data is unavailable from public resources or only provided on an aggregated level. although good reasons exist for making clinical data available (subheading 2.2 ), ethical and legal issues comprising patient and commercial confi dentiality, and technical issues are the most immediate challenges [ 56 , 57 ] . this potentially hampers the development of systems medicine approaches in a clinical setting since sharing and integration of clinical and nonclinical data is considered a basic requirement [ 1 ] . biobanks [ 58 ] such as bbmri [ 59 ] provide a potential source of biological material and associated (clinical) data but these are, generally, not publicly accessible, although permission to access data may be requested from the biobank provider. clinical trials provide another source of clinical data for systems medicine studies, but these are generally owned by a research group or sponsor and not freely available [ 60 ] although ongoing discussions may change this in the future ( [ 61 ] and references therein). 
although clinical data is not yet available on a large scale, the bioinformatics and medical informatics communities have been very active in establishing repositories that provide clinical data. one example is the database of genotypes and phenotypes (dbgap) [ 62 ] developed by the ncbi. study metadata, summarylevel (phenotype) data, and documents related to studies are publicly available. access to de-identifi ed individual-level (clinical) data is only granted after approval by an nih data access committee. another example is the cancer genome atlas (tcga) , which also provides individual-level molecular and clinical data through its own portal and the cancer genomics hub (cghub). clinical data from tcga is available without any restrictions but part of the lower level sequencing and microarray data can only be obtained through a formal request managed by dbgap. medical patient records provide an even richer source of phenotypic information , and has already been used to stratify patient groups, discover disease relations and comorbidity, and integrate these records with molecular data to obtain a systems-level view of phenotypes (for a review see [ 63 ] ). on the one hand, this integration facilitates refi nement and analysis of the human phenome to, for example, identify diseases that are clinically uniform but have different underlying molecular mechanisms, or which share a pathogenetic mechanism but with different genetic cause [ 64 ] . on the other hand, using the same data, a phenome-wide association study ( phewas ) [ 65 ] would allow the identifi cation of unrelated phenotypes associated with specifi c shared genetic variant(s), an effect referred to as pleiotropy. moreover, it makes use of information from medical records generated in routine clinical practice and, consequently, has the potential to strengthen the link between biomedical research and clinical practice [ 66 ] . the power of phenome analysis was demonstrated in a study involving 1.5 million patient records, not including genotype information, comprising 161 disorders. in this study it was shown that disease phenotypes form a highly connected network suggesting a shared genetic basis [ 67 ] . indeed, later studies that incorporated genetic data resulted in similar fi ndings and confi rmed a shared genetic basis for a number of different phenotypes. for example, a recent study identifi ed 63 potentially pleiotropic associations through the analysis of 3144 snps that had previously been implicated by genome-wide association studies ( gwas) as mediators of human traits, and 1358 phenotypes derived from patient records of 13,835 individuals [ 68 ] . this demonstrates that phenotypic information extracted manually or through text mining from patient records can help to more precisely defi ne (relations between) diseases. another example comprises the text mining of psychiatric patient records to discover disease correlations [ 52 ] . here, mapping of disease genes from the omim database to information from medical records resulted in protein networks suspected to be involved in psychiatric diseases. integrative bioinformatics comprises the integrative (statistical) analysis of multiple omics data types. 
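the phewas idea described above, scanning many record-derived phenotypes for association with a single genetic variant, can be sketched in a few lines. the genotypes, the binary phenotype codes, and the spiked-in association below are synthetic placeholders rather than real clinical data; real phewas analyses typically use regression models with covariates rather than a plain contingency test.

```python
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
n_subjects, n_phenotypes = 5000, 200
carrier = rng.random(n_subjects) < 0.2            # carriers of the variant of interest
phenos = pd.DataFrame(rng.random((n_subjects, n_phenotypes)) < 0.05,
                      columns=[f"pheno_{i}" for i in range(n_phenotypes)])
# spike in one true association so the scan has something to find
phenos["pheno_0"] |= carrier & (rng.random(n_subjects) < 0.10)

records = []
for name, has_pheno in phenos.items():
    table = [[np.sum(carrier & has_pheno), np.sum(carrier & ~has_pheno)],
             [np.sum(~carrier & has_pheno), np.sum(~carrier & ~has_pheno)]]
    odds, p = fisher_exact(table)
    records.append((name, odds, p))

res = pd.DataFrame(records, columns=["phenotype", "odds_ratio", "p"])
res["q"] = multipletests(res["p"], method="fdr_bh")[1]   # control fdr across the phenome
print(res.sort_values("q").head())
```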
many studies demonstrated that using a single omics technology to measure a specifi c molecular level (e.g., dna variation, expression of genes and proteins, metabolite concentrations, epigenetic modifi cations) already provides a wealth of information that can be used for unraveling molecular mechanisms underlying disease. moreover, single-omics disease signatures which combine multiple (e.g., gene expression) markers have been constructed to differentiate between disease subtypes to support diagnosis and prognosis. however, no single technology can reveal the full complexity and details of molecular networks observed in health and disease due to the many interactions across these levels. a systems medicine strategy should ideally aim to understand the functioning of the different levels as a whole by integrating different types of omics data. this is expected to lead to biomarkers with higher predictive value, and novel disease insights that may help to prevent disease and to develop new therapeutic approaches. integrative bioinformatics can also facilitate the prioritization and characterization of genetic variants associated with complex human diseases and traits identifi ed by gwas in which hundreds of thousands to over a million snps are assayed in a large number of individuals. although such studies lack the statistical power to identify all disease-associated loci [ 69 ] , they have been instrumental in identifying loci for many common diseases. however, it remains diffi cult to prioritize the identifi ed variants and to elucidate their effect on downstream pathways ultimately leading to disease [ 70 ] . consequently, methods have been developed to prioritize candidate snps based on integration with other (omics) data such as gene expression, dnase hypersensitive sites, histone modifi cations, and transcription factor-binding sites [ 71 ] . the integration of multiple omics data types is far from trivial and various approaches have been proposed [ 72 -74 ] . one approach is to link different types of omics measurements through common database identifi ers. although this may seem straightforward, in practice this is complicated as a result of technical and standardization issues as well as a lack of biological consensus [ 32 , 75 -77 ] . moreover, the integration of data at the level of the central dogma of molecular biology and, for example, metabolite data is even more challenging due to the indirect relationships between genes, transcripts, and proteins on the one hand and metabolites on the other hand, precluding direct links between the database identifi ers of these molecules. statistical data integration [ 72 ] is a second commonly applied strategy, and various approaches have been applied for the joint analysis of multiple data types (e.g., [ 78 , 79 ] ). one example of statistical data integration is provided by a tcga study that measured various types of omics data to characterize breast cancer [ 80 ] . in this study 466 breast cancer samples were subjected to whole-genome and -exome sequencing, and snp arrays to obtain information about somatic mutations, copy number variations, and chromosomal rearrangements. microarrays and rna-seq were used to determine mrna and microrna expression levels, respectively. reverse-phase protein arrays (rppa) and dna methylation arrays were used to obtain data on protein expression levels and dna methylation, respectively. 
simultaneous statistical analysis of different data types via a "cluster-of-clusters" approach using consensus clustering on a multi-omics data matrix revealed that four major breast cancer subtypes could be identifi ed. this showed that the intrinsic subtypes (basal, luminal a and b, her2) that had previously been determined using gene expression data only could be largely confi rmed in an integrated analysis of a large number of breast tumors. single-level omics data has extensively been used to identify disease-associated biomarkers such as genes, proteins, and metabolites. in fact, these studies led to more than 150,000 papers documenting thousands of claimed biomarkers, however, it is estimated that fewer than 100 of these are currently used for routine clinical practice [ 81 ] . integration of multiple omics data types is expected to result in more robust and predictive disease profi les since these better refl ect disease biology [ 82 ] . further improvement of these profi les may be obtained through the explicit incorporation of interrelationships between various types of measurements such as microrna-mrna target, or gene methylation-microrna (based on a common target gene). this was demonstrated for the prediction of short-term and long-term survival from serous cystadenocarcinoma tcga data [ 83 ] . according to the recent casym roadmap : "human disease can be perceived as perturbations of complex, integrated genetic, molecular and cellular networks and such complexity necessitates a new approach." [ 84 ] . in this section we discuss how (approximations) to these networks can be constructed from omics data and how these networks can be decomposed in smaller modules. then we discuss how the resulting modules can be used to generate experimentally testable hypotheses, provide insight into disease mechanisms, lead to predictive diagnostic and prognostic models, and help to further subclassify diseases [ 55 , 85 ] (fig. 2 ) network-based approaches will provide medical doctors with molecular level support to make personalized treatment decisions. in a top-down approach the aim of network reconstruction is to infer the connections between the molecules that constitute a biological network. network models can be created using a variety of mathematical and statistical techniques and data types. early approaches for network inference (also called reverse engineering ) used only gene expression data to reconstruct gene networks. here, we discern three types of gene network inference algorithms using methods based on (1) correlation-based approaches, (2) information-theoretic approaches, and (3) bayesian networks [ 86 ] . co-expression networks are an extension of commonly used clustering techniques , in which genes are connected by edges in a network if the amount of correlation of their gene expression profi les exceeds a certain value. co-expression networks have been shown to connect functionally related genes [ 87 ] . note that connections in a co-expression network correspond to either direct (e.g., transcription factor-gene and protein-protein) or indirect (e.g., proteins participating in the same pathway) interactions. in one of the earliest examples of this approach, pair-wise correlations were calculated between gene expression profi les and the level of growth inhibition caused by thousands of tested anticancer agents, for 60 cancer cell lines [ 88 ] . 
removal of associations weaker than a certain threshold value resulted in networks consisting of highly correlated genes and agents, called relevance networks, which led to targeted hypotheses for potential single-gene determinants of chemotherapeutic susceptibility. information-theoretic approaches have been proposed in order to capture nonlinear dependencies assumed to be present in most biological systems and that cannot be captured by correlation-based distance measures . these approaches often use the concept of mutual information, a generalization of the correlation coeffi cient which quantifi es the degree of statistical (in)dependence. an example of a network inference method that is based on mutual information is aracne, which has been used to reconstruct the human b-cell gene network from a large compendium of human b-cell gene expression profi les [ 89 ] . in order to discover regulatory interactions, aracne removes the majority of putative indirect interactions from the initial mutual information-based gene network using a theorem from information theory, the data processing inequality. this led to the identifi cation of myc as a major hub in the b-cell gene network and a number of novel myc target genes, which were experimentally validated. whether informationtheoretic approaches are more powerful in general than correlationbased approaches is still subject of debate [ 90 ] . bayesian networks allow the description of statistical dependencies between variables in a generic way [ 91 , 92 ] . bayesian networks are directed acyclic networks in which the edges of the network represent conditional dependencies; that is, nodes that are not connected represent variables that are conditionally independent of each other. a major bottleneck in the reconstruction of bayesian networks is their computational complexity. moreover, bayesian networks are acyclic and cannot capture feedback loops that characterize many biological networks. when time-series rather than steady-state data is available, dynamic bayesian networks provide a richer framework in which cyclic networks can be reconstructed [ 93 ] . gene (co-)expression data only offers a partial view on the full complexity of cellular networks. consequently, networks have also been constructed from other types of high-throughput data. for example, physical protein-protein interactions have been measured on a large scale in different organisms including human, using affi nity capture-mass spectrometry or yeast two-hybrid screens, and have been made available in public databases such as biogrid [ 94 ] . regulatory interactions have been probed using chromatin immunoprecipitation sequencing (chip-seq) experiments, for example by the encode consortium [ 95 ] . using probabilistic techniques , heterogeneous types of experimental evidence and prior knowledge have been integrated to construct functional association networks for human [ 96 ] , mouse [ 97 ] , and, most comprehensively, more than 1100 organisms in the string database [ 98 ] . functional association networks can help predict novel pathway components, generate hypotheses for biological functions for a protein of interest, or identify disease-related genes [ 97 ] . prior knowledge required for these approaches is, for example, available in curated biological pathway databases, and via protein associations predicted using text mining based on their cooccurrence in abstracts or even full-text articles. 
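the relevance-network construction described above, computing all pairwise associations and keeping only those above a threshold, can be mimicked with a small amount of code. the expression matrix, the planted co-expressed block, and the cutoff of 0.7 below are illustrative assumptions; mutual-information methods such as aracne or clr would replace the correlation step with an information-theoretic score and prune indirect edges.

```python
import numpy as np
import pandas as pd
import networkx as nx

rng = np.random.default_rng(2)
# synthetic expression: 100 genes x 30 samples, with one correlated block
expr = pd.DataFrame(rng.normal(size=(100, 30)),
                    index=[f"gene_{i}" for i in range(100)])
expr.iloc[:10] += rng.normal(size=30)            # shared signal -> co-expressed module

corr = expr.T.corr(method="pearson")             # gene-by-gene correlation matrix

threshold = 0.7                                  # illustrative cutoff
g = nx.Graph()
genes = corr.index.tolist()
for i, a in enumerate(genes):
    for b in genes[i + 1:]:
        r = corr.loc[a, b]
        if abs(r) >= threshold:                  # keep only strong associations
            g.add_edge(a, b, weight=float(r))

print(f"{g.number_of_nodes()} nodes, {g.number_of_edges()} edges at |r| >= {threshold}")
```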
many more integrative network inference methods have been proposed; for a review see [ 99 ] . the integration of gene expression data with chip data [ 100 ] or transcription factor-binding motif data [ 101 ] has shown to be particularly fruitful for inferring transcriptional regulatory networks. recently, li et al. [ 102 ] described the results from a regression-based model that predicts gene expression using encode (chip-seq) and tcga data (mrna expression data complemented with copy number variation, dna methylation, and microrna expression data). this model infers the regulatory activities of expression regulators and their target genes in acute myeloid leukemia samples. eighteen key regulators were identifi ed, whose activities clustered consistently with cytogenetic risk groups. bayesian networks have also been used to integrate multiomics data. the combination of genotypic and gene expression data is particularly powerful, since dna variations represent naturally occurring perturbations that affect gene expression detected as expression quantitative trait loci ( eqtl ). cis -acting eqtls can then be used as constraints in the construction of directed bayesian networks to infer causal relationships between nodes in the network [ 103 ] . large multi-omics datasets consisting of hundreds or sometimes even thousands of samples are available for many commonly occurring human diseases, such as most tumor types (tcga), alzheimer's disease [ 104 ] , and obesity [ 105 ] . however, a major bottleneck for the construction of accurate gene networks is that the number of gene networks that are compatible with the experimental data is several orders of magnitude larger still. in other words, top-down network inference is an underdetermined problem with many possible solutions that explain the data equally well and individual gene-gene interactions are characterized by a high false-positive rate [ 99 ] . most network inference methods therefore try to constrain the number of possible solutions by making certain assumptions about the structure of the network. perhaps the most commonly used strategy to harness the complexity of the gene network inference problem is to analyze experimental data in terms of biological modules, that is, sets of genes that have strong interactions and a common function [ 106 ] . there is considerable evidence that many biological networks are modular [ 107 ] . module-based approaches effectively constrain the number of parameters to estimate and are in general also more robust to the noise that characterizes high-throughput omics measurements. a detailed review of module-based techniques is outside the scope of this chapter (see, for example [ 108 ] ), but we would like to mention a few examples of successful and commonly used modular approaches. weighted gene co-expression network analysis ( wgcna) decomposes a co-expression network into modules using clustering techniques [ 109 ] . modules can be summarized by their module eigengene, a weighted average expression profi le of all gene member of a given module. eigengenes can then be correlated with external sample traits to identify modules that are related with these traits. parikshak et al. [ 110 ] used wgcna to extract modules from a co-expression network constructed using fetal and early postnatal brain development expression data. next, they established that several of these modules were enriched for genes and rare de novo variants implicated in autism spectrum disorder (asd). 
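before returning to the asd example, the module eigengene idea can be made concrete. the sketch below is a simplified stand-in for the wgcna r package: it clusters genes on correlation distance (rather than wgcna's topological overlap), summarizes each module by its first principal component, and correlates that eigengene with a sample trait. all data, the number of modules, and the trait are synthetic.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import pearsonr
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n_genes, n_samples = 200, 40
trait = rng.normal(size=n_samples)                       # external sample trait
expr = pd.DataFrame(rng.normal(size=(n_genes, n_samples)),
                    index=[f"gene_{i}" for i in range(n_genes)])
expr.iloc[:30] += 0.8 * trait                            # module 1 tracks the trait

# cluster genes on correlation distance (1 - r), a simplified stand-in for wgcna's tom
dist = 1 - expr.T.corr().values
condensed = dist[np.triu_indices(n_genes, k=1)]
modules = fcluster(linkage(condensed, method="average"), t=4, criterion="maxclust")

for m in np.unique(modules):
    member_expr = expr.values[modules == m]
    # module eigengene: first principal component across samples
    eigengene = PCA(n_components=1).fit_transform(member_expr.T).ravel()
    r, p = pearsonr(eigengene, trait)
    print(f"module {m}: {member_expr.shape[0]} genes, |r| with trait = {abs(r):.2f} (p = {p:.1e})")
```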
moreover, the asd-associated modules are also linked at the transcriptional level and 17 transcription factors were found acting as putative co-regulators of asd-associated gene modules during neocortical development. wgcna can also be used when multiple omics data types are available. one example of such an approach involved the integration of transcriptomic and proteomic data from a study investigating the response to sars-cov infection in mice [ 111 ] . in this study wgcna-based gene and protein co-expression modules were constructed and integrated to obtain module-based disease signatures. interestingly, the authors found several cases of identifi er-matched transcripts and proteins that correlated well with the phenotype, but which showed poor or anticorrelation across these two data types. moreover, the highest correlating transcripts and peptides were not the most central ones in the co-expression modules. vice versa , the transcripts and proteins that defi ned the modules were not those with the highest correlation to the phenotype. at the very least this shows that integration of omics data affects the nature of the disease signatures. identifi cation of active modules is another important integrative modular technique . here, experimental data in the form of molecular profi les is projected onto a biological network, for example a protein-protein interaction network. active modules are those subnetworks that show the largest change in expression for a subset of conditions and are likely to contain key drivers or regulators of those processes perturbed in the experiment. active modules have, for example, been used to fi nd a subnetwork that is overexpressed in a particularly aggressive lymphoma subtype [ 112 ] and to detect signifi cantly mutated pathways [ 113 ] . some active module approaches integrate various types of omics data. one example of such an approach is paradigm [ 114 ] , which translates pathways into factor graphs, a class of models that belongs to the same family of models as bayesian networks, and determines sample-specifi c pathway activity from multiple functional genomic datasets. paradigm has been used in several tcga projects, for example, in the integrated analysis of 131 urothelial bladder carcinomas [ 55 ] . paradigm-based analysis of copy number variations and rna-seq gene expression in combination with a propagation-based network analysis algorithm revealed novel associations between mutations and gene expression levels, which subsequently resulted in the identifi cation of pathways altered in bladder cancer. the identifi cation of activating or inhibiting gene mutations in these pathways suggested new targets for treatment. moreover, this effort clearly showed the benefi ts of screening patients for the presence of specifi c mutations to enable personalized treatment strategies. often, published disease signatures cannot be replicated [ 81 ] or provide hardly additional biological insight. also here (modular) network-based approaches have been proposed to alleviate these problems. a common characteristic of most methods is that the molecular activity of a set of genes is summarized on a per sample basis. summarized gene set scores are then used as features in prognostic and predictive models. relevant gene sets can be based on prior knowledge and correspond to canonical pathways, gene ontology categories, or sets of genes sharing common motifs in their promoter regions [ 115 ] . 
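a minimal per-sample gene set score, here simply the mean z-score of the member genes, is one of the simplest ways to turn prior-knowledge gene sets into features for a prognostic or predictive model; the gene sets and expression values below are synthetic placeholders, and published methods use more elaborate summaries such as single-sample enrichment scores.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
genes = [f"gene_{i}" for i in range(500)]
expr = pd.DataFrame(rng.normal(size=(500, 60)), index=genes,
                    columns=[f"patient_{j}" for j in range(60)])

# hypothetical prior-knowledge gene sets (e.g., canonical pathways)
gene_sets = {"pathway_A": genes[:25], "pathway_B": genes[100:140]}

# z-score each gene across patients, then average within each set per patient
z = expr.sub(expr.mean(axis=1), axis=0).div(expr.std(axis=1), axis=0)
set_scores = pd.DataFrame({name: z.loc[members].mean(axis=0)
                           for name, members in gene_sets.items()})

# set_scores (patients x gene sets) can now serve as features for a predictive model
print(set_scores.head())
```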
gene set scores can also be determined by projecting molecular data onto a biological network and summarizing scores at the level of subnetworks for each individual sample [ 116 ] . while promising in principle, it is still subject of debate whether gene set-based models outperform gene-based one s [ 117 ] . the comparative analysis of networks across different species is another commonly used approach to constrain the solution space. patterns conserved across species have been shown to be more likely to be true functional interactions [ 107 ] and to harbor useful candidates for human disease genes [ 118 ] . many network alignment methods have been developed in the past decade to identify commonalities between networks. these methods in general combine sequence-based and topological constraints to determine the optimal alignment of two (or more) biological networks. network alignment has, for example, been applied to detect conserved patterns of protein interaction in multiple species [ 107 , 119 ] and to analyze the evolution of co-expression networks between humans and mice [ 120 , 121 ] . network alignment can also be applied to detect diverged patterns [ 120 ] and may thus lead to a better understanding of similarities and differences between animal models and human in health and disease. information from model organisms has also been fruitfully used to identify more robust disease signatures [ 122 -125 ] . sweet-cordero and co-workers [ 122 ] used a gene signature identifi ed in a mouse model of lung adenocarcinoma to uncover an orthologous signature in human lung adenocarcinoma that was not otherwise apparent. bild et al. [ 123 ] defi ned gene expression signatures characterizing several oncogenic pathways of human mammary epithelial cells. they showed that these signatures predicted pathway activity in mouse and human tumors. predictions of pathway activity correlated well with the sensitivity to drugs targeting those pathways and could thus serve as a guide to targeted therapies. a generic approach, pathprint, for the integration of gene expression data across different platforms and species at the level of pathways, networks, and transcriptionally regulated targets was recently described [ 126 ] . the authors used their method to identify four stem cell-related pathways conserved between human and mouse in acute myeloid leukemia, with good prognostic value in four independent clinical studies. we reviewed a wide array of different approaches showing how networks can be used to elucidate integrated genetic, molecular, and cellular networks. however, in general no single approach will be suffi cient and combining different approaches in more complex analysis pipelines will be required. this is fi ttingly illustrated by the diggit (driver-gene inference by genetical-genomics and information theory) algorithm [ 127 ] . in brief, diggit identities candidate master regulators from an aracne gene co-expression network integrated with copy number variations that affect gene expression. this method combines several previously developed computational approaches and was used to identify causal genetic drivers of human disease in general and glioblastoma, breast cancer, and alzheimer's disease in particular. this enabled identifi cation of klhl9 deletions as upstream activators of two previously established master regulators in a specifi c subtype of glioblastoma. 
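the cross-species comparison idea can be sketched by projecting one species' co-expression edges into the other species' gene space through an ortholog map and asking which edges are conserved. the ortholog table and both edge lists below are invented for illustration; real network alignment methods additionally weigh sequence similarity and network topology.

```python
# hypothetical human and mouse co-expression edges plus an ortholog map
human_edges = {frozenset(e) for e in [("TP53", "MDM2"), ("MYC", "MAX"), ("IL6", "STAT3")]}
mouse_edges = {frozenset(e) for e in [("Trp53", "Mdm2"), ("Myc", "Max"), ("Il6", "Socs3")]}
ortholog = {"Trp53": "TP53", "Mdm2": "MDM2", "Myc": "MYC",
            "Max": "MAX", "Il6": "IL6", "Socs3": "SOCS3"}

# project mouse edges into human gene space, dropping pairs without orthologs
projected = {frozenset(ortholog[g] for g in edge)
             for edge in mouse_edges if all(g in ortholog for g in edge)}

conserved = human_edges & projected      # edges supported in both species
diverged = human_edges ^ projected       # edges seen in only one species
print("conserved:", [tuple(e) for e in conserved])
print("diverged:", [tuple(e) for e in diverged])
```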
systems medicine is one of the steps necessary to make improvements in the prevention and treatment of disease through systems approaches that will (a) elucidate (patho)physiologic mechanisms in much greater detail than currently possible, (b) produce more robust and predictive disease signatures, and (c) enable personalized treatment. in this context, we have shown that bioinformatics has a major role to play. bioinformatics will continue its role in the development, curation, integration, and maintenance of (public) biological and clinical databases to support biomedical research and systems medicine. the bioinformatics community will strengthen its activities in various standardization and curation efforts that already resulted in minimum reporting guidelines [ 128 ] , data capture approaches [ 75 ] , data exchange formats [ 129 ] , and terminology standards for annotation [ 130 ] . one challenge for the future is to remove errors and inconsistencies in data and annotation from databases and prevent new ones from being introduced [ 32 , 76 , 131 -135 ]. an equally important challenge is to establish, improve, and integrate resources containing phenotype and clinical information. to achieve this objective it seems reasonable that bioinformatics and health informatics professionals team up [ 136 -138 ] . traditionally health informatics professionals have focused on hospital information systems (e.g., patient records, pathology reports, medical images) and data exchange standards (e.g., hl7), medical terminology standards (e.g., international classifi cation of disease (icd), snomed), medical image analysis, analysis of clinical data, clinical decision support systems, and so on. while, on the other hand, bioinformatics mainly focused on molecular data, it shares many approaches and methods with health informatics. integration of these disciplines is therefore expected to benefi t systems medicine in various ways [ 139 ] . integrative bioinformatics approaches clearly have added value for systems medicine as they provide a better understanding of biological systems, result in more robust disease markers, and prevent (biological) bias that would possibly occur from using single-omics measurements. however, such studies, and the scientifi c community in general, would benefi t from improved strategies to disseminate and share data which typically will be produced at multiple research centers (e.g., https://www.synapse.org ; [ 140 ] ). integrative studies are expected to increasingly facilitate personalized medicine approaches such as demonstrated by chen and coworkers [ 141 ] . in their study they presented a 14-month "integrative personal omics profi le" (ipop) for a single individual comprising genomic, transcriptomic, proteomic, metabolomic, and autoantibody data. from the whole-genome sequence data an elevated risk for type 2 diabetes (t2d) was detected, and subsequent monitoring of hba1c and glucose levels revealed the onset of t2d, despite the fact that the individual lacked many of the known non-genetic risk factors. subsequent treatment resulted in a gradual return to the normal phenotype. this shows that the genome sequence can be used to determine disease risk in a healthy individual and allows selecting and monitoring specifi c markers that provide information about the actual disease status. network-based approaches will increasingly be used to determine the genetic causes of human diseases. 
since the effect of a genetic variation is often tissue or cell-type specifi c, a large effort is needed in constructing cell-type-specifi c networks both in health and disease. this can be done using data already available, an approach taken by guan et al. [ 142 ] . the authors proposed 107 tissue-specifi c networks in mouse via their generic approach for constructing functional association networks using lowthroughput, highly reliable tissue-specifi c gene expression information as a constraint. one could also generate new datasets to facilitate the construction of tissue-specifi c networks. examples of such approaches are tcga and the genotype-tissue expression (gtex) project. the aim of gtex is to create a data resource for the systematic study of genetic variation and its effect on gene expression in more than 40 human tissues [ 143 ] . regardless of the way how networks are constructed, it will become more and more important to offer a centralized repository where networks from different cell types and diseases can be stored and accessed. nowadays, these networks are diffi cult to retrieve and are scattered in supplementary fi les with the original papers, links to accompanying web pages, or even not available at all. a resource similar to what the systems biology community has created with the biomodels database would be a great leap forward. there have been some initial attempts in building databases of network models, for example the cellcircuits database [ 123 ] ( http://www.cellcircuits.org ) and the causal biological networks (cbn) database of networks related to lung disease [ 144 ] ( http://causalbionet.com ). however, these are only small-scale initiatives and a much larger and coordinated effort is required. another main bottleneck for the successful application of network inference methods is their validation. most network inference methods to date have been applied to one or a few isolated datasets and were validated using some limited follow-up experiments, for example via gene knockdowns, using prior knowledge from databases and literature as a gold standard, or by generating simulated data from a mathematical model of the underlying network [ 145 , 146 ] . however, strengths and weaknesses of network inference methods across cell types, diseases, and species have hardly been assessed. notable exceptions are collaborative competitions such as the dialogue on reverse engineering assessment and methods (dream) [ 147 ] and industrial methodology for process verifi cation (improver) [ 146 ] . these centralized initiatives propose challenges in which individual research groups can participate and to which they can submit their predictions, which can then be independently validated by the challenge organizers. several dream challenges in the area of network inference have been organized, leading to a better insight into the strengths and weaknesses of individual methods [ 148 ] . another important contribution of dream is that a crowd-based approach integrating predictions from multiple network inference methods was shown to give good and robust performance across diverse data sets [ 149 ] . also in the area of systems medicine challenge-based competitions may offer a framework for independent verifi cation of model predictions. systems medicine promises a more personalized medicine that effectively exploits the growing amount of molecular and clinical data available for individual patients. 
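the crowd-based integration mentioned above can be sketched as simple rank aggregation: each method ranks candidate edges by its own confidence score, and the community network orders edges by their average rank. the three methods and all scores below are made-up placeholders.

```python
import pandas as pd

# hypothetical edge confidence scores from three inference methods (higher = more confident)
scores = pd.DataFrame({
    "corr":   {("A", "B"): 0.90, ("A", "C"): 0.20, ("B", "C"): 0.75, ("C", "D"): 0.10},
    "mi":     {("A", "B"): 0.60, ("A", "C"): 0.55, ("B", "C"): 0.80, ("C", "D"): 0.05},
    "forest": {("A", "B"): 0.85, ("A", "C"): 0.30, ("B", "C"): 0.70, ("C", "D"): 0.40},
})

# rank edges within each method (1 = best), then average ranks across methods
ranks = scores.rank(ascending=False, axis=0)
consensus = ranks.mean(axis=1).sort_values()
print(consensus)   # community network: edges ordered by average rank across methods
```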
solid bioinformatics approaches are of crucial importance for the success of systems medicine. however, really delivering the promises of systems medicine will require an overall change of research approach that transcends the current reductionist approach and results in a tighter integration of clinical, wet-lab laboratory, and computational groups adopting a systems-based approach. past, current, and future success of systems medicine will accelerate this change. the road from systems biology to systems medicine participatory medicine: a driving force for revolutionizing healthcare understanding drugs and diseases by systems biology the roots of bioinformatics in theoretical biology sequencing technologies -the next generation exploring the new world of the genome with dna microarrays spectroscopic and statistical techniques for information recovery in metabonomics and metabolomics next-generation technologies and data analytical approaches for epigenomics gene expression profi ling predicts clinical outcome of breast cancer diagnostic tests based on gene expression profi le in breast cancer: from background to clinical use a multigene assay to predict recurrence of tamoxifentreated, node-negative breast cancer what is bioinformatics? a proposed defi nition and overview of the fi eld the importance of biological databases in biological discovery the 2014 nucleic acids research database issue and an updated nar online molecular biology database collection reuse of public genome-wide gene expression data experimental design for gene expression microarrays learning from our gwas mistakes: from experimental design to scientifi c method effi cient experimental design and analysis strategies for the detection of differential expression using rna-sequencing impact of yeast systems biology on industrial biotechnology the nature of systems biology gene expression omnibus: microarray data storage, submission, retrieval, and analysis the proteomics identifi cations (pride) database and associated tools: status in 2013 metabolights--an open-access generalpurpose repository for metabolomics studies and associated meta-data the reactome pathway knowledgebase data, information, knowledge and principle: back to metabolism in kegg activities at the universal protein resource (uniprot) path2models: large-scale generation of computational models from biochemical pathway maps precise generation of systems biology models from kegg pathways pathguide: a pathway resource list pathway commons, a web resource for biological pathway data consensus and confl ict cards for metabolic pathway databases the ucsc genome browser database: 2014 update biomodels database: a repository of mathematical models of biological processes a community-driven global reconstruction of human metabolism the seek: a platform for sharing data and models in systems biology malacards: an integrated compendium for diseases and their annotation genecards version 3: the human gene integrator in-silico human genomics with genecards peroxisomedb 2.0: an integrative view of the global peroxisomal metabolome the mouse age phenome knowledgebase and disease-specifi c inter-species age mapping searching the mouse genome informatics (mgi) resources for information on mouse biology from genotype to phenotype gene-set approach for expression pattern analysis systems analysis of human brain gene expression: mechanisms for hiv-associated neurocognitive impairment and common pathways with alzheimer's disease systems biology approach to identify 
transcriptome reprogramming and candidate microrna targets during the progression of polycystic kidney disease mir2disease: a manually curated database for microrna deregulation in human disease a new face and new challenges for online mendelian inheritance in man (omim(r)) clinvar: public archive of relationships among sequence variation and human phenotype searching ncbi's dbsnp database dbvar and dgva: public archives for genomic structural variation using electronic patient records to discover disease correlations and stratify patient cohorts on not reinventing the wheel beyond the genomics blueprint: the 4th human variome project meeting comprehensive molecular characterization of urothelial bladder carcinoma open clinical trial data for all? a view from regulators clinical trial data as a public good biobanking for europe whose data set is it anyway? sharing raw data from randomized trials sharing individual participant data from clinical trials: an opinion survey regarding the establishment of a central repository ncbi's database of genotypes and phenotypes: dbgap mining electronic health records: towards better research applications and clinical care phenome connections phewas: demonstrating the feasibility of a phenome-wide scan to discover genedisease associations mining the ultimate phenome repository probing genetic overlap among complex human phenotypes systematic comparison of phenomewide association study of electronic medical record data and genome-wide association study data finding the missing heritability of complex diseases systems genetics: from gwas to disease pathways a review of post-gwas prioritization approaches when one and one gives more than two: challenges and opportunities of integrative omics the model organism as a system: integrating 'omics' data sets principles and methods of integrative genomic analyses in cancer toward interoperable bioscience data critical assessment of human metabolic pathway databases: a stepping stone for future integration the bridgedb framework: standardized access to gene, protein and metabolite identifi er mapping services integration of transcriptomics and metabonomics: improving diagnostics, biomarker identifi cation and phenotyping in ulcerative colitis a multivariate approach to the integration of multi-omics datasets comprehensive molecular portraits of human breast tumours bring on the biomarkers assessing the clinical utility of cancer genomic and proteomic data across tumor types incorporating inter-relationships between different levels of genomic data into cancer clinical outcome prediction the casym roadmap: implementation of systems medicine across europe molecular classifi cation of cancer: class discovery and class prediction by gene expression monitoring how to infer gene networks from expression profi les coexpression analysis of human genes across many microarray data sets discovering functional relationships between rna expression and chemotherapeutic susceptibility using relevance networks reverse engineering of regulatory networks in human b cells comparison of co-expression measures: mutual information, correlation, and model based indices using bayesian networks to analyze expression data probabilistic graphical models: principles and techniques. 
adaptive computation and machine learning inferring gene networks from time series microarray data using dynamic bayesian networks the biogrid interaction database: 2013 update architecture of the human regulatory network derived from encode data reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes a genomewide functional network for the laboratory mouse string v9.1: protein-protein interaction networks, with increased coverage and integration advantages and limitations of current network inference methods computational discovery of gene modules and regulatory networks a semisupervised method for predicting transcription factor-gene interactions in escherichia coli regression analysis of combined gene expression regulation in acute myeloid leukemia integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks integrated systems approach identifi es genetic nodes and networks in late-onset alzheimer's disease a survey of the genetics of stomach, liver, and adipose gene expression from a morbidly obese cohort an introduction to systems biology: design principles of biological circuits from signatures to models: understanding cancer using microarrays integrative approaches for fi nding modular structure in biological networks weighted gene coexpression network analysis: state of the art integrative functional genomic analyses implicate specifi c molecular pathways and circuits in autism multi-omic network signatures of disease identifying functional modules in protein-protein interaction networks: an integrated exact approach algorithms for detecting signifi cantly mutated pathways in cancer inference of patient-specifi c pathway activities from multi-dimensional cancer genomics data using paradigm pathway-based personalized analysis of cancer network-based classifi cation of breast cancer metastasis current composite-feature classifi cation methods do not outperform simple singlegenes classifi ers in breast cancer prognosis prediction of human disease genes by humanmouse conserved coexpression analysis a comparison of algorithms for the pairwise alignment of biological networks cross-species analysis of biological networks by bayesian alignment graphalignment: bayesian pairwise alignment of biological networks an oncogenic kras2 expression signature identifi ed by cross-species gene-expression analysis oncogenic pathway signatures in human cancers as a guide to targeted therapies interspecies translation of disease networks increases robustness and predictive accuracy integrated cross-species transcriptional network analysis of metastatic susceptibility pathprinting: an integrative approach to understand the functional basis of disease identifi cation of causal genetic drivers of human disease through systems-level analysis of regulatory networks promoting coherent minimum reporting guidelines for biological and biomedical investigations: the mibbi project data standards for omics data: the basis of data sharing and reuse biomedical ontologies: a functional perspective pdb improvement starts with data deposition what we do not know about sequence analysis and sequence databases annotation error in public databases: misannotation of molecular function in enzyme superfamilies improving the description of metabolic networks: the tca cycle as example more than 1,001 problems with protein domain databases: transmembrane regions, signal peptides and the issue of sequence homology biomedical and 
health informatics in translational medicine amia board white paper: defi nition of biomedical informatics and specifi cation of core competencies for graduate education in the discipline synergy between medical informatics and bioinformatics: facilitating genomic medicine for future health care elixir: a distributed infrastructure for european biological data enabling transparent and collaborative computational analysis of 12 tumor types within the cancer genome atlas personal omics profi ling reveals dynamic molecular and medical phenotypes tissue-specifi c functional networks for prioritizing phenotype and disease genes the genotype-tissue expression (gtex) project on crowd-verifi cation of biological networks inference and validation of predictive gene networks from biomedical literature and gene expression data verifi cation of systems biology research in the age of collaborative competition dialogue on reverse-engineering assessment and methods: the dream of highthroughput pathway inference revealing strengths and weaknesses of methods for gene network inference wisdom of crowds for robust gene network inference we would like to thank dr. aldo jongejan for his comments that improved the text. key: cord-003887-4grjr0h3 authors: mcclure, ryan s.; wendler, jason p.; adkins, joshua n.; swanstrom, jesica; baric, ralph; kaiser, brooke l. deatherage; oxford, kristie l.; waters, katrina m.; mcdermott, jason e. title: unified feature association networks through integration of transcriptomic and proteomic data date: 2019-09-17 journal: plos comput biol doi: 10.1371/journal.pcbi.1007241 sha: doc_id: 3887 cord_uid: 4grjr0h3 high-throughput multi-omics studies and corresponding network analyses of multi-omic data have rapidly expanded their impact over the last 10 years. as biological features of different types (e.g. transcripts, proteins, metabolites) interact within cellular systems, the greatest amount of knowledge can be gained from networks that incorporate multiple types of -omic data. however, biological and technical sources of variation diminish the ability to detect cross-type associations, yielding networks dominated by communities comprised of nodes of the same type. we describe here network building methods that can maximize edges between nodes of different data types leading to integrated networks, networks that have a large number of edges that link nodes of different–omic types (transcripts, proteins, lipids etc). we systematically rank several network inference methods and demonstrate that, in many cases, using a random forest method, genie3, produces the most integrated networks. this increase in integration does not come at the cost of accuracy as genie3 produces networks of approximately the same quality as the other network inference methods tested here. using genie3, we also infer networks representing antibody-mediated dengue virus cell invasion and receptor-mediated dengue virus invasion. a number of functional pathways showed centrality differences between the two networks including genes responding to both gm-csf and il-4, which had a higher centrality value in an antibody-mediated vs. receptor-mediated dengue network. because a biological system involves the interplay of many different types of molecules, incorporating multiple data types into networks will improve their use as models of biological systems. 
the methods explored here are some of the first to specifically highlight and address the challenges associated with how such multi-omic networks can be assembled and how the greatest number of interactions can be inferred from different data types. the resulting networks can lead to the discovery of new host response patterns and interactions during viral infection, generate new hypotheses of pathogenic mechanisms and confirm mechanisms of disease. over the last several decades, a number of multi-omic experimental methods have emerged to study host-pathogen interactions [1] and other complex biological systems. these studies focus on collecting a particular type of molecule from a biological system and using high-throughput methods to query abundance levels across changing conditions. these empirical methods include transcriptomics [2] [3] [4] [5] , proteomics [6] [7] [8] , metabolomics [9, 10] , and lipidomics [11] [12] [13] . host-pathogen interactions, including immunological responses and virulence factors, represent a particularly complex biological site where multiple different types of biomolecules are known to play important roles. therefore, some of the most critical insights and translational conclusions can be gained from combining different types of -omics-based studies. kocharunchitt et al. examined the response of escherichia coli o157:h7 to varying water temperatures, emulating host interactions, by collecting proteomic and transcriptomic data and using both to query responses of a single parent gene [14] . dapat and oshitani linked host and viral proteins of respiratory syncytial virus based on creating networks of protein-protein interactions from pre-existing databases and then overlaying abundance data from transcriptomic and proteomic studies onto these networks to identify hubs that may alter their expression [15] . other studies have used transcriptomics, proteomics, and lipidomics in integrated networks to elucidate interactions between hepatitis c virus and the host [16, 17] or to better understand bacterial virulence programs [18, 19] . several multi-omic studies have reported that, globally, there seems to be poor correlation between different data types (transcriptomic or proteomic, for example) that are derived from the same gene [20] [21] [22] . although expression heterogeneity (batch effects) can be an important driver of poor cross-class correlation, this lack of correlation is also likely due to the regulatory processes that determine abundance of each transcript and protein (e.g., transcription, translation, degradation, etc.). lack of correlation may also be linked to the inherent differences in querying protein levels (2d gel electrophoresis and mass spectrometry) compared to transcript levels (microarray or rna-seq) or in other aspects of experimental design [23] . because -omics data tend to be large and complex, a network approach that links features in omics datasets can be very useful in gaining a high-level view of the system and identifying which features occupy positions of high centrality or which processes are co-regulated in a non-intuitive way. a number of methods exist for inferring networks based on correlation coefficient, mutual information, bayesian probability, random forest analysis and regression analysis.
networks of related features have been made for transcriptomic analyses of pathogens [24, 25] , for proteome analyses [17, 26] and for metabolomic or lipidomic analyses [27] . such networks have been used to identify specific processes or pathways that may be responding in tandem across a range of conditions, providing insight into coordination and cross-talk in biological systems as they respond to infection. they can also be used to identify genes of high importance to the organism under analysis [24, 28, 29] , predict gene function [30] [31] [32] or identify regulatory strategies of biological systems [33] [34] [35] . networks focusing on pathogens have identified which genes may be specifically important to virulence [24, 36] and other networks have been focused on expanding this approach further by querying not only the host or pathogen but specifically interactions between these organisms [37] [38] [39] . as networks seek to model complex biological systems with a number of different molecular features (e.g. transcripts, proteins, metabolites), the most accurate networks will be those that can incorporate multiple types of-omics data reporting on these molecules. the lack of apparent correlation between different-omics types (protein and transcript) described above has also emerged when networks of multiple data types are constructed. despite this hurdle, approaches that effectively integrate data across different technological platforms and biomolecular classes are likely to become more important as multi-omics data becomes easier to generate. platforms such as mass-spectrometry and rna-seq also result in missing values when molecules fall below an abundance threshold, and attempts have been made to develop methodologies to incorporate these data types when building networks using mutual information scores [40, 41] . with the goal of creating improved integrated networks, we examine here a number of network inference tools using several types of-omics data (mainly transcriptomic and proteomic) and identify those inference tools that create the most integrated networks, defined as those having the maximum number of edges connecting different types of-omics data, termed cross-type edges. previous studies have focused on ranking network inference methods based on accuracy and have found that genie3, a random forest method, created the best network in terms of its ability to link known regulator-target pairs in escherichia coli [42] . however, there has been no corresponding analysis that systematically ranks network inference methods by their ability to create integrated networks of transcripts and proteins. we perform this ranking here and find that, in most cases, genie3 is also the best inference method to create integrated networks of proteomic and transcriptomic data. we show that these networks, including the cross-type edges in the network, are accurate, and we use this approach to interrogate and compare networks inferred from data derived from antibodymediated entry of dengue virus into cells and from receptor-mediated entry. the methods presented here provide important guidance for constructing multi-omic networks representing host responses to infection, and offer strategies for inferring networks that can act as highly accurate and robust models of cellular systems. transcriptomic and proteomic samples were collected from cells at 2, 8, 16, and 24 hours postinfection. 
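identifying "genes of high importance" in such networks usually comes down to centrality measures. the toy graph below, with invented type-tagged node names, shows how degree and betweenness centrality can be computed with networkx; it is only a sketch of the kind of calculation used, not the analysis performed in this study.

```python
import networkx as nx

# toy association network; node names and edges are illustrative only
g = nx.Graph()
g.add_edges_from([("T:ifnb1", "T:stat1"), ("T:stat1", "P:stat1"),
                  ("T:stat1", "T:irf7"), ("T:irf7", "T:isg15"),
                  ("P:stat1", "P:mx1"), ("T:isg15", "P:mx1")])

degree = nx.degree_centrality(g)             # how connected a node is
betweenness = nx.betweenness_centrality(g)   # how often a node bridges other nodes

for node in sorted(g, key=betweenness.get, reverse=True):
    print(f"{node:10s} degree={degree[node]:.2f} betweenness={betweenness[node]:.2f}")
```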
we initially carried out functional enrichment of transcripts and proteins that showed statistically significant changes in expression as a function of infection. this showed that transcripts showing changes in expression in response to infection were enriched for processes such as cytokine signaling, tlr signaling, phagosomes, viral carcinogenesis and response to infection (s1 and s2 tables). proteins showing changes in expression as a function of infection were enriched for some of the same pathways as well as regulation of the cytoskeleton and antigen presentation (s3 table). we also examined replicate and treatment variation between samples as part of a quality analysis of the dengue dataset. using an adonis test we found that treatments (virus infection, cell type, etc.) accounted for 23.6% of the differences between samples when proteins were examined and 40.7% of the differences when transcripts were examined, with both of these results being statistically significant (p-value < 0.001 for both tests). in contrast, replicates accounted for < 3% of differences among samples when either proteins or transcripts were examined, with neither of these results being significant (p-values of 0.858 and 0.981, respectively). networks were then inferred for this dataset where transcripts and proteins were kept as separate nodes in the network. the same gene detected at the transcript or protein level is represented by two separate nodes in the network (though often a gene is only detected at either the transcript or protein level, not both). this was done for two reasons: (1) a gene may show different expression levels as a transcript or as a protein due to post-transcriptional regulation. combining transcript and protein levels may therefore give an inaccurate overall view of the gene's expression level. (2) within the cell, transcripts and proteins exist simultaneously and may interact with each other and affect each other's expression. to identify these putative interactions it will be necessary to also keep transcripts and proteins as separate entities in a network. initial networks using both proteins and transcripts as separate features in the network and the pearson correlation coefficient (pcc) as an association method resulted in an extremely low number of edges linking transcripts and proteins compared to those linking transcripts only or proteins only (s1 fig, s4 table). using a pcc threshold of 0.85 produced a network containing 100000 total edges and 5016 total nodes. within this network, 97,396 edges (97.3%) connected two transcripts, 2592 (2.59%) connected two proteins and only 12 (0.012%) connected a protein and a transcript (cross-type edges). this low number of cross-type edges persisted as additional pearson thresholds were examined, ranging from ~0.81 to ~0.96 and producing networks of 200000, 50000, 25000, 12500, 5000 and 2500 edges (fig 1a and 1b). we next examined whether the differing scales and distributions of proteomic vs. transcriptomic data could be the cause of the lack of cross-type edges. because of differences in the nature of microarrays compared to mass spectrometry and the downstream data analysis, expression values for transcripts are generally far lower than expression values for proteins (s2a fig). to artificially correct for this discrepancy, we multiplied all transcriptomic gene expression values by a factor of 2.6. this led to modified transcriptomic values that had a distribution and scale highly similar to that seen with proteomic expression values (s2b fig).
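as a concrete illustration of the construction just described, the sketch below builds a pearson-threshold network over combined transcript and protein profiles and counts transcript-transcript, protein-protein and cross-type edges. it is not the study's code; the dataframe layout, the use of absolute correlation and all names are assumptions made for the example.

```python
# sketch: pearson-correlation network over combined transcript and protein profiles,
# with a count of how many edges fall in each class (transcript-transcript,
# protein-protein, cross-type). `transcripts` and `proteins` are assumed to be
# pandas DataFrames (features x samples) sharing the same sample columns.
import numpy as np
import pandas as pd

def pcc_network(transcripts: pd.DataFrame, proteins: pd.DataFrame, threshold: float = 0.85):
    data = pd.concat([transcripts, proteins])                    # features x samples
    node_type = ["transcript"] * len(transcripts) + ["protein"] * len(proteins)
    corr = np.corrcoef(data.values)                              # feature-by-feature PCC
    edges = []
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            if abs(corr[i, j]) >= threshold:                     # absolute PCC is an assumption
                edges.append((data.index[i], data.index[j], node_type[i], node_type[j]))
    return edges

def count_edge_classes(edges):
    counts = {"transcript-transcript": 0, "protein-protein": 0, "cross-type": 0}
    for _, _, ti, tj in edges:
        counts[f"{ti}-{tj}" if ti == tj else "cross-type"] += 1
    return counts
```

the threshold argument plays the role of the pcc cutoffs (~0.81 to ~0.96) varied above to produce networks of different sizes.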
however, when using these modified transcriptomic values, we found that there was no change in the number of cross-type edges inferred (s4 table) . this demonstrates that the differing scales and distributions of transcriptomic vs. proteomic data seen here are not the cause of the lack of cross-type edges and that other factors are at play. there are biological functions that may cause proteins and transcripts to cluster together [22] , but it is also likely that the inference method of choice contributes to the segregation of the network. to explore this further we investigated other methods that may be able to create more integrated networks. aside from pcc, other methods that can be used to link features include spearman correlation coefficient, and those developed to infer transcriptional regulatory interactions from transcriptomic data, but are applicable to inference of associations between different data types, such as mutual information [33, 43] and random forest methods. we examined nine other network inference algorithms in addition to pcc to assess their ability to infer cross-type edges. these 10 inference methods are described in the methods and include correlation coefficient (pearson and spearman), mutual information (clr, methods from the minet package) and random forest (genie3) approaches. we specifically were interested in genie3 as it ranked as the top performer in the dream5 challenge and had been used in the past to infer networks of proteins and transcripts [23] . for each method we inferred seven different networks ranging in size from 200000 to 2500 edges to evaluate the performance of methods. networks with the same letter designation (networks a through g) are matched by size across inference methods. network designations, the edge cutoffs used, the sizes of networks and the number of cross-type edges are shown in s4 table and .sif files for each network are included as a supplementary dataset. while a number of the mutual information based methods improved upon pcc in drawing cross-type edges, genie3, the random forest method, was by far the best method for creating integrated networks (fig 2a) . it was 6.9-fold better than the next best method (minet, from the minet package in r) when examining the largest networks. when examining small networks of 5000 edges it produced 252 cross-type edges, compared to only two for clr, one for pcc and zero for all other methods (s3 fig, s4 table) . similar to pcc, increasing the threshold used to define an edge to make it more stringent led to an increase in the ratio of cross-type edges to total edges. we also found that increasing the edge threshold led to an increase in the ratio of cross-type edges drawn by genie3 compared to those drawn by minet (fig 2b) , at smaller network sizes genie3 has an even larger advantage over other network inference methods. as there were large differences in the ability of each method to infer cross-type edges, we next compared how much edge overlap there was between methods. using networks of 5000 edges from each method we determined the overlap of identical edges between networks. networks were compared using the jaccard similarity of edges (the ratio of intersection of edges/union of edges). fig 3 shows the similarity of all network pairs of the 10 inference methods we tested, with green representing higher jaccard similarity and red lower similarity. 
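because the random forest approach is the centerpiece of the comparison above, a genie3-style scoring step is sketched below: each feature is regressed on all other features with a random forest, and the resulting feature importances are used as candidate edge weights. this is a schematic re-implementation for illustration only, not the genie3 package, and the tree count and max_features setting are assumptions.

```python
# sketch of genie3-style scoring: regress each feature on all of the others with a
# random forest and treat feature importances as directed edge weights.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def random_forest_edge_scores(X: np.ndarray, n_trees: int = 500, seed: int = 0) -> np.ndarray:
    """X: samples x features. returns W where W[j, i] scores the influence of feature j on feature i."""
    n_features = X.shape[1]
    W = np.zeros((n_features, n_features))
    for i in range(n_features):
        predictors = np.delete(np.arange(n_features), i)
        rf = RandomForestRegressor(n_estimators=n_trees, max_features="sqrt",
                                   random_state=seed, n_jobs=-1)
        rf.fit(X[:, predictors], X[:, i])                 # predict feature i from all others
        W[predictors, i] = rf.feature_importances_
    return W
```

an undirected network of a fixed size can then be obtained by, for example, taking the larger of the two directed scores for each pair and keeping the top-scoring pairs, mirroring how networks of matched sizes are compared here.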
the composition of the genie3 network was closest to networks made using clr from the minet package and the original clr algorithm using resampling (methods); the jaccard similarity of edges when comparing genie3 to these methods was 0.174 and 0.184, respectively. (fig 2 caption: edge cutoffs were chosen so that all networks of a particular letter are the same size across inference methods; the number of cross-type edges for each network is shown on the y-axis and the methods on the x-axis; clr_o_rs indicates the original clr algorithm with resampling, to distinguish it from the version of clr used in the minet package; the aracne method removes edges of low association as part of its methodology, so only smaller networks containing high-scoring edges are retained with aracne; panel (b) shows the fold increase in cross-type edges inferred by the genie3 method compared to the minet method, the next best method at creating cross-type edges; for networks f and g, minet did not draw any cross-type edges, so these ratios are not displayed.) when examining the pattern of overlap between other algorithm pairs, two additional clusters emerge. the mrnet, minet, and aracne methods from the minet package cluster together, showing a jaccard similarity of approximately 0.42. clustering with pcc are the spearman correlation coefficient and two other methods from the minet package, mim and mrnetb, with the average jaccard similarity between these methods being 0.612. there are several clusters of methods that produce similar networks, but genie3 produces networks that have more unique edges distinct from networks produced by the other methods. therefore, genie3 is able to detect cross-platform relationships that are not detected by other methods. because there is some variation in the overlap of edges between methods (ratios range from 0.104 to 0.99), we next wanted to examine how accurate the networks were, to address the question, "are different network inference methods producing edges of different quality?" quality of networks can be difficult to assess as they generally provide far more information than has been experimentally verified. previous studies have graded networks based on their ability to link known regulator-target pairs in e. coli, and in this study genie3 emerged as the most accurate [42] . however, networks provide additional information beyond regulator-target pairs and we chose a wider metric based on linking genes in the same functional group. here, we focus on the number of edges that connect two features that are in the same functional group as determined using kegg pathways [44, 45] . because there is such a large difference in the ability of each inference method to infer cross-type edges, we used only transcript-transcript edges for this analysis. this approach revealed that there was moderate variability between network inference methods (fig 4) . mrnetb had the highest quality network according to this metric, with a functional edge ratio of 46.1%. minet showed the lowest functional edge ratio at 21.2%, while genie3 was near the middle with a functional edge ratio of 31.1%. as a baseline we also examined the quality of networks where the identity of all nodes had been randomized. all of the networks improved upon the randomization, which had a functional edge ratio clustered around 7.6 ± 0.012%.
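the two comparison metrics used above, jaccard similarity of edge sets and the functional edge ratio with a shuffled-node baseline, can be sketched as follows. the annotation mapping (node to a set of kegg/msigdb pathway ids) and all names are illustrative assumptions, not the study's code.

```python
# sketch: jaccard similarity between two edge sets, plus the functional edge ratio and a
# node-identity shuffle that preserves the edge structure (and therefore node degrees).
import random

def jaccard_edge_similarity(edges_a, edges_b):
    set_a = {frozenset(e) for e in edges_a}
    set_b = {frozenset(e) for e in edges_b}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def functional_edge_ratio(edges, annotation):
    annotated = [(u, v) for u, v in edges if u in annotation and v in annotation]
    if not annotated:
        return 0.0
    shared = sum(1 for u, v in annotated if annotation[u] & annotation[v])
    return shared / len(annotated)

def randomized_baseline(edges, annotation, seed=0):
    # permute node identities while keeping the edge list (and node degrees) fixed
    nodes = sorted({n for e in edges for n in e})
    shuffled = nodes[:]
    random.Random(seed).shuffle(shuffled)
    relabel = dict(zip(nodes, shuffled))
    return functional_edge_ratio([(relabel[u], relabel[v]) for u, v in edges], annotation)
```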
importantly, the method we identified as producing the greatest number of cross-type edges, genie3, produced edges of approximately the same quality as the other methods. this indicates that the significant improvement of genie3 in inferring cross-type edges does not come at the expense of network quality. this is consistent with prior studies using genie3 on bacterial systems, which have also found it to be highly accurate [42] . having shown that the large increase in cross-type edges produced by genie3 does not come at the expense of network quality, we next examined how the edge threshold affects the quality of the networks produced by genie3 as assessed by functional edge ratio. increasing the edge threshold leads to networks of higher quality, although even at the least stringent threshold used in this study the resulting 200000 edge network is still better than a randomized network of the same size, showing that networks can be extremely large and still contain high quality information (fig 5a) . it should also be noted that the quality of networks followed an exponential curve as network size was decreased; networks rapidly increase in quality at sizes of 50000 edges or less. the analysis in fig 5 examines only transcript-transcript edges, however, the specific advantage of genie3 is its ability to infer cross-type edges. cross-type edges are far sparser and also are likely to be affected by biological aspects that limit correlation between transcripts and proteins such as post-transcriptional regulation and differences in experimental platform. we next wanted to confirm that cross-type edges, despite being more difficult to infer, also convey high quality functional information. we therefore determined the functional edge ratio of cross-type edges specifically in networks inferred using genie3. this analysis showed that cross-type edges made by genie3 are of lower functional quality than transcript-transcript edges. however, they are still better than edges produced in randomized network at higher edge thresholds (fig 5b) and are therefore providing relevant information regarding the co-expression of transcripts and proteins. it should be noted that this kind of analysis, examining the functional quality of cross-type edges across edge thresholds, is impossible with all other network inference methods tested here as they simply do not create enough cross-type edges (and in many cases do not create any such edges) for analysis. multiple biological factors including post-transcriptional regulation, different rates and mechanisms of rna and protein degradation, as well as protein modifications contribute to differences in regulation of transcript abundance and protein abundance. however, specific properties of the data arising from the experimental approach used (microarray vs. mass spectrometry), with the distribution of the data, or with the missing values in one type of data also may be contributing to the observed low numbers of cross-type edges. to minimize differences among experimental platforms we next examined the effect of averaging biological replicates within the transcriptomic and proteomic datasets. we explored the effect of averaging in genie3 as well as in one representative of each of the three clusters we identified when interrogating network overlap (fig 3) , minet, pcc and the original clr_o_rs. 
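a minimal sketch of the replicate-averaging variant being explored here, assuming a features x samples table whose column names encode condition and replicate as "<condition>_rep<k>"; that naming scheme is an assumption made for the example.

```python
# sketch: collapse biological replicates to per-condition means before network inference.
import pandas as pd

def average_replicates(data: pd.DataFrame) -> pd.DataFrame:
    # strip a trailing "_rep<k>" suffix to recover the condition label for each column
    conditions = data.columns.str.replace(r"_rep\d+$", "", regex=True)
    return data.T.groupby(conditions.values).mean().T
```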
in all network inference methods and at all edge-threshold levels, we saw an increase in cross-type edges when replicates were averaged compared to being kept separate (fig 6) . pcc, minet, and clr showed the best overall improvement, far better than genie3, but this is likely because genie3 was already well-suited to drawing cross-type edges and there simply was little room for improvement. however, despite the increase in cross-type edges when replicates were averaged, the functional quality of these edges decreased and, in the case of the minet, clr_o_rs and genie3 networks, was no higher than that of a randomized network (s4 fig). averaging replicates also reduces the total number of edges in the network. on average, edge thresholds used to generate networks of the same size as those inferred when replicates were not averaged had to be lowered by 1.6-fold. the loss of input data that is seen when replicates are averaged likely makes it more difficult to draw meaningful edges between features, leading to generally smaller networks if the same edge threshold is used. therefore, it seems that averaging replicates prior to inferring networks using multi-omic data is not recommended. we also tested data that had been normalized to the mean of the row, as well as fold change data computed by comparing to time-matched mock-infected samples. normalizing by the mean of the row had no effect on network structure, which is not surprising since most of the inference methods used rely on assessing relative patterns of expression between features that will not be changed by mean normalization. when determining fold change values we normalized all abundance values to uninfected controls, and as a result fewer samples were used to infer the network. mock-infected samples were not included as they were used to determine fold change values for infected samples. we also used only proteins with no missing values when determining fold changes across conditions and samples. this smaller dataset contained 38 samples with abundance data for 5000 transcripts and 972 proteins. because this dataset was different from that used above (it has fewer samples and fewer proteins), we re-inferred networks using abundance values from this smaller dataset as well as fold change values. we tested this dataset using genie3 as well as minet and clr, as they were the second and third best methods for creating integrated networks in our previous analyses. interestingly, there was an increase in cross-type edges for both clr and minet when using this smaller dataset with abundance values. despite these increases, genie3 was still the best method for inferring cross-type edges in this smaller dataset using abundance values (fig 7, s4 table) . when using fold change values rather than abundance values, all three methods inferred far fewer cross-type edges. using fold change data, genie3 and clr were essentially equal in inferring cross-type edges, with genie3 having a slight advantage in very large (200000 edges) and very small (5000-2500 edges) networks and clr having a slight advantage with medium-sized (10000-12500 edges) networks (fig 7, s4 table) . both clr and genie3 were far better than minet at inferring cross-type edges using fold change data. the above data show that genie3 is the best method for creating integrated networks when using data from in vitro cell culture infections.
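one of the preprocessing variants compared above expressed each infected sample relative to its time-matched mock; a rough sketch of that computation is shown below. it assumes linear-scale abundance values and a hand-built mapping from each infected sample to its matched mock columns, both of which are assumptions for illustration (a log2 ratio is used here, though other fold-change conventions are possible).

```python
# sketch: fold change of each infected sample relative to the mean of its
# time-matched mock-infected samples.
import numpy as np
import pandas as pd

def fold_change_vs_mock(data: pd.DataFrame, mock_map: dict) -> pd.DataFrame:
    eps = 1e-9                                  # guard against division by zero
    fc = {}
    for infected, mock_columns in mock_map.items():
        mock_mean = data[mock_columns].mean(axis=1)
        fc[infected] = np.log2((data[infected] + eps) / (mock_mean + eps))
    return pd.DataFrame(fc)
```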
we next wanted to see if this advantage held with other datasets drawn from different experimental setups and with other kinds of -omics data, namely lipidomics. to do so we tested other datasets with genie3 as well as minet and clr. we first tested another experiment looking at a different strain of dengue virus (denv1) infecting the same cell type (u937). this experiment not only included transcriptomic and proteomic data, as the denv4 experiment used above did, but also lipidomic data. when looking at edges linking transcripts and proteins in this new dataset, genie3 was again the highest performer, identifying 81 and 177 cross-type edges in 2500-edge and 5000-edge transcript-protein networks, respectively, compared to only 1 and 8 edges for clr and only 1 edge in each network for minet (s5 fig). as this data included lipidomic data, we next looked at cross-type edges in a network comprised of transcripts and lipids; here genie3 again performed the best, identifying 15 and 34 cross-type edges in 2500-edge and 5000-edge transcript-lipid networks, respectively, compared to only 1 and 13 for clr and 3 and 14 for minet (s5 fig). finally, we also looked at a network comprised of proteins and lipids. here the advantages of genie3 are not as apparent; minet performs slightly better, though both are better than clr. genie3 was able to infer 53 and 166 edges compared to 62 and 188 edges for minet (clr inferred 2 and 28 edges) (s5 fig). we next examined other datasets that were different from infection of cells with a virus in vitro. we examined two datasets looking at in vivo infection of mice with influenza virus. with both of these datasets genie3 was again, by far, the best method for inferring cross-type edges (s6 fig). in addition, we examined both of these datasets specifically looking at the functional edge overlap of cross-type edges. we found, similar to our analysis of dengue data (fig 5b) , that the functional edge overlap of cross-type edges from networks of this influenza data is significantly higher than that seen in a randomized network (s7 fig). the fact that we see this across multiple datasets from very different experiments (human vs. mouse, in vitro vs. in vivo, various pathogens, etc.) indicates that cross-type edges contain biologically relevant data and that this is true across multiple disparate datasets. we also examined another mouse dataset looking at in vivo infection with west nile virus (wnv). again, with this dataset genie3 was the best at inferring cross-type edges (s6 fig). finally, we also examined a dataset comprised of transcriptomic and proteomic analysis of ovarian cancer tumors [22] . here, each tumor represented a unique condition and we chose the 2000 proteins with the highest variability across tumor samples based on standard deviation. we then combined this dataset with the cognate transcripts matching these proteins for which we had data (a total of 1641 transcripts). with this dataset we found that, across network sizes, there was no significant advantage in using genie3, clr, or minet regarding cross-type edges. with very large networks (200000 edges) genie3 was very similar to minet and both were slightly superior to clr. with the smallest network examined (2500 edges), clr and minet were slightly superior to genie3, with minet and clr each inferring ~500 cross-type edges for this network compared to 347 for genie3 (s6 fig).
there were also far more cross-type edges inferred with this dataset, with ~22% of edges linking a protein and a transcript when examining large networks (200000 edges) and ~7% when examining small (5000-2500 edges) networks. while this may have been affected by including a large number of protein-transcript cognate pairs from the same gene, a similar analysis using 2000 random proteins and 2000 random transcripts also found a higher number of cross-type edges, ~20%. this is in contrast to networks examining dengue infection, where only ~3% of the edges were cross-type in a large network inferred with genie3 and ~5% were cross-type in a small network. having shown with our analysis of dengue virus infection that genie3 is the inference method that is best able to create highly integrated and accurate networks of proteomic and transcriptomic data, we applied this approach to a comparison of networks derived from receptor-mediated dengue virus infection and antibody-mediated dengue virus infection. studies of denv3 using multiplicities of infection similar to those used here have shown that dengue virus can benefit from antibody-dependent enhancement during infection. non-neutralizing monoclonal antibodies against dengue virus already present in the host can increase invasion of this virus into cells of the immune system, including monocytes, dendritic cells and macrophages [46] [47] [48] . infection by the antibody-mediated route can also lead to more serious disease development, with increased suppression of the host antiviral response [46] . because of these differences in viral entry into the cell and subsequent changes in disease, we examined networks inferred using only data from dengue infection via an antibody-mediated route and compared that to a network inferred using only data from dengue infection via a receptor-mediated route. both transcriptomic and proteomic data were used to infer integrated networks containing 5000 edges using genie3, one for antibody-mediated infection data and one for receptor-mediated infection data. network centrality in biological networks is an indicator of the importance of the node (transcript or protein) to the function of the network [16, 49, 50] . pathways with higher centrality in host response networks may be more important to the host response and/or may be more coordinated in terms of their activity. to that end, the centrality of features in kegg pathways was compared between the two networks to identify functions that may have higher or lower centrality in an antibody-mediated vs. receptor-mediated network. we used betweenness, the number of shortest paths through a network that pass through the gene in question, as a measure of centrality. studies of centrality in networks of other organisms (including those looking at human pathogens) have shown that betweenness can be used as a proxy for importance or essentiality, with genes of higher betweenness being more important [24, 29] . there were a large number of functional categories whose genes showed differences in betweenness between the two networks (s5 table) , including those responding to granulocyte-macrophage colony-stimulating factor (gm-csf) and il-4. features that showed decreased expression in response to gm-csf or il-4 had, on average, a 2.2-fold higher centrality value in an antibody-mediated network compared to a receptor-mediated network. this difference was significant using a wilcoxon rank-sum test with a p-value of 0.0019 (fdr of 0.144).
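the centrality comparison described above can be sketched as follows: compute betweenness for each node in the antibody-mediated and receptor-mediated networks and compare the values of a functional gene set with a wilcoxon rank-sum test. the edge lists and gene-set membership are assumed inputs; this is an illustration, not the study's code.

```python
# sketch: compare betweenness centrality of a gene set between two networks.
import networkx as nx
from scipy.stats import ranksums

def pathway_betweenness(edges, pathway_nodes):
    g = nx.Graph()
    g.add_edges_from(edges)
    bc = nx.betweenness_centrality(g)
    return [bc[n] for n in pathway_nodes if n in bc]

def compare_pathway_centrality(edges_antibody, edges_receptor, pathway_nodes):
    ab = pathway_betweenness(edges_antibody, pathway_nodes)
    rec = pathway_betweenness(edges_receptor, pathway_nodes)
    return ranksums(ab, rec)        # test statistic and p-value
```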
as with other networks inferred with genie3, both the antibody-mediated and receptor-mediated networks were highly integrated. within the category of features showing decreased expression in response to gm-csf or il-4, each network contained both proteins and transcripts, and several of these had cross-type edges. to further examine the specific contribution from integrated networks made from multi-omic data with cross-type edges (as opposed to networks inferred from only one type of -omic data), we also inferred the same antibody-mediated and receptor-mediated networks using only transcriptomic data. when comparing these transcriptomic-only networks we were not able to identify any significant difference in the centrality of genes involved in the response to gm-csf or il-4. in addition, other pathways showing differences in centrality that were related to viral infection, such as response to reovirus infection and cytokine response, were also not found in a transcript-only network but only in a network of integrated transcriptomic-proteomic data (s6 table) . this observation, combined with the fact that both transcripts and proteins were among the features involved with gm-csf or il-4 showing changes in centrality, demonstrates that integration of multi-omic data into a network is contributing to the biological results shown here. it is important to note that this in-depth analysis is carried out only on a single dataset, and while we do see that cross-type edges contain biologically relevant data across other datasets (s7 fig) , it will be of interest in future studies to examine these datasets with the level of detail we apply to the dengue study here. this would reveal whether, and to what degree, cross-type edges can aid in the biological interpretation of networks and what they can show about the response of hosts to pathogens or to other stress conditions. based on previous analysis of these features, they are likely involved in anti-inflammatory pathways as they are downregulated when t cells proliferate upon being stimulated with il-4 and gm-csf [51] . as centrality is a measure of importance, the more central position of these features in an antibody-mediated entry network suggests that this route of infection for the virus may lead to downregulation of innate immune pathways and lead to the more enhanced disease outcome seen with antibody-mediated vs. receptor-mediated entry. other studies have also identified an upregulation of anti-inflammatory cytokines when antibody-mediated entry is undertaken by dengue [48, 52] . the lack of integration in networks comprised of multiple -omics types can be a serious hindrance to using these networks as accurate, robust and predictive models of biological systems. using in vitro infection data, we show here that genie3, a random-forest-based method, generates, by far, the most integrated networks of the ten inference methods tested. this integration does not come at the cost of edge quality, as genie3 performs as well as other methods at linking features of similar functional roles. network analysis is a powerful way to view complex systems such as host-pathogen interactions. using genie3, we built integrated networks of data derived from an antibody-mediated dengue infection experiment and a receptor-mediated dengue infection experiment and highlight innate immune pathways that differ between networks representing the two viral entry mechanisms.
even when using genie3 cross-type edges remain a minority in the in vitro dataset we examine, comprising only 3.5-5.4% of total edges in the network. there are a number of reasons why such cross-type edges may not be as prevalent as other edges with the most likely reason being regulatory patterns of the cell that are unique to either proteins or transcripts. however, cell timing of transcription and translation as well as methodological elements of all of network inference approaches examined here may also hinder the formation of cross-type edges. when collecting transcriptomic and proteomic data from a biological system such as an infected cell, transcripts identified as highly expressed or changing their expression patterns act as templates for functional proteins that will be translated at some later time point. this is likely to be more prominent in early time points when systems are still adjusting to the perturbations of the experiment. because of this time lag, transcripts coding for genes belonging to certain functional pathways will show their expression changes earlier than corresponding proteins of the same pathway. this lag may make it more difficult to identify instances of coexpression between transcripts and proteins of the same functional category. in contrast, transcript pairs belonging to the same functional category and protein pairs belonging to the same functional category, being expressed at roughly the same time, are more likely to have edges. this is likely also the reason why cross-type edges drawn by genie3 are not as accurate as edges drawn between two transcripts (fig 5b) . it may be of interest to explore methods of building multi-omic networks that attempt to correct this time lag [53] . this might be done by combining early time point transcriptomic data and late time point proteomic data. however, the challenges of combining different -omics datasets from different time points and accurately correcting for the protein translation time lag would be difficult. in addition, we demonstrate here only networks of two omics types (transcript-proteomic, lipid-proteomic, etc.). however, future analyses could infer networks of all three -omics types at once, making a potentially more robust network containing more data. despite these challenges and the small number of cross-type edges, genie3 does emerge as the best method for inferring integrated networks, specifically of proteomic and transcriptomic data. one of the reasons for this may be the fact that, unlike the other nine methods used, genie3 relies on a random forest approach to generate networks. random forest is an ensemble method that can combine strengths of many weak signals to build better models, and is therefore likely more sensitive to biological signals. it is also able to work with different types of input data and does not require linear relationships between features to establish edges (though this quality is shared by several other methods we examined). another possible advantage of genie3 may lie in its ability to handle values of zero supplied as a substitute for missing values in protein data. when networks were inferred with no missing values for any protein under any condition both clr and minet were able to infer more cross-type edges compared to a dataset containing proteins with missing values in some of the conditions. 
however, genie3 was not affected by this difference in the dataset suggesting that this approach may also be especially useful when inferring networks from datasets with missing values, a common feature of proteomic data. another observation of interest that emerged from these studies was the functional quality of genie3 networks and how this accuracy changes as a function of edge cutoff; smaller, more stringent networks are more accurate than larger networks. this increase in network accuracy among smaller networks was also seen when clr_o_rs and minet were used. while all of the networks used here, aside from those that averaged replicates, showed higher accuracy compared to a randomized network, if edge cutoffs are increasingly made less stringent, a point will eventually be reached where the functional edge ratio is no higher than a randomized network. examining networks made by genie3 suggests that this point will be reached with a network of 371535 edges (representing the top 0.77% of all possible edges). the choice of edge threshold is somewhat arbitrary but our analysis suggests that there is a hard lower limit to network size if quality is to be maintained. while making edge threshold less stringent always leads to larger networks, defining a "floor" will help researchers reach a balance between the maximum number of edges and networks of the best quality. we show here that genie3 is the best method for inferring networks from multiple data types, but when networks are built from the same-omics type other methods we tested here may be a better choice. as we described above, mrnetb created networks of the highest functional quality when examining transcript-transcript edges (fig 4a) . other methods also have the advantage of speed with genie3 being, by far, the slowest of the 10 methods we tested, taking several hours to run while the other methods, including mrnetb, ran in a matter of minutes. among the networks tested, only pearson and spearman distinguish between positive and negative correlation. among these two, spearman was more accurate with a functional edge ratio of 46%, compared to pcc with a functional edge ratio of 36%, suggesting that it may be the superior method when integration is not needed but direction of correlation is. choice of methods depends on experimental needs; mrnetb is fast and accurate but is poor at integration and does not determine whether nodes are positively or negatively correlated. genie3 makes highly integrated networks but is slow and again does not determine whether nodes are positively or negatively correlated. spearman does determine whether nodes are positively or negatively correlated and is fairly accurate but is very poor at integration. it is also important to point out that the dataset we have here, containing nearly 100 datapoints is actually somewhat small compared to other studies using network inference. while there have been several studies that use datasets of a size similar to ours or smaller [54] [55] [56] most studies contain many hundreds of samples. it will be of interest to determine how the choice of network inference method, particularly the use of genie3, may be affected by very large sample sizes. it is also likely that autocorrelation may exist in these samples as they are collected from a relatively short time frame. however, since all of the methods are compared across the same datasets if autocorrelation is present it would affect all network inference methods equally. 
the advantages of genie3 may be less useful when examining multi-omics data from different experimental setups, specifically in vivo datasets or multi-omic data where both-omics types are drawn from a mass-spectrometry technology. here, multi-omics data from short term, in vitro, time course experiments led to far fewer cross-type edges compared to clinical data collected directly from patients. there are likely many reasons for this but it may be related to the high amount of sampling during in vitro experiments compared to clinical data, with in vitro experiments collecting several samples over a matter of hours rather than over weeks or months as is often the case with clinical data. this increased sampling, used with in vitro experiments, may capture cell responses as they are initially responding to stress perturbations with many regulatory signals happening over a short time frame and leading to more discrepancies between cognate transcript-protein pairs. the resulting paucity of cross-type edges amplified the advantage of genie3, making this method the only choice for this data type in creating networks with a significant number of cross-type edges. in contrast, data collected from clinical experiments is over a much longer time frame or, in some cases, only a single sample is collected. in this second case regulatory pathways controlling transcript and protein expression may have more time to converge, leading to more correlation in the expression and regulation of cognate transcript-protein pairs, more cross-type edges and the option of using other network inference tools. it may also be that uniform nature and response of cells used in in vitro infection experiments may expose and amplify differences between proteins and transcripts. in contrast, samples collected from patients and tumors are a mixture of cell types and responses, which may smooth out inherent differences between transcripts and proteins and lead to more cross-type edges. it is also possible that autocorrelation, which exists in many of the in vitro datasets examined here but not the in vivo tumor sets may also contribute to the larger number of cross-type edges seen with genie3. when examining such clinical data, genie3 is no better or worse with regard to cross-type edges than other methods of inferring networks such as clr and minet. it should be noted that all three of these methods are still superior to correlation based methods such as spearman or pearson. genie3 and minet also find approximately the same number of cross-type edges when examining a proteomic-lipidomic dataset. the fact that a similar result emerges with these two methods only when examining proteomic-lipidomic data, and not proteomic-transcriptomic data or lipidomic-transcriptomic data, suggests that the advantages of genie3 may be in linking multi-omic data across different methodological platforms. when one kind of multi-omic data comes from a mass spectrometry platform and one kind comes from a microarray platform (such as proteomic-transcriptomic data or lipidomics-transcriptomic) genie3 is by far the best option. however, when both-omics types come from a mass spectrometry platform (such as proteomic-lipidomic) then minet is able to identify slightly more cross-type edges compared to genie3. this suggests that when examining two-omics types that are both mass spectrometry based (proteomics, lipidomics, metabolomics, phospho-proteomics) minet may be another viable option. 
it is important to note however that genie3 was able to infer a higher functional edge overlap than minet and thus may be more accurate. networks made from multi-omics data are some of the best ways to analyze large datasets and provide a high level but gene and protein-specific view of biological systems. the work presented here provides several approaches for making integrated networks of high functional quality that link different data types. our approach has also revealed the differing behavior of anti-inflammatory genes in an antibody-mediated versus receptor-mediated network of dengue infection, information that will be of value as new antiviral treatments are developed for these diseases. while future directions will focus on increasing cross-type edges further and in using network approaches to examine other aspects of viral infections these experiments provide some of the first systematic ranking of methods used to create integrated networks and lay out strategies for how to infer them and which specific methods to use. viruses were propagated in c6/36 aedes albopictus cells grown in minimal essential medium (gibco, grand island, ny) at 32˚c. u937 cells were transfected with a lentivirus vector expressing dc-sign, or passaged in parallel, and sorted by facs. u937 and u937+dc-sign cells were maintained in rpmi-1640 (gibco) at 37˚c. growth media were supplemented with 5% fetal bovine serum (hyclone, logan, ut), 0.1 mm nonessential amino acids (gibco), 100 u/ml penicillin and 100 mg/ml streptomycin (gibco). u937 and u937+dc-sign media was supplemented with 2 mm glutamax (gibco), 10mm hepes (cellgro, manassas, va). u937 +dc-sign media included 2-mercaptoethanol (sigma, st louis, mo). all infection media contained 2% fetal bovine serum. cells were incubated in the presence of 5% co2. infectious clones of wild-type strains were constructed using a quadripartite cdna clone. the denv4 (sri lanka 92) strain was used in the present study. full-length cdna was transcribed into genome-length rnas using t7 polymerase and recombinant viruses isolated in c6/36 cells as previously described. virus was then passaged twice on c6/36 cells, centrifuged to remove cellular debris, and stored at −80˚c as a working stock. fc receptor and dc-sign mediated entry were titered, and at 24 hours post-infection, conditions were optimized to obtain 60% infection for both entry mechanisms. each experiment had five infection conditions: u937+dc-sign mock infected, u937+dc-sign denv infected, u937 mock infected, u937 denv+ab infected and u937 denv + ab isotype control (a control condition for ab-mediated infection). each condition was examined under four timepoints (2, 8, 16 and 24 hours post-infection). each timepoint for each condition was examined with five replicates with a few exceptions, four replicates were collected for the following conditions/timepoints: 2 hours post-infection u937+dc-sign mock infected, 8 hours post-infection u937+dc-sign mock infected, 2 hours post-infection u937 denv+ab infected, 24 hours post-infection u937 denv+ab infected and 2 hours post-infection u937 denv + ab isotype control. with all conditions, timepoints and replicates there were 95 data points comprising this dataset. the virus:antibody/mock mixtures were incubated for 45 minutes in 12-well plates at 37˚c. following this incubation, 1x10 6 u937 or u937+dc-sign cells were added and the infection was allowed to proceed for 2 hours at 37˚c. 
the 2 hour time point was collected, and the rest of the cells were centrifuged for 2 minutes at 450xg and resuspended in fresh infection media. the collection timepoints were 2, 8, 16, and 24 hours post infection. cells were initially grown and then placed into separate wells for infection assays, with rna and protein being collected from cells in different wells. for each timepoint cells collected for rna were pelleted and resuspended in trizol. cells collected for proteomics were filtered, washed twice with sterile pbs, pelleted and dried. samples were stored at -80˚c until processed. infected cell samples were pelleted and frozen. samples were sent to arraystar (rockville, md) for rna extraction and microarray analysis with the agilent human 4 x 44k gene expression array. total rna from each sample was quantified using the nanodrop nd-1000 and rna integrity was assessed by standard denaturing agarose gel electrophoresis. for microarray analysis, the agilent array platform was employed. the sample preparation and microarray hybridization were performed based on the manufacturer's standard protocols. briefly, total rna from each sample was amplified and transcribed into fluorescent crna using the manufacturer's agilent's quick amp labeling protocol (version 5.7, agilent technologies). the labeled crnas were hybridized onto the whole human genome oligo microarray (4 x 44k, agilent technologies). after washing the slides, the arrays were scanned by the agilent scanner g2505c. background correction was carried out on microarray samples using the maximum likelihood estimation for the normal-exponential convolution model [57] , with an offset of 50, as implemented in bioconductor's [58] limma package [59] . samples were then normalized using quantile normalization so that the entire empirical distribution of each column was identical, this includes log2-transformation of the data. outliers among samples were detected using intensity distribution and a boxplot graph followed by hierarchical clustering and pca analysis of expression profiles using the mva package [60] . any sample that did not visually cluster with other samples of the same condition was removed. for this dataset, only one sample was removed as a result of outlier detection, 1 replicate of the u937 denv+ab 24 hour timepoint. to identify differentially expressed probes, we use bioconductor's limma package [59] , which calculates a p-value based on a moderated t-statistic (recommended for experiments with small sample sizes) and then adjusts it to correct for the effects of multiple hypothesis testing. to adjust the p-value, we use the method of benjamini and hochberg [61] that controls the false discovery rate (fdr or q-value). infected cell samples were pelleted and a 2:1 mixture of chloroform:methanol was added to each sample at a ratio of 5:1 over the sample volume, vortexed, and incubated for 5 minutes on ice. samples were then centrifuged at 12,000 rpm at 4˚c for 10 minutes. the upper (aqueous) and lower (organic) layers were removed to fresh tubes, and dried using a speed-vac. the protein interlayers were washed with ice cold methanol, vortexed, and centrifuged at 12,000 rpm for 10 minutes. following centrifugation, the methanol was removed from the samples and each was allowed to dry completely. pellets were then rehydrated in 100 ul of 8 m urea in 50 mm nh 4 hco 3 buffer. protein concentrations were then determined using the bicinchoninic acid (bca) protein assay (thermofisher pierce, waltham, ma). 
dithiothreitol (dtt, thermofisher pierce, waltham, ma) was added to each sample to obtain a 5 mm concentration and the samples were incubated at 37˚c for 1 hour with shaking at 800 rpm on a thermomixer (eppendorf, hauppauge, ny). iodoacetamide (thermofisher pierce, waltham, ma) was added to a final concentration of 40 mm in each sample and then incubated again at 37˚c for 1 hour in the dark with shaking at 800 rpm on a thermomixer. the samples were then diluted 8-fold with 50 mm nh4hco3 and cacl2 was added to obtain a 1 mm concentration in each sample. trypsin (1:50 trypsin:protein w:w, usb affymetrix, cleveland, oh) was added and the samples were incubated for 3 hours at 37˚c with 800 rpm shaking on a thermomixer. the samples were then flash frozen in liquid nitrogen and stored at -70˚c until the solid phase extraction (spe) cleanup was performed. samples were thawed, centrifuged at 21,000 x g for 5 minutes at rt and then subjected to c18 spe cleanup on strata c18-e 50 mg/1 ml columns (phenomenex, torrance, ca) using an automated spe station (gilson gx-274, middleton, wi). briefly, the columns were conditioned with 3 ml methanol followed by 2 ml of 0.1% trifluoroacetic acid (tfa). after the samples were loaded on the columns, they were rinsed with 4 ml of 95:5 water:acetonitrile with 0.1% tfa. the columns were allowed to dry, after which the samples were eluted with 1 ml of 80:20 acetonitrile:water with 0.1% tfa. the samples were concentrated using a speed vac (thermofisher scientific, waltham, ma) to 50 ul and a final bca protein assay was performed to quantitate the peptide mass. the samples were diluted to 0.1 ug/ul in water for analysis by lc-ms/ms. analysis consisted of reverse-phase lc-ms/ms using a waters nano-acquity m-class dual-pumping uplc system (milford, ma) configured for online trapping and interfaced with a q-exactive plus hybrid quadrupole/orbitrap mass spectrometer (thermo scientific, san jose, ca). both trapping and analytical columns were packed in-house using 360 μm o.d. fused silica (polymicro technologies inc., phoenix, az) with 1-cm sol-gel frits for media retention and contained jupiter c18 media (phenomenex, torrance, ca) in 5 μm particle size for the trapping column (150 μm i.d. x 4 cm long) and 3 μm particle size for the analytical column. the initial dataset contained 100 lc-ms instrument runs associated with 100 unique biological samples and 41,947 unique peptides that had at least 2 observations across the 100 biological samples. this corresponded to 5,705 proteins, of which 5,699 were human and 6 were viral. the algorithm rmd-pav [62] was used to identify any outlier biological samples, of which 4 were identified (one replicate each from the control for the receptor-mediated dengue virus infection at the 2 hr and 8 hr timepoints, one replicate from the control for the antibody-mediated dengue virus infection at the 2 hr timepoint and one replicate from the antibody-mediated dengue virus infection at the 2 hr timepoint) and confirmed via pearson correlation. peptides with inadequate data for either qualitative or quantitative statistical tests were also removed from the dataset [63] , resulting in a final dataset ready for normalization that included 96 unique biological samples and 29,694 measured unique peptides corresponding to 4,333 proteins (4,333 human/0 viral). spans was used to choose a normalization procedure, and it selected l-order statistics (los) median centering [64] with a parameter of 0.2 for normalization.
peptides were evaluated with analysis of variance (anova) with a dunnett test correction and a bonferroni-corrected g-test to compare each virus to the associated mock within each time point. to perform signature-based protein quantification with bp-quant [65] , each peptide was categorized as a vector of length equal to the number of viruses being evaluated. if all comparisons for all time points are 0 for a specific virus, it is considered non-changing and given a value of 0. if more time points show an increase in virus relative to mock than show a decrease, the peptide is categorized as +1; conversely, -1 is given when decreases in virus relative to mock predominate. bp-quant was run with a default parameter of 0.9. all proteins were then analyzed using the same methodology as the peptides: anova with a dunnett test correction and a bonferroni-corrected g-test to compare each virus to the associated mock within each time point. for dengue-derived networks, 5000 human transcripts that were differentially regulated (> 2-fold change, q-value < 0.05) when comparing virus to time-matched mock samples were selected. we also chose 1930 human proteins that were differentially regulated (> 2-fold change, q-value < 0.05) and had missing values in no more than half of the samples examined. missing values that were present in the protein dataset were replaced with "0" to allow compatibility with all network inference methods. the complete dengue dataset consisted of 6930 features examined under 20 conditions, with either four or five biological replicates queried for each condition, for a total of 95 columns of data. networks were inferred using 10 feature association metrics that were chosen based on several criteria: we chose methods that had been used in the past to infer networks [23, [66] [67] [68] , methods that used a variety of mathematical approaches to infer networks (such as mutual information, correlation, and random forest), and methods that had previously been examined in the dream5 challenge, which ranked networks by their ability to link known regulator-target pairs in escherichia coli [42] . the inference methods chosen were the pearson correlation coefficient (pcc), the spearman correlation coefficient, context likelihood of relatedness (clr), a mutual-information-based metric [33] , six additional mutual information methods in the minet r package (aracne, clr, mim, minet, mrnet, mrnetb) [43] , and genie3, a random forest method that computes the links between each gene p and all other genes j ≠ p as a function of how predictive each gene j is of p [69] . note that the minet implementation of clr is somewhat different from the original implementation described by faith et al. [33] , so it was included here as a related, but distinct, method. the original method uses a binning step to transform continuous expression values into categorical values before calculation of mutual information and z-scores; the minet implementation removes this binning step. we also use a resampling approach on the original method [70] that is not used in the minet implementation. briefly, this resampling approach consists of inferring several incident networks after randomly removing 20% of the data columns. a consensus network was then made from these incident networks by averaging z-scores; this consensus network was used for downstream analysis. this resampling with the original clr algorithm is referred to as clr_o_rs here to distinguish it from the clr program used in the minet package.
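the resampling consensus described above can be sketched generically: score the network on repeated subsamples that keep roughly 80% of the samples, then average the resulting score matrices. here `score_fn` stands for any feature-association scorer (for example the random-forest sketch earlier); the round count and the use of raw scores rather than z-scores are assumptions made for illustration.

```python
# sketch: consensus of networks inferred on repeated 80% subsamples of the data.
import numpy as np

def resampled_consensus(X: np.ndarray, score_fn, n_rounds: int = 10,
                        keep_fraction: float = 0.8, seed: int = 0) -> np.ndarray:
    """X: samples x features. returns the average feature x feature score matrix."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    keep = max(2, int(round(keep_fraction * n_samples)))
    total = None
    for _ in range(n_rounds):
        idx = rng.choice(n_samples, size=keep, replace=False)   # drop ~20% of samples
        scores = score_fn(X[idx, :])
        total = scores if total is None else total + scores
    return total / n_rounds
```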
transcripts and proteins were preserved as independent nodes in all networks. for each method, a threshold was used for each network to define an edge: pairs of features with edge weights above this threshold were included in the network, with an edge drawn between them. once edges were determined to be above the threshold and included in the network, all edges were considered equivalent (unweighted). thresholds used to define edges were chosen for each inference method so that networks of identical size could be compared. networks ranged in size from 200000 to 2500 edges, representing the top 0.42% and 0.005% of all possible edges, respectively (s4 table) . for all networks, transcripts and proteins were kept as separate nodes in the network: the same gene can be represented by either a transcript or a protein (or both), and each of these is a unique node in the network. we also inferred additional networks using transcriptomic and proteomic data from other studies. these include a publicly available dataset of in vivo infections of mice with west nile virus (wnv) (geo accession numbers: gse77192, gse77193 and gse78888), publicly available datasets of in vivo infections of mice with influenza (geo accession numbers gse68946 and gse71759) and a dataset examining tumor samples from cancer patients [22] . to compare how well networks agreed with known data and how accurate they were, we compared the ability of networks to draw edges between features within the same functional category. annotation of features (transcripts and proteins) was obtained from the kegg database based on the molecular signatures database (msigdb) [44, 45] . ratios were determined as the number of edges connecting two annotated features in the same functional category divided by all edges connecting two annotated features, with these ratios expressed as a percentage. this ratio can range from 0% (no edges between annotated genes link genes in the same functional category) to 100% (all edges between annotated genes link genes in the same functional category) and is referred to as the functional edge ratio; a higher functional edge ratio means more edges linked features in the same functional category and the network is more accurate. when randomizing networks for this comparison, an identical node-degree distribution was used and only the identity of the individual nodes was shuffled. (figure caption: the ratio of cross-type edges connecting annotated features in the same functional category to all cross-type edges connecting annotated features, for inference methods using averaged replicates, is displayed on the y-axis, with the network on the x-axis; blue bars represent networks made with genie3, red bars represent networks made with clr (original algorithm with resampling) and green bars represent networks made with minet; networks 5, 6 and 7 made using clr and networks 6 and 7 made using minet had no cross-type edges with functional annotation, so these bars are not displayed.) the landscape of viral proteomics and its potential to impact human health. expert review of proteomics a comprehensive collection of systems biology data characterizing the host response to viral infection transcriptomic analysis of the dialogue between pseudorabies virus and porcine epithelial cells during infection.
bmc genomics the gonococcal transcriptome during infection of the lower genital tract in women transcriptomic analysis of responses to cytopathic bovine viral diarrhea virus-1 (bvdv-1) infection in mdbk cells quantitative proteomic analysis of host-pathogen interactions: a study of acinetobacter baumannii responses to host airways proteomic profiling of serologic response to candida albicans during hostcommensal and host-pathogen interactions global quantitative proteomic analysis profiles host protein expression in response to sendai virus infection impact of salmonella infection on host hormone metabolism revealed by metabolomics global metabolomic analysis of a mammalian host infected with bacillus anthracis lipidomics reveals control of mycobacterium tuberculosis virulence lipids via metabolic coupling comparative lipidomics analysis of hiv-1 particles and their producer cell membrane in different cell lines quantitative proteomics and lipidomics analysis of endoplasmic reticulum of macrophage infected with mycobacterium tuberculosis integrated transcriptomic and proteomic analysis of the physiological response of escherichia coli o157:h7 sakai to steady-state conditions of cold and water activity stress novel insights into human respiratory syncytial virus-host factor interactions through integrated proteomics and transcriptomics analysis temporal proteome and lipidome profiles reveal hepatitis c virus-associated reprogramming of hepatocellular metabolism and bioenergetics topological analysis of protein co-abundance networks identifies novel host targets important for hcv infection and pathogenesis a multi-omic systems approach to elucidating yersinia virulence mechanisms systems analysis of multiple regulator perturbations allows discovery of virulence factors in salmonella the utility of protein and mrna correlation network analysis of epidermal growth factor signaling using integrated genomic, proteomic and phosphorylation data integrated proteogenomic characterization of human high integration of omic networks in a developmental atlas of maize bottlenecks and hubs in inferred networks are important for virulence in salmonella typhimurium transcriptome comparison and gene coexpression network analysis provide a systems view of citrus response to 'candidatus liberibacter asiaticus' infection gene network and proteomic analyses of cardiac responses to pathological and physiological stress a novel untargeted metabolomics correlation-based network analysis incorporating human metabolic reconstructions separating the drivers from the driven: integrative network and pathway approaches aid identification of disease biomarkers from high-throughput data. disease markers integrated in silico analyses of regulatory and metabolic networks of synechococcus sp. 
pcc 7002 reveal relationships between gene centrality and essentiality guilt by association" is the exception rather than the rule in gene networks biological cluster evaluation for gene function prediction combining guilt-by-association and guilt-by-profiling to predict saccharomyces cerevisiae gene function large-scale mapping and validation of escherichia coli transcriptional regulation from a compendium of expression profiles inferring the relation between transcriptional and posttranscriptional regulation from expression compendia a transcriptional mirna-gene network associated with lung adenocarcinoma metastasis based on the tcga database the effect of inhibition of pp1 and tnfalpha signaling on pathogenesis of sars coronavirus a comprehensive analysis of the transcriptomes of marssonina brunnea and infected poplar leaves to capture vital events in host-pathogen interactions dual-species transcriptional profiling during systemic candidiasis reveals organ-specific host-pathogen interactions interspecies protein-protein interaction network construction for characterization of host-pathogen interactions: a candida albicans-zebrafish interaction study enabling network inference methods to handle missing data and outliers missing data imputation by k nearest neighbours based on grey relational structure and mutual information. applied intelligence wisdom of crowds for robust gene network inference minet: a r/bioconductor package for inferring large transcriptional networks using mutual information the molecular signatures database (msigdb) hallmark gene set collection kegg: kyoto encyclopedia of genes and genomes antibody-dependent enhancement of dengue virus infection inhibits rlr-mediated type-i ifn-independent signalling through upregulation of cellular autophagy evidence that maternal dengue antibodies are important in the development of dengue hemorrhagic fever in infants. the american journal of tropical medicine and hygiene dengue virus life cycle: viral and host factors modulating infectivity. cellular and molecular life sciences: cmls the landscape of human proteins interacting with viruses and other pathogens the importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics hepatocyte growth factor favors monocyte differentiation into regulatory interleukin (il)-10++il-12low/neg accessory cells with dendritic-cell features dengue virus (denv) antibody-dependent enhancement of infection upregulates the production of anti-inflammatory cytokines, but suppresses anti-denv free radical and pro-inflammatory cytokine production, in thp-1 cells. the journal of general virology dynamics of timelagged gene-to-metabolite networks of escherichia coli elucidated by integrative omics approach combining affinity propagation clustering and mutual information network to investigate key genes in fibroid. experimental and therapeutic medicine recursive random forest algorithm for constructing multilayered hierarchical gene regulatory networks that govern biological pathways. plos one identifying key genes in glaucoma based on a benchmarked dataset and the gene regulatory network. 
experimental and therapeutic medicine microarray background correction: maximum likelihood estimation for the normal-exponential convolution bioconductor: open software development for computational biology and bioinformatics bioinformatics and computational biology solutions using r and bioconductor an introduction to applied multivariate analysis with r controlling the false discovery rate: a practical and powerful approach to multiple testing improved quality control processing of peptide-centric lc-ms proteomics data combined statistical analyses of peptide intensities and peptide occurrences improves identification of significant peptides from ms-based proteomics data a statistical selection strategy for normalization procedures in lc-ms proteomics experiments through dataset-dependent ranking of normalization scaling factors bayesian proteoform modeling improves protein quantification of global proteomic measurements. molecular & cellular proteomics: mcp an integrated systems biology approach identifies trim25 as a key determinant of breast cancer metastasis investigating key genes associated with ovarian cancer by integrating affinity propagation clustering and mutual information network analysis. european review for medical and pharmacological sciences systematic analysis reveals a lncrna-mrna co-expression network associated with platinum resistance in high-grade serous ovarian cancer inferring regulatory networks from expression data using tree-based methods network analysis of transcriptomics expands regulatory landscapes in synechococcus sp. pcc 7002. nucleic acids research key: cord-029277-mjpwkm2u authors: elboher, yizhak yisrael; gottschlich, justin; katz, guy title: an abstraction-based framework for neural network verification date: 2020-06-13 journal: computer aided verification doi: 10.1007/978-3-030-53288-8_3 sha: doc_id: 29277 cord_uid: mjpwkm2u deep neural networks are increasingly being used as controllers for safety-critical systems. because neural networks are opaque, certifying their correctness is a significant challenge. to address this issue, several neural network verification approaches have recently been proposed. however, these approaches afford limited scalability, and applying them to large networks can be challenging. in this paper, we propose a framework that can enhance neural network verification techniques by using over-approximation to reduce the size of the network—thus making it more amenable to verification. we perform the approximation such that if the property holds for the smaller (abstract) network, it holds for the original as well. the over-approximation may be too coarse, in which case the underlying verification tool might return a spurious counterexample. under such conditions, we perform counterexample-guided refinement to adjust the approximation, and then repeat the process. our approach is orthogonal to, and can be integrated with, many existing verification techniques. for evaluation purposes, we integrate it with the recently proposed marabou framework, and observe a significant improvement in marabou’s performance. our experiments demonstrate the great potential of our approach for verifying larger neural networks. machine programming (mp), the automatic generation of software, is showing early signs of fundamentally transforming the way software is developed [15] . a key ingredient employed by mp is the deep neural network (dnn), which has emerged as an effective means to semi-autonomously implement many complex software systems. 
dnns are artifacts produced by machine learning: a user provides examples of how a system should behave, and a machine learning algorithm generalizes these examples into a dnn capable of correctly handling inputs that it had not seen before. systems with dnn components have obtained unprecedented results in fields such as image recognition [24] , game playing [33] , natural language processing [16] , computer networks [28] , and many others, often surpassing the results obtained by similar systems that have been carefully handcrafted. it seems evident that this trend will increase and intensify, and that dnn components will be deployed in various safety-critical systems [3, 19] . dnns are appealing in that (in some cases) they are easier to create than handcrafted software, while still achieving excellent results. however, their usage also raises a challenge when it comes to certification. undesired behavior has been observed in many state-of-the-art dnns. for example, in many cases slight perturbations to correctly handled inputs can cause severe errors [26, 35] . because many practices for improving the reliability of hand-crafted code have yet to be successfully applied to dnns (e.g., code reviews, coding guidelines, etc.), it remains unclear how to overcome the opacity of dnns, which may limit our ability to certify them before they are deployed. to mitigate this, the formal methods community has begun developing techniques for the formal verification of dnns (e.g., [10, 17, 20, 37] ). these techniques can automatically prove that a dnn always satisfies a prescribed property. unfortunately, the dnn verification problem is computationally difficult (e.g., np-complete, even for simple specifications and networks [20] ), and becomes exponentially more difficult as network sizes increase. thus, despite recent advances in dnn verification techniques, network sizes remain a severely limiting factor. in this work, we propose a technique by which the scalability of many existing verification techniques can be significantly increased. the idea is to apply the well-established notion of abstraction and refinement [6] : replace a network n that is to be verified with a much smaller, abstract network,n , and then verify thisn . becausen is smaller it can be verified more efficiently; and it is constructed in such a way that if it satisfies the specification, the original network n also satisfies it. in the case thatn does not satisfy the specification, the verification procedure provides a counterexample x. this x may be a true counterexample demonstrating that the original network n violates the specification, or it may be spurious. if x is spurious, the networkn is refined to make it more accurate (and slightly larger), and then the process is repeated. a particularly useful variant of this approach is to use the spurious x to guide the refinement process, so that the refinement step rules out x as a counterexample. this variant, known as counterexample-guided abstraction refinement (cegar) [6] , has been successfully applied in many verification contexts. as part of our technique we propose a method for abstracting and refining neural networks. our basic abstraction step merges two neurons into one, thus reducing the overall number of neurons by one. this basic step can be repeated numerous times, significantly reducing the network size. 
conversely, refinement is performed by splitting a previously merged neuron in two, increasing the network size but making it more closely resemble the original. a key point is that not all pairs of neurons can be merged, as this could result in a network that is smaller but is not an over-approximation of the original. we resolve this by first transforming the original network into an equivalent network where each node belongs to one of four classes, determined by its edge weights and its effect on the network's output; merging neurons from the same class can then be done safely. the actual choice of which neurons to merge or split is performed heuristically. we propose and discuss several possible heuristics. for evaluation purposes, we implemented our approach as a python framework that wraps the marabou verification tool [22] . we then used our framework to verify properties of the airborne collision avoidance system (acas xu) set of benchmarks [20] . our results strongly demonstrate the potential usefulness of abstraction in enhancing existing verification schemes: specifically, in most cases the abstraction-enhanced marabou significantly outperformed the original. further, in most cases the properties in question could indeed be shown to hold or not hold for the original dnn by verifying a small, abstract version thereof. to summarize, our contributions are: (i) we propose a general framework for over-approximating and refining dnns; (ii) we propose several heuristics for abstraction and refinement, to be used within our general framework; and (iii) we provide an implementation of our technique that integrates with the marabou verification tool and use it for evaluation. our code is available online [9] . the rest of this paper is organized as follows. in sect. 2, we provide a brief background on neural networks and their verification. in sect. 3, we describe our general framework for abstracting an refining dnns. in sect. 4, we discuss how to apply these abstraction and refinement steps as part of a cegar procedure, followed by an evaluation in sect. 5. in sect. 6, we discuss related work, and we conclude in sect. 7. a neural network consists of an input layer, an output layer, and one or more intermediate layers called hidden layers. each layer is a collection of nodes, called neurons. each neuron is connected to other neurons by one or more directed edges. in a feedforward neural network, the neurons in the first layer receive input data that sets their initial values. the remaining neurons calculate their values using the weighted values of the neurons that they are connected to through edges from the preceding layer (see fig. 1 ). the output layer provides the resulting value of the dnn for a given input. there are many types of dnns, which may differ in the way their neuron values are computed. typically, a neuron is evaluated by first computing a weighted sum of the preceding layer's neuron values according to the edge weights, and then applying an activation function to this weighted sum [13] . we focus here on the rectified linear unit (relu) activation function [29] , given as relu(x) = max (0, x). thus, if the weighted sum computation yields a positive value, it is kept; and otherwise, it is replaced by zero. more formally, given a dnn n , we use n to denote the number of layers of n . we denote the number of nodes of layer i by s i . layers 1 and n are the input and output layers, respectively. layers 2, . . . , n − 1 are the hidden layers. 
we denote the value of the j-th node of layer i by v i,j , and denote the column vector of layer i's values by v i . evaluating n is performed by calculating v n for a given input assignment v 1 . this is done by sequentially computing v i for i = 2, 3, . . . , n, each time using the values of v i−1 to compute weighted sums, and then applying the relu activation functions. specifically, layer i (for i > 1) is associated with a weight matrix w i of size s i × s i−1 and a bias vector b i of size s i . if i is a hidden layer, its values are given by v i = relu(w i v i−1 + b i ), where the relus are applied element-wise; and the output layer is given by v n = w n v n−1 + b n (relus are not applied). without loss of generality, in the rest of the paper we assume that all bias values are 0, and can be ignored. this rule is applied repeatedly, once for each layer, until v n is eventually computed. we will sometimes use the notation w(v i,j , v i+1,k ) to refer to the entry of w i+1 that represents the weight of the edge between neuron j of layer i and neuron k of layer i + 1. we will also refer to such an edge as an outgoing edge for v i,j , and as an incoming edge for v i+1,k . as part of our abstraction framework, we will sometimes need to consider a suffix of a dnn, in which the first layers of the dnn are omitted. for 1 < i < n, we use n [i] to denote the dnn comprised of layers i, i + 1, . . . , n of the original network. the sizes and weights of the remaining layers are unchanged, and layer i of n is treated as the input layer of n [i] . figure 2 depicts a small neural network. the network has n = 3 layers, of sizes s 1 = 1, s 2 = 2 and s 3 = 1. its weights are w(v 1,1 , v 2,1 ) = 1, w(v 1,1 , v 2,2 ) = −1, w(v 2,1 , v 3,1 ) = 1 and w(v 2,2 , v 3,1 ) = 2. for input v 1,1 = 3, node v 2,1 evaluates to 3 and node v 2,2 evaluates to 0, due to the relu activation function. the output node v 3,1 then evaluates to 3. dnn verification amounts to answering the following question: given a dnn n , which maps input vector x to output vector y, and predicates p and q, does there exist an input x 0 such that p (x 0 ) and q(n (x 0 )) both hold? in other words, the verification process determines whether there exists a particular input that meets the input criterion p and that is mapped to an output that meets the output criterion q. we refer to n, p, q as the verification query. as is usual in verification, q represents the negation of the desired property. thus, if the query is unsatisfiable (unsat), the property holds; and if it is satisfiable (sat), then x 0 constitutes a counterexample to the property in question. different verification approaches may differ in (i) the kinds of neural networks they allow (specifically, the kinds of activation functions in use); (ii) the kinds of input properties; and (iii) the kinds of output properties. for simplicity, we focus on networks that employ the relu activation function. in addition, our input properties will be conjunctions of linear constraints on the input values. finally, we will assume that our networks have a single output node y, and that the output property is y > c for a given constant c. we stress that these restrictions are for the sake of simplicity. many properties of interest, including those with arbitrary boolean structure and involving multiple neurons, can be reduced into the above single-output setting by adding a few neurons that encode the boolean structure [20, 32] ; see fig. 3 for an example. the number of neurons to be added is typically negligible when compared to the size of the dnn.
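as a brief aside, the evaluation rule above can be made concrete with a minimal python sketch (an illustration, not the authors' implementation), applied to the toy network of fig. 2; the weights and the input value 3 are taken from the text, and the zero biases follow the simplifying assumption stated above.

```python
# forward evaluation of a feedforward relu network:
# v_i = relu(W_i v_{i-1} + b_i) for hidden layers, no relu on the output layer
import numpy as np

def evaluate(weights, biases, x):
    """weights[i], biases[i] map layer i to layer i+1 (0-indexed)."""
    v = np.asarray(x, dtype=float)
    last = len(weights) - 1
    for i, (W, b) in enumerate(zip(weights, biases)):
        v = W @ v + b
        if i < last:                  # relu only on hidden layers
            v = np.maximum(v, 0.0)
    return v

# the toy network of fig. 2: one input, two hidden relu nodes, one output
W2 = np.array([[1.0], [-1.0]])        # w(v11,v21)=1, w(v11,v22)=-1
W3 = np.array([[1.0, 2.0]])           # w(v21,v31)=1, w(v22,v31)=2
print(evaluate([W2, W3], [np.zeros(2), np.zeros(1)], [3.0]))   # -> [3.]
```

for input 3 the hidden layer evaluates to (3, 0) after the relu and the output to 3, matching the walkthrough of fig. 2.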
in particular, this is true for the acas xu family of benchmarks [20] , and also for adversarial robustness queries that use the l ∞ or the l 1 norm as a distance metric [5, 14, 21] . additionally, other piecewise-linear activation functions, such as max-pooling layers, can also be encoded using relus [5] . several techniques have been proposed for solving the aforementioned verification problem in recent years (sect. 6 includes a brief overview). our abstraction technique is designed to be compatible with most of these techniques, by simplifying the network being verified, as we describe next. because the complexity of verifying a neural network is strongly connected to its size [20] , our goal is to transform a verification query ϕ 1 = n, p, q into query ϕ 2 = n , p, q , such that the abstract networkn is significantly smaller than n (notice that properties p and q remain unchanged). we will construct n so that it is an over-approximation of n , meaning that if ϕ 2 is unsat then ϕ 1 is also unsat. more specifically, since our dnns have a single output, we can regard n (x) andn (x) as real values for every input x. to guarantee that ϕ 2 over-approximates ϕ 1 , we will make sure that for every x, n (x) ≤n (x); and fig. 3 . reducing a complex property to the y > 0 form. for the network on the left hand side, suppose we wish to examine the property y2 > y1 ∨ y2 > y3, which is a property that involves multiple outputs and includes a disjunction. we do this (right hand side network) by adding two neurons, t1 and t2, such that t1 = relu(y2 − y1) and t2 = relu(y2 − y3). thus, t1 > 0 if and only if the first disjunct, y2 > y1, holds; and t2 > 0 if and only if the second disjunct, y2 > y3, holds. finally, we add a neuron z1 such that z1 = t1 + t2. it holds that z1 > 0 if and only if t1 > 0 ∨ t2 > 0. thus, we have reduced the complex property into an equivalent property in the desired form. for all x and so ϕ 1 is also unsat. we now propose a framework for generating variousn s with this property. we seek to define an abstraction operator that removes a single neuron from the network, by merging it with another neuron. to do this, we will first transform n into an equivalent network, whose neurons have properties that will facilitate their merging. equivalent here means that for every input vector, both networks produce the exact same output. first, each hidden neuron v i,j of our transformed network will be classified as either a pos neuron or a neg neuron. a neuron is pos if all the weights on its outgoing edges are positive, and is neg if all those weights are negative. second, orthogonally to the pos/neg classification, each hidden neuron will also be classified as either an inc neuron or a dec neuron. increases the output. we first describe this transformation (an illustration of which appears in fig. 4 ), and later we explain how it fits into our abstraction framework. our first step is to transform n into a new network, n , in which every hidden neuron is classified as pos or neg. this transformation is done by replacing each hidden neuron v ij with two neurons, v + i,j and v − i,j , which are respectively pos and neg. both v + i,j an v − i,j retain a copy of all incoming edges of the original v i,j ; however, v + i,j retains just the outgoing edges with positive weights, and v − i,j retains just those with negative weights. outgoing edges with negative weights are removed from v + i,j by setting their weights to 0, and the same is done for outgoing edges with positive weights for v − i,j . 
formally, for every neuron v i−1,p , where w represents the weights in the new network n . also, for every neuron v i+1,q fig. 4 ). this operation is performed once for every hidden neuron of n , resulting in a network n that is roughly double the size of n . observe that n is indeed equivalent to n , i.e. their outputs are always identical. fig. 4 . classifying neurons as pos/neg and inc/dec. in the initial network (left), the neurons of the second hidden layer are already classified: + and − superscripts indicate pos and neg neurons, respectively; the i superscript and green background indicate inc, and the d superscript and red background indicate dec. classifying node v1,1 is done by first splitting it into two nodes v + 1,1 and v − 1,1 (middle). both nodes have identical incoming edges, but the outgoing edges of v1,1 are partitioned between them, according to the sign of each edge's weight. in the last network (right), v + 1,1 is split once more, into an inc node with outgoing edges only to other inc nodes, and a dec node with outgoing edges only to other dec nodes. node v1,1 is thus transformed into three nodes, each of which can finally be classified as inc or dec. notice that in the worst case, each node is split into four nodes, although for v1,1 three nodes were enough. our second step is to alter n further, into a new network n , where every hidden neuron is either inc or dec (in addition to already being pos or neg). generating n from n is performed by traversing the layers of n backwards, each time handling a single layer and possibly doubling its number of neurons: -initial step: the output layer has a single neuron, y. this neuron is an inc node, because increasing its value will increase the network's output value. -iterative step: observe layer i, and suppose the nodes of layer i + 1 have already been partitioned into inc and dec nodes. observe a neuron v + i,j in layer i which is marked pos (the case for neg is symmetrical). we replace v + i,j with two neurons v +,i i,j and v +,d i,j , which are inc and dec, respectively. both new neurons retain a copy of all incoming edges of v + i,j ; however, v +,i i,j retains only outgoing edges that lead to inc nodes, and v +,d i,j retains only outgoing edges that lead to dec nodes. thus, for every v i−1,p and v i+1,q , where w represents the weights in the new network n . we perform this step for each neuron in layer i, resulting in neurons that are each classified as either inc or dec. to understand the intuition behind this classification, recall that by our assumption all hidden nodes use the relu activation function, which is monotonically increasing. because v + i,j is pos, all its outgoing edges have positive weights, and so if its assignment was to increase (decrease), the assignments of all nodes to which it is connected in the following layer would also increase (decrease). thus, we split v + i,j in two, and make sure one copy, v +,i i,j , is only connected to nodes that need to increase (inc nodes), and that the other copy, v +,d i,j , is only connected to nodes that need to decrease (dec nodes). this ensures that v +,i i,j is itself inc, remain pos nodes, because their outgoing edges all have positive weights. when this procedure terminates, n is equivalent to n , and so also to n ; and n is roughly double the size of n , and roughly four times the size of n . both transformation steps are only performed for hidden neurons, whereas the input and output neurons remain unchanged. 
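as an illustrative aside, the first transformation step (the pos/neg split) can be sketched in a few lines of python; this is not the paper's code, and it represents a single hidden layer by its incoming and outgoing weight matrices. the second step (the inc/dec split) is analogous, partitioning outgoing edges by the label of the destination node instead of by sign, working backwards from the output layer.

```python
# split every hidden neuron into a pos copy (keeps outgoing edges with
# positive weights) and a neg copy (keeps those with negative weights);
# both copies duplicate all incoming edges.
# W_in: shape (hidden, inputs); W_out: shape (outputs, hidden)
import numpy as np

def split_pos_neg(W_in, W_out):
    new_in, new_out_cols, labels = [], [], []
    for j in range(W_in.shape[0]):
        for sign, label in ((+1, "pos"), (-1, "neg")):
            kept = np.where(np.sign(W_out[:, j]) == sign, W_out[:, j], 0.0)
            if not np.any(kept):          # this copy would be disconnected
                continue
            new_in.append(W_in[j, :])     # incoming edges are duplicated
            new_out_cols.append(kept)     # outgoing edges are partitioned
            labels.append(label)
    return np.vstack(new_in), np.column_stack(new_out_cols), labels

# the layer of fig. 2: both hidden neurons only have positive outgoing edges,
# so no neuron actually needs two copies here
print(split_pos_neg(np.array([[1.0], [-1.0]]), np.array([[1.0, 2.0]])))
```

omitting a copy whose outgoing edges would all be zero matches the observation in the fig. 4 caption that three nodes, rather than four, sufficed for v1,1.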
this is summarized by the following lemma: lemma 1. any dnn n can be transformed into an equivalent network n where each hidden neuron is pos or neg, and also inc or dec, by increasing its number of neurons by a factor of at most 4. using lemma 1, we can assume without loss of generality that the dnn nodes in our input query ϕ 1 are each marked as pos/neg and as inc/dec. we are now ready to construct the over-approximation networkn . we do this by specifying an abstract operator that merges a pair of neurons in the network (thus reducing network size by one), and can be applied multiple times. the only restrictions are that the two neurons being merged need to be from the same hidden layer, and must share the same pos/neg and inc/dec attributes. consequently, applying abstract to saturation will result in a network with at most 4 neurons in each hidden layer, which over-approximates the original network. this, of course, would be an immense reduction in the number of neurons for most reasonable input networks. the abstract operator's behavior depends on the attributes of the neurons being merged. for simplicity, we will focus on the pos, inc case. let v i,j , v i,k be two hidden neurons of layer i, both classified as pos, inc . because layer i is hidden, we know that layers i + 1 and i − 1 are defined. let v i−1,p and v i+1,q denote arbitrary neurons in the preceding and succeeding layer, respectively. we construct a networkn that is identical to n , except that: (i) nodes v i,j and v i,k are removed and replaced with a new single node, v i,t ; and (ii) all edges that touched nodes v i,j or v i,k are removed, and other edges are untouched. finally, we add new incoming and outgoing edges for the new node v i,t as follows: wherew represents the weights in the new networkn . an illustrative example appears in fig. 5 . intuitively, this definition of abstract seeks to ensure that the new node v i,t always contributes more to the network's output than the two original nodes v i,j and v i,k -so that the new network produces a larger output than the original for every input. by the way we defined the incoming edges of the new neuron v i,t , we are guaranteed that for every input x passed into both n andn , the value assigned to v i,t inn is greater than the values assigned to both v i,j and v i,k in the original network. this works to our advantage, because v i,j and v i,k were both inc-so increasing their values increases the output value. by our definition of the outgoing edges, the values of any inc nodes in layer i + 1 increase inn compared to n , and those of any dec nodes decrease. by definition, this means that the network's overall output increases. the abstraction operation for the neg, inc case is identical to the one described above. for the remaining two cases, i.e. pos, dec and neg, dec , the max operator in the definition is replaced with a min operator. the next lemma (proof omitted due to lack of space) justifies the use of our abstraction step, and can be applied once per each application of abstract: lemma 2. letn be derived from n by a single application of abstract. for every x, it holds thatn (x) ≥ n (x). the aforementioned abstract operator reduces network size by merging neurons, but at the cost of accuracy: whereas for some input x 0 the original network returns n (x 0 ) = 3, the over-approximation networkn created by abstract might returnn (x 0 ) = 5. 
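the merge rule just described can also be stated in a few lines of python; this is an illustrative sketch rather than the authors' code, and the outgoing weights 3 and 5 used in the example are hypothetical (the fig. 5 caption only states that their sum is 8).

```python
# merge two hidden neurons that share the same pos/neg and inc/dec labels:
# incoming weights are combined with max for inc neurons (min for dec),
# outgoing weights are summed.

def merge_pair(in_j, in_k, out_j, out_k, inc=True):
    """in_*: dicts {predecessor: weight}; out_*: dicts {successor: weight}."""
    combine = max if inc else min
    merged_in = {p: combine(in_j[p], in_k[p]) for p in in_j}
    merged_out = {s: out_j[s] + out_k[s] for s in out_j}
    return merged_in, merged_out

# incoming weights of v1 and v2 from fig. 5 (x1: 1 and 4, x2: -2 and -1);
# the individual outgoing weights 3 and 5 are made up, only their sum is known
merged_in, merged_out = merge_pair(
    {"x1": 1.0, "x2": -2.0}, {"x1": 4.0, "x2": -1.0},
    {"y": 3.0}, {"y": 5.0}, inc=True)
print(merged_in)    # {'x1': 4.0, 'x2': -1.0} -- matches the fig. 5 caption
print(merged_out)   # {'y': 8.0}
```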
if our goal is prove that it is never the case that n (x) > 10, this over-approximation may be adequate: we can prove that always and v3 are separate. next (middle), abstract merges v1 and v2 into a single node. for the edge between x1 and the new abstract node we pick the weight 4, which is the maximal weight among edges from x1 to v1 and v2. likewise, the edge between x2 and the abstract node has weight −1. the outgoing edge from the abstract node to y has weight 8, which is the sum of the weights of edges from v1 and v2 to y. next, abstract is applied again to merge v3 with the abstract node, and the weights are adjusted accordingly (right). with every abstraction, the value of y (given as a formula at the bottom of each dnn, where r represents the relu operator) increases. for example, to see that , which holds because relu is a monotonically increasing function and x1 and x2 are non-negative (being, themselves, the output of relu nodes). n (x) ≤ 10, and this will be enough. however, if our goal is to prove that it is never the case that n (x) > 4, the over-approximation is inadequate: it is possible that the property holds for n , but becausen (x 0 ) = 5 > 4, our verification procedure will return x 0 as a spurious counterexample (a counterexample for n that is not a counterexample for n ). in order to handle this situation, we define a refinement operator, refine, that is the inverse of abstract: it trans-formsn into yet another over-approximation,n , with the property that for every x, n (x) ≤n (x) ≤n (x). ifn (x 0 ) = 3.5, it might be a suitable overapproximation for showing that never n (x) > 4. in this section we define the refine operator, and in sect. 4 we explain how to use abstract and refine as part of a cegar-based verification scheme. recall that abstract merges together a couple of neurons that share the same attributes. after a series of applications of abstract, each hidden layer i of the resulting network can be regarded as a partitioning of hidden layer i of the original network, where each partition contains original, concrete neurons that share the same attributes. in the abstract network, each partition is represented by a single, abstract neuron. the weights on the incoming and outgoing edges of this abstract neuron are determined according to the definition of the abstract operator. for example, in the case of an abstract neuronv that represents a set of concrete neurons {v 1 , . . . , v n } all with attributes pos, inc , the weight of each incoming edge tov is given bȳ where u represents a neuron that has not been abstracted yet, and w is the weight function of the original network. the key point here is that the order of abstract operations that merged v 1 , . . . , v n does not matter-but rather, only the fact that they are now grouped together determines the abstract network's weights. the following corollary, which is a direct result of lemma 2, establishes this connection between sequences of abstract applications and partitions: corollary 1. let n be a dnn where each hidden neuron is labeled as pos/neg and inc/dec, and let p be a partitioning of the hidden neurons of n , that only groups together hidden neurons from the same layer that share the same labels. then n and p give rise to an abstract neural networkn , which is obtained by performing a series of abstract operations that group together neurons according to the partitions of p. thisn is an over-approximation of n . we now define a refine operation that is, in a sense, the inverse of abstract. 
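before the formal definition, a brief aside: corollary 1 suggests an equivalent, partition-based way to compute the abstract weights, which is also how refinement can be implemented in practice (split one class of the partition and recompute the affected edges). the sketch below is an illustration, not the paper's code; it assumes, following the abstract operator above, that incoming weights of an inc class are combined with max (min for dec) and that outgoing weights are summed over the class, and the outgoing weights in the example are again hypothetical.

```python
def abstract_from_partition(in_w, out_w, partition, inc_flags):
    """in_w[j], out_w[j]: incoming/outgoing weight dicts of concrete neuron j.
    partition: list of classes (lists of indices); inc_flags[c]: class c is inc."""
    abs_in, abs_out = [], []
    for c, members in enumerate(partition):
        combine = max if inc_flags[c] else min
        abs_in.append({p: combine(in_w[j][p] for j in members)
                       for p in in_w[members[0]]})
        abs_out.append({s: sum(out_w[j][s] for j in members)
                        for s in out_w[members[0]]})
    return abs_in, abs_out

# merging v1 and v2 of fig. 5 (outgoing weights 3 and 5 are made up):
in_w = [{"x1": 1.0, "x2": -2.0}, {"x1": 4.0, "x2": -1.0}]
out_w = [{"y": 3.0}, {"y": 5.0}]
print(abstract_from_partition(in_w, out_w, [[0, 1]], [True]))
# ([{'x1': 4.0, 'x2': -1.0}], [{'y': 8.0}]) -- same edges as the merge above
# refining back to the finest partition recovers the original weights:
print(abstract_from_partition(in_w, out_w, [[0], [1]], [True, True]))
```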
refine takes as input an abstract dnn n̄ that was generated from n via a sequence of abstract operations, and splits a neuron of n̄ in two. formally, the operator receives the original network n , the partitioning p, and a finer partition p′ that is obtained from p by splitting a single class in two. the operator then returns a new abstract network, n̄′, that is the abstraction of n according to p′. due to corollary 1, and because the n̄′ returned by refine corresponds to a partition p′ of the hidden neurons of n , it is straightforward to show that n̄′ is indeed an over-approximation of n . the other useful property that we require is the following (lemma 3): for every input x, n̄(x) ≥ n̄′(x) ≥ n (x). the second part of the inequality, n̄′(x) ≥ n (x), holds because n̄′ is an over-approximation of n (corollary 1). the first part of the inequality, n̄(x) ≥ n̄′(x), follows from the fact that n̄(x) can be obtained from n̄′(x) by a single application of abstract. in practice, in order to support the refinement of an abstract dnn, we maintain the current partitioning, i.e. the mapping from concrete neurons to the abstract neurons that represent them. then, when an abstract neuron is selected for refinement (according to some heuristic, such as the one we propose in sect. 4), we adjust the mapping and use it to compute the weights of the edges that touch the affected neuron. in sect. 3 we defined the abstract operator that reduces network size at the cost of reducing network accuracy, and its inverse refine operator that increases network size and restores accuracy. together with a black-box verification procedure verify that can dispatch queries of the form ϕ = n, p, q , these components now allow us to design an abstraction-refinement algorithm for dnn verification, given as algorithm 1 (we assume that all hidden neurons in the input network have already been marked pos/neg and inc/dec).

algorithm 1:
 1: use abstract to generate an initial over-approximation n̄ of n
 2: if verify(n̄, p, q) is unsat then
 3:   return unsat
 4: else
 5:   extract counterexample c
 6:   if c is a counterexample for n then
 7:     return sat
 8:   else
 9:     use refine to refine n̄ into n̄′
 10:    n̄ ← n̄′
 11:    goto step 2
 12:  end if
 13: end if

because n̄ is obtained via applications of abstract and refine, the soundness of the underlying verify procedure, together with lemmas 2 and 3, guarantees the soundness of algorithm 1. further, the algorithm always terminates: this is the case because all the abstract steps are performed first, followed by a sequence of refine steps. because no additional abstract operations are performed beyond step 1, after finitely many refine steps n̄ will become identical to n , at which point no spurious counterexample will be found, and the algorithm will terminate with either sat or unsat. of course, termination is only guaranteed when the underlying verify procedure is guaranteed to terminate. there are two steps in the algorithm that we intentionally left ambiguous: step 1, where the initial over-approximation is computed, and step 9, where the current abstraction is refined due to the discovery of a spurious counterexample. the motivation was to make algorithm 1 general, and allow it to be customized by plugging in different heuristics for performing steps 1 and 9, which may depend on the problem at hand. below we propose a few such heuristics. the most naïve way to generate the initial abstraction is to apply the abstract operator to saturation. as previously discussed, abstract can merge together any pair of hidden neurons from a given layer that share the same attributes.
since there are four possible attribute combinations, this will result in each hidden layer of the network having four neurons or fewer. this method, which we refer to as abstraction to saturation, produces the smallest abstract networks possible. the downside is that, in some case, these networks might be too coarse, and might require multiple rounds of refinement before a sat or unsat answer can be reached. a different heuristic for producing abstractions that may require fewer refinement steps is as follows. first, we select a finite set of input points, x = {x 1 , . . . , x n }, all of which satisfy the input property p . these points can be generated randomly, or according to some coverage criterion of the input space. the points of x are then used as indicators in estimating when the abstraction has become too coarse: after every abstraction step, we check whether the property still holds for x 1 , . . . , x n , and stop abstracting if this is not the case. the exact technique, which we refer to as indicator-guided abstraction, appears in algorithm 2, which is used to perform step 1 of algorithm 1. another point that is addressed by algorithm 2, besides how many rounds of abstraction should be performed, is which pair of neurons should be merged in every application of abstract. this, too, is determined heuristically. since any pair of neurons that we pick will result in the same reduction in network size, our strategy is to prefer neurons that will result in a more accurate approximation. inaccuracies are caused by the max and min operators within the abstract operator: e.g., in the case of max , every pair of incoming edges with weights a, b are replaced by a single edge with weight max (a, b). our strategy here is to merge the pair of neurons for which the maximal value of |a − b| (over all incoming edges with weights a and b) is minimal. intuitively, this leads to max (a, b) being close to both a and b-which, in turn, leads to an over-approximation network that is smaller than the original, but is close to it weight-wise. we point out that although repeatedly exploring all pairs (line 4) may appear costly, in our experiments the time cost of this step was negligible compared to that of the verification queries that followed. still, if this step happens to become a bottleneck, it is possible to adjust the algorithm to heuristically sample just some of the pairs, and pick the best pair among those considered-without harming the algorithm's soundness. as a small example, consider the network depicted on the left hand side of fig. 5 . this network has three pairs of neurons that can be merged using abstract (any subset of {v 1 , v 2 , v 3 }). consider the pair v 1 , v 2 : the maximal value of |a − b| for these neurons is max (|1 − 4)|, |(−2) − (−1)|) = 3. for pair v 1 , v 3 , the maximal value is 1; and for pair v 2 , v 3 the maximal value is 2. according to the strategy described in algorithm 2, we would first choose to apply abstract on the pair with the minimal maximal value, i.e. on the pair v 1 , v 3 . a refinement step is performed when a spurious counterexample x has been found, indicating that the abstract network is too coarse. in other words, our abstraction steps, and specifically the max and min operators that were used to select edge weights for the abstract neurons, have resulted in the abstract network's output being too great for input x, and we now need to reduce it. 
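the pair-selection criterion of algorithm 2 can be sketched as follows (an illustration only, not the authors' code): among the mergeable pairs, prefer the one whose incoming weights differ the least, i.e. the pair minimising the maximal |a − b| over shared predecessors. v3's incoming weights in the example below (2 and −1) are inferred from the reported pair scores rather than stated explicitly in the fig. 5 caption.

```python
def pair_score(in_j, in_k):
    """in_*: dicts {predecessor: weight} of two same-class hidden neurons."""
    return max(abs(in_j[p] - in_k[p]) for p in in_j)

def best_pair(neurons):
    """neurons: dict {name: incoming-weight dict}, all sharing one class."""
    names = list(neurons)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return min(pairs, key=lambda ab: pair_score(neurons[ab[0]], neurons[ab[1]]))

# scores are 3 for (v1,v2), 1 for (v1,v3), 2 for (v2,v3), as in the text,
# so (v1, v3) is chosen for the first merge
neurons = {"v1": {"x1": 1.0, "x2": -2.0},
           "v2": {"x1": 4.0, "x2": -1.0},
           "v3": {"x1": 2.0, "x2": -1.0}}
print(best_pair(neurons))   # ('v1', 'v3')
```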
thus, our refinement strategies are aimed at applying refine in a way that will result in a significant reduction to the abstract network's output. we note that there may be multiple options for applying refine, on different nodes, such that any of them would remove the spurious counterexample x from the abstract network. in addition, it is not guaranteed that it is possible to remove x with a single application of refine, and multiple consecutive applications may be required. one heuristic approach for refinement follows the well-studied notion of counterexample-guided abstraction refinement [6] . specifically, we leverage the spurious counterexample x in order to identify a concrete neuron v, which is currently mapped into an abstract neuronv, such that splitting v away fromv might rule out counterexample x. to do this, we evaluate the original network on x and compute the value of v (we denote this value by v(x)), and then do the same forv in the abstract network (value denotedv(x)). intuitively, a neuron pair v,v for which the difference |v(x) −v(x)| is significant makes a good candidate for a refinement operation that will split v away fromv. in addition to considering v(x) andv(x), we propose to also consider the weights of the incoming edges of v andv. when these weights differ significantly, this could indicate thatv is too coarse an approximation for v, and should be refined. we argue that by combining these two criteria-edge weight difference between v andv, which is a property of the current abstraction, together with the difference between v(x) andv(x), which is a property of the specific input x, we can identify abstract neurons that have contributed significantly to x being a spurious counterexample. the refinement heuristic is formally defined in algorithm 3. the algorithm traverses the original neurons, looks for the edge weight times assignment value that has changed the most as a result of the current abstraction, and then performs refinement on the neuron at the end of that edge. as was the case with algorithm 2, if considering all possible nodes turns out to be too costly, it is possible to adjust the algorithm to explore only some of the nodes, and pick the best one among those considered-without jeopardizing the algorithm's soundness. (n,n , x) 1: bestneuron ← ⊥, m ← 0 2: for each concrete neuron vi,j of n mapped into abstract neuronv i,j ofn do 3: bestneuron ← vi,j 7: end if 8: end for 9: end for 10: use refine to split bestneuron from its abstract neuron as an example, let us use algorithm 3 to choose a refinement step for the right hand side network of fig. 5 , for a spurious counterexample x 1 , x 2 = 1, 0 . for this input, the original neurons' evaluation is v 1 = 1, v 2 = 4 and v 3 = 2, whereas the abstract neuron that represents them evaluates to 4. suppose v 1 is considered first. in the abstract network,w(x 1 ,v 1 ) = 4 andw(x 2 ,v 1 ) = −1; whereas in the original network, w(x 1 , v 1 ) = 1 and w(x 2 , v 1 ) = −2. thus, the largest value m computed for v 1 is |w( this value of m is larger than the one computed for v 2 (0) and for v 3 (4), and so v 1 is selected for the refinement step. after this step is performed, v 2 and v 3 are still mapped to a single abstract neuron, whereas v 1 is mapped to a separate neuron in the abstract network. our implementation of the abstraction-refinement framework includes modules that read a dnn in the nnet format [19] and a property to be verified, create an initial abstract dnn as described in sect. 
4, invoke a black-box verification engine, and perform refinement as described in sect. 4. the process terminates when the underlying engine returns either unsat, or an assignment that is a true counterexample for the original network. for experimentation purposes, we integrated our framework with the marabou dnn verification engine [22] . our implementation and benchmarks are publicly available online [9] . our experiments included verifying several properties of the 45 acas xu dnns for airborne collision avoidance [19, 20] . acas xu is a system designed to produce horizontal turning advisories for an unmanned aircraft (the ownship), with the purpose of preventing a collision with another nearby aircraft (the intruder ). the acas xu system receive as input sensor readings, indicating the location of the intruder relative to the ownship, the speeds of the two aircraft, and their directions (see fig. 6 ). based on these readings, it selects one of 45 dnns, to which the readings are then passed as input. the selected dnn then assigns scores to five output neurons, each representing a possible turning advisory: strong left, weak left, strong right, weak right, or clear-of-conflict (the latter indicating that it is safe to continue along the current trajectory). the neuron with the lowest score represents the selected advisory. we verified several properties of these dnns based on the list of properties that appeared in [20] -specifically focusing on properties that ensure that the dnns always advise clear-of-conflict for distant intruders, and that they are robust to (i.e., do not change their advisories in the presence of) small input perturbations. each of the acas xu dnns has 300 hidden nodes spread across 6 hidden layers, leading to 1200 neurons when the transformation from sect. 3.1 is applied. in our experiments we set out to check whether the abstraction-based approach could indeed prove properties of the acas xu networks on abstract networks that had significantly fewer neurons than the original ones. in addition, we wished to compare the proposed approaches for generating initial abstractions (the abstraction to saturation approach versus the indicator-guided abstraction described in algorithm 2), in order to identify an optimal configuration for our tool. finally, once the optimal configuration has been identified, we used it to compare our tool's performance to that of vanilla marabou. the results are described next. figure 7 depicts a comparison of the two approaches for generating initial abstractions: the abstraction to saturation scheme (x axis), and the indicatorguided abstraction scheme described in algorithm 2 (y axis). each experiment included running our tool twice on the same benchmark (network and property), with an identical configuration except for the initial abstraction being used. the plot depicts the total time (log-scale, in seconds, with a 20-h timeout) spent by marabou solving verification queries as part of the abstraction-refinement procedure. it shows that, in contrast to our intuition, abstraction to saturation almost always outperforms the indicator-guided approach. this is perhaps due to the fact that, although it might entail additional rounds of refinement, the abstraction to saturation approach tends to produce coarse verification queries that are easily solved by marabou, resulting in an overall improved performance. 
we thus conclude that, at least in the acas xu case, the abstraction to saturation approach is superior to that of indicator-guided abstraction. this experiment also confirms that properties can indeed be proved on abstract networks that are significantly smaller than the original-i.e., despite the initial 4x increase in network size due to the preprocessing phase, the final abstract network on which our abstraction-enhanced approach could solve the query was usually substantially smaller than the original network. specifically, among the abstraction to saturation experiments that terminated, the final network on which the property was shown to be sat or unsat had an average size of 268.8 nodes, compared to the original 310-a 13% reduction. because dnn verification becomes exponentially more difficult as the network size increases, this reduction is highly beneficial. next, we compared our abstraction-enhanced marabou (in abstraction to saturation mode) to the vanilla version. the plot in fig. 8 compares the total query solving time of vanilla marabou (y axis) to that of our approach (x axis). we ran the tools on 90 acas xu benchmarks (2 properties, checked on each of the 45 networks), with a 20-h timeout. we observe that the abstraction-enhanced version significantly outperforms vanilla marabou on average-often solving queries orders-of-magnitude more quickly, and timing out on fewer benchmarks. specifically, the abstraction-enhanced version solved 58 instances, versus 35 solved by marabou. further, over the instances solved by both tools, the abstractionenhanced version had a total query median runtime of 1045 s, versus 63671 s for marabou. interestingly, the average size of the abstract networks for which our tool was able to solve the query was 385 nodes-which is an increase compared to the original 310 nodes. however, the improved runtimes demonstrate that although these networks were slightly larger, they were still much easier to verify, presumably because many of the network's original neurons remained abstracted away. finally, we used our abstraction-enhanced marabou to verify adversarial robustness properties [35] . intuitively, an adversarial robustness property states that slight input perturbations cannot cause sudden spikes in the network's output. this is desirable because such sudden spikes can lead to misclassification of inputs. unlike the acas xu domain-specific properties [20] , whose formulation required input from human experts, adversarial robustness is a universal property, desirable for every dnn. consequently it is easier to formulate, and has received much attention (e.g., [2, 10, 20, 36] ). in order to formulate adversarial robustness properties for the acas xu networks, we randomly sampled the acas xu dnns to identify input points where the selected output advisory, indicated by an output neuron y i , received a much lower score than the second-best advisory, y j (recall that the advisory with the lowest score is selected). for such an input point x 0 , we then posed the verification query: does there exist a point x that is close to x 0 , but for which y j receives a lower score than y i ? or, more formally: if this query is sat then there exists an input x whose distance to x 0 is at most δ, but for which the network assigns a better (lower) score to advisory y j than to y i . however, if this query is unsat, no such point x exists. 
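the formal statement of this query was lost in extraction above; the following is a hedged python sketch of how one such local-robustness query per sampled point could be assembled. it is not the maraboupy api, and the coordinate-wise delta box (an l-infinity ball) is an assumption, since the text only requires x to be "close" to x 0; the value δ = 0.1 is the one used in the experiments described below.

```python
def robustness_query(x0, i_best, j_runner_up, delta=0.1):
    """input property: x0[k]-delta <= x[k] <= x0[k]+delta for every input k;
    output property (negation of robustness): y_j < y_i, i.e. the runner-up
    advisory overtakes the originally selected one.  unsat => robust at x0."""
    input_box = [(v - delta, v + delta) for v in x0]
    output_constraint = "y[%d] < y[%d]" % (j_runner_up, i_best)
    return {"input_bounds": input_box, "output": output_constraint}

# an acas xu input point has five coordinates; the values below are made up
print(robustness_query([0.3, -0.2, 0.1, 0.5, 0.4], i_best=0, j_runner_up=3))
```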
because we select point x 0 such that y i is initially much smaller than y j , we expect the query to be unsat for small values of δ. for each of the 45 acas xu networks, we created robustness queries for 20 distinct input points-producing a total of 900 verification queries (we arbitrarily set δ = 0.1). for each of these queries we compared the runtime of vanilla marabou to that of our abstraction-enhanced version (with a 20-h timeout). the results are depicted in fig. 9 . vanilla marabou was able to solve more instances-893 out of 900, versus 805 that the abstraction-enhanced version was able to solve. however, on the vast majority of the remaining experiments, the abstraction-enhanced version was significantly faster, with a total query median runtime of only 0.026 s versus 15.07 s in the vanilla version (over the 805 benchmarks solved by both tools). this impressive 99% improvement in performance highlights the usefulness of our approach also in the context of adversarial robustness. in addition, over the solved benchmarks, the average size of the abstract networks for which our tool was able to solve the query was 104.4 nodes, versus 310 nodes in each of the original networks-a 66% reduction in size. this reinforces our statement that, in many cases, dnns contain a great deal of unneeded neurons, which can safely be removed by the abstraction process for the purpose of verification. in recent years, multiple schemes have been proposed for the verification of neural networks. these include smt-based approaches, such as marabou [22, 23] , reluplex [20] , dlv [17] and others; approaches based on formulating the problem as a mixed integer linear programming instance (e.g., [4, 7, 8, 36] ); approaches that use sophisticated symbolic interval propagation [37] , or abstract interpretation [10] ; and others (e.g., [1, 18, 25, 27, 30, 38, 39] ). these approaches have been applied in a variety of tasks, such as measuring adversarial robustness [2, 17] , neural network simplification [11] , neural network modification [12] , and many others (e.g., [23, 34] ). our approach can be integrated with any sound and complete solver as its engine, and then applied towards any of the aforementioned tasks. incomplete solvers could also be used and might afford better performance, but this could result in our approach also becoming incomplete. some existing dnn verification techniques incorporate abstraction elements. in [31] , the authors use abstraction to over-approximate the sigmoid activation function with a collection of rectangles. if the abstract verification query they produce is unsat, then so is the original. when a spurious counterexample is found, an arbitrary refinement step is performed. the authors report limited scalability, tackling only networks with a few dozen neurons. abstraction techniques also appear in the ai2 approach [10] , but there it is the input property and reachable regions that are over-approximated, as opposed to the dnn itself. combining this kind of input-focused abstraction with our network-focused abstraction is an interesting avenue for future work. with deep neural networks becoming widespread and with their forthcoming integration into safety-critical systems, there is an urgent need for scalable techniques to verify and reason about them. however, the size of these networks poses a serious challenge. 
abstraction-based techniques can mitigate this difficulty, by replacing networks with smaller versions thereof to be verified, without compromising the soundness of the verification procedure. the abstraction-based approach we have proposed here can provide a significant reduction in network size, thus boosting the performance of existing verification technology. in the future, we plan to continue this work along several axes. first, we intend to investigate refinement heuristics that can split an abstract neuron into two arbitrary sized neurons. in addition, we will investigate abstraction schemes for networks that use additional activation functions, beyond relus. finally, we plan to make our abstraction scheme parallelizable, allowing users to use multiple worker nodes to explore different combinations of abstraction and refinement steps, hopefully leading to faster convergence. optimization and abstraction: a synergistic approach for analyzing neural network robustness measuring neural net robustness with constraints end to end learning for self-driving cars piecewise linear neural network verification: a comparative study provably minimally-distorted adversarial examples counterexample-guided abstraction refinement output range analysis for deep neural networks formal verification of piece-wise linear feed-forward neural networks an abstraction-based framework for neural network verification: proof-of-concept implementation ai2: safety and robustness certification of neural networks with abstract interpretation simplifying neural networks using formal verification minimal modifications of deep neural networks using verification deep learning deepsafe: a data-driven approach for assessing robustness of neural networks the three pillars of machine programming deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups safety verification of deep neural networks verifying recurrent neural networks using invariant inference policy compression for aircraft collision avoidance systems reluplex: an efficient smt solver for verifying deep neural networks towards proving the adversarial robustness of deep neural networks the marabou framework for verification and analysis of deep neural networks verifying deep-rl-driven systems imagenet classification with deep convolutional neural networks toward scalable verification for safety-critical deep networks adversarial examples in the physical world an approach to reachability analysis for feed-forward relu neural networks neural adaptive video streaming with pensieve rectified linear units improve restricted boltzmann machines verifying properties of binarized deep neural networks an abstraction-refinement approach to verification of artificial neural networks reachability analysis of deep neural networks with provable guarantees mastering the game of go with deep neural networks and tree search formal verification of neural network controlled autonomous systems intriguing properties of neural networks evaluating robustness of neural networks with mixed integer programming formal security analysis of neural networks using symbolic intervals parallelization techniques for verifying neural networks output reachable set estimation and verification for multilayer neural networks we thank the anonymous reviewers for their insightful comments. 
this project was partially supported by grants from the binational science foundation (2017662) and the israel science foundation (683/18). key: cord-027851-95bsoea2 authors: wang, daojuan; schøtt, thomas title: coupling between financing and innovation in a startup: embedded in networks with investors and researchers date: 2020-06-25 journal: int entrep manag j doi: 10.1007/s11365-020-00681-y sha: doc_id: 27851 cord_uid: 95bsoea2 innovation may be a basis for starting a business, and financing is typically needed for starting. innovation and financing may conceivably be negatively related, or be unrelated, or plausibly be beneficially related. these possible scenarios frame the questions: what is the coupling between innovation and financing at inception, and what is the embeddedness of coupling in networks around the entrepreneur, specifically networks with investors and researchers? these questions are addressed with a globally representative sample of entrepreneurs interviewed at inception of their business. innovation and financing are found to be decoupled, typically; less frequently to be loosely coupled, and rarely to be tightly coupled. coupling is promoted by networking with both investors and researchers, with additive effects and with a synergy effect. by ascertaining coupling and its embeddedness in networks as a way for building capability in a startup, the study contributes to empirically supported theorizing about capability building. innovation may be the basis for starting a business. an entrepreneur may innovate something that is novel to some potential customers, or may use a technique that has not been used earlier, or may be doing something that only few other businesses are doing. in these ways, innovation may be a foundation for the startup. much scholarship studies innovation. a stream of research focuses on innovation as a basis for starting a business (e.g. weiblen and chesbrough 2015; colombelli et al. 2016; colombelli and quatraro 2017) . financing may be needed for starting. indeed, typically, a startup requires some financing. an entrepreneur may need little financing, but occasionally requires much financing. typically, an entrepreneur has some funds of their own for investing in the startup. frequently, an entrepreneur also requires some funding from other sources, such as family and friends. an entrepreneur often borrows from a loan organization, and sometimes obtains venture capital. a stream of research focuses on financing of startups (e.g. van osnabrugge and robinson 2000; hsu 2004; mason and stark 2004; croce et al. 2017) . innovation and financing may be unrelated in a startup.
innovation may be accomplished before starting a business, or may be completed on a shoestring budget, and in these situations innovation and financing are unrelated. furthermore, potential investors may shy away from the riskiness of supporting innovation and prefer to invest in routine production, and also in this situation there is no relation between financing and innovation in a startup. in the extreme, if innovation is pursued without financing, and if a copy-cat startup attracts financing, then innovation and financing are even negatively related. conversely, innovation may require financing and the entrepreneur may obtain it, and an investor may prefer to finance an innovative rather than a routine startup. in such a situation, innovation and financing will go hand in hand and be coupled beneficially. these scenarios (a negative coupling between innovation and financing, no coupling between them, or a beneficial coupling between them) represent a gap in our understanding of startups. this gap frames our first question for research: what is the coupling between innovation and financing at inception? a second issue is the sources of their coupling. an entrepreneur has networks that channel, enable and constrain the endeavor. an entrepreneur may seek financing by networking with formal and informal investors. an entrepreneur may pursue innovation by networking with researchers and inventors. and an entrepreneur may couple financing and innovation by networking with both investors and researchers. investors and researchers are not substitutable partners. rather, investors and researchers provide complementary resources, and there may even be a synergy between their inputs into a startup. such embeddedness of coupling is another gap in our understanding of new ventures. this gap frames our second question: what is the embeddedness of coupling in the networks around an entrepreneur, specifically the networks with investors and researchers? by addressing these two gaps in understanding a startup, this study makes several specific contributions. first, coupling of innovation and financing at inception is a way of building capability in the new business, and the study thus contributes to our understanding of capability building. second, this study fills a gap by investigating whether and how these two important elements, innovation and financing, are channeled, enabled and constrained by different networks, especially networks with the investors and researchers, around the entrepreneur. third, by focusing on the business founding stage, we overcome the methodological problems caused by hindsight bias, memory decay, and survivorship bias, which pervade retrospective studies. the following first reviews the theoretical background as a basis for developing hypotheses, then describes the research design, reports analyses, and concludes by relating findings to the literature of entrepreneurship, venture capital and investment. the phenomenon of financing and innovation being interrelated is conceptualized as a coupling. the concept of coupling is classical in studies of organizations (weick 1976; orton and weick 1990) . elements of an organization have a coupling, in that they tend to occur together and to be connected, intertwined, reciprocal, reinforcing, and mutually sustaining within the organization. the coupling has strength; it may be loose, in that the elements are rather independent of one another, or it may be tight, in that the elements are highly interdependent. 
loose coupling is often found in an educational organization, whereas tight coupling is more frequent in a firm (ibid.). here we apply the concept of coupling to the intertwining between two elements of a startup: financing and innovation. coupling is tight if the startup is financing and innovating simultaneously. coupling is loose if the startup is pursuing one of the two, but hardly the other. the elements may be termed decoupled if the startup is pursuing one of the two, but not the other at all. finally, of course, a startup may be without ambition of any kind; not pursuing any of the endeavors. an established business may benefit from self-reinforcing dynamics between innovation and financing. accomplished innovation may attract investors, and, reciprocally, financing is a means for innovation. for a nascent entrepreneur, however, the dynamics between financing and innovation are quite different. the entrepreneur is in the process of starting. there cannot yet be any reciprocal interaction between market feedback and the entrepreneur's learning and capability development. no market returns from sales can be employed to strengthen innovation capability. moreover, the business opportunity pursued by the entrepreneur is still up for evaluation and modification; the business is still an idea. nevertheless, at this formative stage, there are interactions between the entrepreneur and stakeholders, e.g., potential investors, inventors, and incubators. through this interaction, the entrepreneur modifies ideas and visions for the business and anticipates feasibility, outcomes and attractiveness (lichtenstein et al. 2006) . in this realm, the entrepreneur shapes strategic aspirations, and confidence in achieving specific strategic goals. ambition for financing and ambition for innovation tend to co-evolve as they build on similar underlying organizational strengths. hence, we may reasonably expect that in some resource environments around the entrepreneur there will be a presence of ambition for financing and simultaneously also ambition for innovation. moreover, it is reasonable to theorize that the more the entrepreneur is exposed to such environments, the more likely the entrepreneur's aspirations are to include elements of financing and innovation. this external influence on the entrepreneur's aspiration formation processes is represented by ansoff (1987) in his model of paradigmatic complexity of the strategy formation process. at a more general level, it is also represented in ajzen's model of planned behavior, in which stakeholders' resources, norms and expectations shape the entrepreneur's intentions, goals and aspirations (liñán and chen 2009) . a similar prediction can be made from social comparison theory (boyd and vozikis 1994) . because the entrepreneur's business is not yet tested in the market, the entrepreneur may rely on modeling and imitation as a source of self-efficacy. in this process beliefs in own capabilities, and thereby aspirations, will be assessed by successes and failures of similar others (wood and bandura 1989) . finally, the entrepreneur's aspirations may be influenced by persuasion through encouragement, even in situations where encouragement is given on unrealistic grounds (ibid.). these mechanisms may work through direct relationships held by the entrepreneur or indirectly as the entrepreneur observes and interprets stimuli from the wider environment. coupling may be pursued as a strategy. 
as a strategy it may be partly based on an analysis of strengths, weaknesses, opportunities and threats, swot, as business students learn and managers and owners apply. an entrepreneur can hardly estimate any of the elements with reasonable validity and reliability. but the entrepreneur is likely to discuss such matters with others, listen to them, and take their advice into consideration when pursuing financing and innovation. thus the coupling is likely to be influenced by the network around the entrepreneur, the network of people giving the entrepreneur advice on the new business. as suggested by literature on entrepreneurial opportunity and alertness, it is a social process to create and grow a new venture, entailing efforts by entrepreneurs to use their networks to mobilize and deploy resources to exploit an identified opportunity and achieve success (ebbers 2014; adomako et al. 2018) . besides, "an important part of the nascent entrepreneurial process is a continuing evaluation of the opportunity, resulting in learning and changes in beliefs" (mccann and vroom 2015) . the pursuit of coupling is thus embedded in the advice network, which channels, enables and constrains beliefs and strategy, specifically pursuit of coupling of financing and innovation. the people giving advice to the entrepreneur are often drawn from a wide spectrum, both from the private sphere of family and friends and from the public sphere comprising the work-place, the professions, the market and the international environment (jensen and schøtt 2017) . an entrepreneur's networking in the private sphere and networking in the public sphere differ in their consequences for the startup. networking in the public sphere promotes, whereas networking in the private sphere impedes, such business endeavors as innovation, exporting and expectations for growth (schøtt and sedaghat 2014; ashourizadeh and schøtt 2015; schøtt and cheraghi 2015) . we here consider how such networking influences coupling of financing and innovation. an entrepreneur's networking in the private sphere of family and friends may shape the coupling between innovation and financing through its influence on ambition of the entrepreneur. the entrepreneur's family is often putting its wealth at risk in the startup, and is likely to be cautious and to caution the entrepreneur against being overly self-efficacious, overly optimistic about business opportunities, and overly risk-willing. furthermore, due to mutual trust, frequent contacts, intimacy and reciprocal commitments in such relationships (granovetter 1973, 1985; greve 1995; anderson et al. 2005) , their influence tends to be deep and significant. when private sphere networking constrains the entrepreneurial mindset, the entrepreneur becomes less ambitious and will pursue less financing or less innovation, and will be especially reluctant to pursue financing for innovation. conversely, an entrepreneur without such a constraining network in the private sphere will plausibly feel rather free, and will more wishfully think of own capability, of own efficacy, of opportunities, and of risks, and consequently will be more ambitious and therefore also pursue both financing and innovation. family members and friends tend to move within the same circles as the entrepreneur (anderson et al. 2005) . they know each other and are likely to have a high degree of social, cultural, educational and professional homophily (granovetter 1973, 1985; greve 1995) . 
the members within such network are likely to possess or access much overlapping information and multiple redundant ties therefore often add little value when an entrepreneur is seeking novel resources/information and financing. the consideration concerning the private sphere leads us to hypothesize, hypothesis 1: networking within the private sphere reduces coupling between financing and innovation. public sphere networking shaping coupling an entrepreneur's networking for advice in the public sphere is drawn from the workplace, professions, market and the international environment. these formal and informal advisors are mostly business people and business-related people. they are likely to be more self-efficacious, optimistic about opportunities, and risk-willing, than the entrepreneur's private sphere network. they are likely to influence the entrepreneur to be more self-efficacious, optimistic and risk-willing, and thereby more ambitious and more likely to pursue both financing and innovation. apart from such positive mindset influence, a diverse set of persons working in different public contexts with quite different knowledge bases, experiences, mental patterns, and associations enable the entrepreneur to access to a broad array of nonredundant novel ideas and expanded financing opportunities (hsu 2005; burt 2004; dyer et al. 2008) . particularly, some critical contacts in the public sphere, such as venture capitalists, successful entrepreneurs, and business incubators, not only directly bring the nascent entrepreneur valuable suggestions, creative ideas, and financial resources simultaneously, but also play the role of business referrals and endorsements and further broaden the entrepreneur's opportunities for acquiring and enhancing innovation and financing capabilities (van osnabrugge and robinson 2000; mason and stark 2004; löfsten and lindelöf 2005; cooper and park 2008; ramos-rodríguez et al. 2010; croce et al. 2017) , generating a "snowballing effect". these arguments thus lead us to specify, hypothesis 2: networking in the public sphere promotes coupling between financing and innovation. investors, especially venture capitalists and angel investors, often appreciate and encourage innovation with financial support (kortum and lerner 2000; engel and keilbach 2007; bertoni et al. 2010) . investors frequently bring the entrepreneurs more than purely financial capital, such as their technical expertise, market knowledge, customer resources, strategic advices, and network augmentation (sapienza and de clercq 2000; mason and stark 2004; brown et al. 2018) . investors, angel investors and vcs like to syndicate their investments with others, and to share the investment risk and strengthen evaluating and monitoring capacities (kaplan and strömberg 2004; wong et al. 2009; brown et al. 2018 ), which will expand and strengthen their financial and innovation support. as observed by brown et al. (2018) , a key feature of the entrepreneurs who use equity crowdfunding is their willingness to innovate and they are very proficient at combining financial resources from different sources and drawing on the networks to alleviate and overcome their internal resource constraints. therefore, networking with these investors is likely to spur and enable the entrepreneur in risktaking and innovative behavior. meanwhile, being in the investors' circle, the entrepreneur is easily identified and accessed. 
in the networking process, the actors learn more about each other, trust emerges from repeated interactions, and then stimulates closer interpersonal interaction and mitigates the fear of opportunistic behaviors caused by information asymmetry (jensen and meckling 1976; de bettignies and brander 2007) . moreover, the endorsement by reputable investors can send a favorable signal to the investment market about the entrepreneur and the project, and attract more investors to join (see zip case by steier and greenwood 2000) . especially, as found by van osnabrugge and robinson (2000) , angel investors often have entrepreneurial and business operation experience, and have empathy for an innovative entrepreneur, and have the passion to help, and perform less due diligence but invest more by instincts. altogether, this may enhance the matching opportunity between innovative ideas and funding needs and investment desire, leading to a coupling between innovation and financing. therefore, we hypothesize, hypothesis 3: networking with potential investors promotes coupling between financing and innovation. timmons and bygrave (1986, p.170) identified a shared view between founders of innovative ventures and venture capitalists that "the roots of new technological opportunities depend upon a continuing flow of knowledge from basic research". thus researchers and inventors are generators and carriers of knowledge, intellectual property, and patents. by networking with them, the entrepreneur may acquire these innovative resources. codified and tacit knowledge is transferred in different ways, notably through education, consulting, and r&d-based project cooperation, and conversations. indeed, the benefits of networking with researchers or inventors is expressed in arrangements in innovation systems, such as the triple helix model (etzkowitz 2003) ; science parks (löfsten and lindelöf 2005) , entrepreneurial universities, incubators, research-based spin-offs, open innovation (etzkowitz 2003; rothaermel et al. 2007; enkel et al. 2009 ), and industrial ph.d. projects. these models, polices, organizational formats and education programs are proposed with the same strategic intention: to provide a nurturing environment, and link talent, technology, capital and know-how to spur innovation and commercialization of technology. networking with researchers and inventors not only enables the entrepreneur to tap into a broader research community, but may also sends a signal to the market about the quality and veracity of the project and its knowledge foundation, and may reduce the investors' worries about their investment (hsu 2004; murray 2004) , especially for an early-stage entrepreneur without established reputation and performance record, and particularly when the venture is innovative. therefore, we propose: hypothesis 4:networking with researchers promotes coupling between financing and innovation. networking with both investors and researchers can generate synergy leading to further coupling of innovation and financing as elaborated in the following. as argued above, networking with investors and with researchers or inventors separately can provide the entrepreneur with both financial resources, knowledge and talents for innovation. 
when the entrepreneur networks with both investors and researchers, the resources obtained from the two parties may generate an additional "positive loop effect", which means more sophisticated innovation brought by the ties with researchers and inventors attract more capital, and more capital available for r&d further enhance innovation aspiration, which again attract more capital and then more r&d investment, and then enhance innovation; in mutual reinforcement. moreover, networking with both an investor and a researcher, implies that when legitimacy is obtained from one of the two, this sends a signal to the other encouraging the other to bestow legitimacy on the entrepreneur, which may attract further financing and ideas for innovation. we may call this a "reinforced signaling effect". the ties with researchers, investors, and their network contacts help open up more relations for acquiring additional funds and knowledge like "reinforced snowball effect". timmons and bygrave (1986) had observed that there were geographical oases for incubating a bulk of innovative technological ventures, where the founders, entrepreneurs, technologists, and investors cluster. using a longitudinal case study, calia et al. (2007) illustrate how a technological innovation network (with the involvement of universities, venture investors, and banks) enables a case company to establish its business and to survive and grow. these synergies suggest an effect that is over and above the two separate effects of networking with investors and networking with researchers, hypothesis 5: networking with both investors and researchers further enhances coupling between financing and innovation. the hypothesized effects are illustrated in fig. 1 . the world's entrepreneurs are surveyed by the global entrepreneurship monitor (bosma 2013) . in most countries covered in the period 2009 to 2014 the survey included questions about networking, financing and innovation. sampling gem samples adults in two stages. the first stage occurs when a country is included, namely when a national team is formed and joins gem to conduct the survey in its country. hereby 50 countries were covered where the essential questions were asked. these countries are drawn from a diversity of regions, cultures, economies, and levels of development, and form a sample of countries which is fairly representative of the countries around the world. the second stage of sampling is the fairly random sampling of adults within a country, and then identifying the starting entrepreneurs. entrepreneurs at inception are identified as those who are currently trying to start a business, have taken action to start, will own all or part of the business, and have not yet received, or just begun to receive, some kind of compensation. by this identification of entrepreneurs, this sample is 10,582 entrepreneurs who reported their networking, financing, and innovation. representativeness of sampling enables generalization to the world's starting entrepreneurs and their startups. financing of the startup was measured by asking the entrepreneur, how much money, in total, will be required to start this new business? please include both loans and equity/ownership investments. the amount is recorded in the local currency, an amount from 0 upward. to make this comparable across countries, the amount is normalized by dividing by the median for the country's responding entrepreneurs. 
then, to reduce the skew, we take the logarithm (first adding 1), a measure that runs from 0 for no financing, and then upward. this indicator of financing enters into the measurement of coupling. innovativeness in the startup was indicated by asking three questions, have the technologies or procedures required for this product or service been available for less than a year, or between one to five years, or longer than five years? will all, some, or none of your potential customers consider this product or service new and unfamiliar? right now, are there many, few, or no other businesses offering the same products or services to your potential customers? the answer to each question is here coded 0, 1, 2 for increasing innovativeness. the three measures are inter-correlated positively. the three measures are averaged as an index of innovation, running from 0 to 2. this index of innovation enters into the measurement of coupling. two business practices, here innovation and financing, are coupled in so far as they are pursued jointly. the coupling between two practices in a business is indicated by their co-occurrence at inception of the business. coupling between innovation and financing is high to the extent that innovation is high and financing is high. conversely, coupling is low when either of them is low. when the occurrence of each practice is measured on a scale from 0 upward, the coupling of the two practices is indicated by the product of the two measures: if financing is 0 or if innovation is 0, then coupling is 0. conversely, if both financing is high and innovation is high, then coupling is very high. the scale has no intrinsic meaning, so, for analyses, the measure of coupling is standardized. validity can be ascertained. coupling expectedly correlates positively with expectation for growth, as an indication of performance at inception. growth-expectation is indicated as expected number of persons working for the business when five years old (transformed logarithmically to reduce skew). the correlation is positive (.26 with p < .0005) confirming validity of the operationalization of coupling. the network around an entrepreneur is indicated by asking the entrepreneur to report on getting advice, various people may give you advice on your new business. have you received advice from any of the following? your spouse or life-companion? your parents? other family or relatives? friends? current work colleagues? a current boss? somebody in another country? somebody who has come from abroad? somebody who is starting a business? somebody with much business experience? a researcher or inventor? a possible investor? a bank? a lawyer? an accountant? a public advising services for business? a firm that you collaborate with? a firm that you compete with? a supplier? a customer? networking in the private sphere is measured as number of advisors among the four: spouse, parent, other family, and friends, a measure going from 0 to 3. networking with a researcher or inventor is measured dichotomously, 1 if advised by a researcher or inventor, and 0 if not. networking with a possible investor is measured dichotomously, likewise, 1 if advised by a possible investor, and 0 if not. the network with others in the public sphere is measured as number of advisors among the other 14, a measure going from 0 to 14 (jensen and schøtt 2017) . validity can be assessed. 
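before turning to validity, the construction of the measures just described can be sketched in code. the fragment below is a minimal illustration rather than the authors' own scripts, and the column names (country, finance_needed, the three innovation items, and the advisor dummies) are assumptions made for the example; it builds the financing score, the innovation index, their standardized product as the coupling measure, and the advisor counts.

import numpy as np
import pandas as pd

def build_measures(df: pd.DataFrame) -> pd.DataFrame:
    # financing: required start-up amount, normalized by the country median,
    # then log-transformed (adding 1 first) to reduce skew
    country_median = df.groupby("country")["finance_needed"].transform("median")
    df["financing"] = np.log1p(df["finance_needed"] / country_median)

    # innovation: three items each coded 0, 1, 2 and averaged into a 0-2 index
    items = ["tech_newness", "customer_newness", "few_competitors"]
    df["innovation"] = df[items].mean(axis=1)

    # coupling: co-occurrence of the two practices, indicated by their product;
    # the raw scale has no intrinsic meaning, so it is standardized for analysis
    raw = df["financing"] * df["innovation"]
    df["coupling"] = (raw - raw.mean()) / raw.std()

    # networks: advisor count in the private sphere, advisor count among the
    # other public-sphere sources, and dummies for researcher and investor advice
    private = ["adv_spouse", "adv_parents", "adv_family", "adv_friends"]
    public_other = [c for c in df.columns if c.startswith("adv_pub_")]
    df["private_net"] = df[private].sum(axis=1)
    df["public_net"] = df[public_other].sum(axis=1)
    df["researcher_tie"] = df["adv_researcher"]
    df["investor_tie"] = df["adv_investor"]
    return df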
in the theoretical section we argued that private sphere networking is associated negatively, and public sphere networking is associated positively, with self-efficacy and opportunity-perception. these correlations all turn out to be as expected, indicating validity of the operationalization of networks. the analysis controls for attributes of the entrepreneur and the business. gender is coded 0 for males and 1 for females. age is measured in years. education is indicated in years of schooling. motive for starting the business is either seeing a business opportunity or necessity to make a living, coded 1 and 0, respectively. owners is number of owners, transformed logarithmically to reduce skew. we also control for macro-level context in two respects, national wealth as gni per capita, and the elaboration of the national entrepreneurial eco-system, measured as the mean of the framework conditions measured by gem in its national expert survey (bosma 2013) . the population is the world's entrepreneurs, where a respondent is surveyed in a country. the data are thus hierarchical with two levels, individuals nested within countries. the country should be taken into account, both because level of activity, e.g. networking and innovation, differs among countries, and because behavior is similar within each country. these circumstances of country are taken into account in hierarchical linear modeling (snijders and bosker 2012) . hierarchical linear modeling is otherwise very similar to linear regression. notably, the effect of a condition is tested and estimated by a coefficient. hierarchical linear modeling is used in table 3 . the sample of 10,582 starting entrepreneurs is described by correlations, table 1 . furthermore, among the entrepreneurs, 9% were networking with a researcher or inventor, and 13% were networking with a potential investor. although these two kinds of networking are not common, they are not rare. these two kinds of networking tend to go hand in hand, unsurprisingly, and are also correlated with networking with others in the public sphere and networking in the private sphere, but none of these correlations are high. the correlations among variables of interest and between variables of interest and control variables are mostly weak, indicating that there is no problem of multicollinearity in the analysis. coupling of innovation and financing is high to the extent that innovation is high and financing is high. conversely, coupling is low when either of them is low. to see whether coupling is typical, we cross-tabulate the startups according to their innovation and their financing, table 2 . coupling is high in the startups where both innovation and financing are high, the bold-faced 12% in table 2 . conversely, coupling is low in the startups where either innovation or financing is low, the italicized 10 + 12 + 9 + 8 + 8%. in between, coupling is medium where one is medium and the other is medium or high, the 12 + 14 + 15% in table 2 . the table does not clearly display a tendency for innovation and financing to go hand in hand. indeed, the correlation between financing and innovation is .06 (p < .0005). thus there is a weak tendency for innovation and financing to co-occur, a coupling that is loose rather than tight. coupling is affected by the various kinds of networks around the entrepreneur, we hypothesized. effects on coupling are estimated in the hierarchical linear model, table 3 , in which numerical independent variables are standardized, then centered within country. 
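as a sketch of how such a two-level specification can be estimated, the fragment below fits a random-intercept model with country as the grouping level. the use of a mixed linear model in statsmodels as a stand-in for hierarchical linear modeling, and the variable names carried over from the sketch above, are assumptions of the example rather than the authors' exact setup.

import statsmodels.formula.api as smf

# second model of table 3: the four kinds of networking, the investor-by-researcher
# interaction (the synergy term of hypothesis 5), individual-level controls, and
# two country-level controls; the random intercept per country reflects the
# nesting of entrepreneurs within countries
formula = (
    "coupling ~ private_net + public_net + investor_tie + researcher_tie"
    " + investor_tie:researcher_tie"
    " + female + age + education + opportunity_motive + log_owners"
    " + gni_per_capita + ecosystem"
)
model = smf.mixedlm(formula, data=df, groups=df["country"])
result = model.fit()
print(result.summary())  # positive network coefficients would support h2-h5, a negative one h1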
hypothesis 1 is that coupling is affected negatively by networking in the private sphere. the effect is tested in the first model in table 3 . this effect is negative, thus supporting hypothesis 1. hypothesis 2 is that coupling is affected positively by networking in the public sphere, with advisors other than investors and researchers. this effect is tested in the first model. the effect is positive, thus supporting hypothesis 2. hypothesis 3 is that coupling is affected positively by networking with potential investors. this effect is positive, thus supporting hypothesis 3. hypothesis 4 is that coupling is affected positively by networking with researchers. this effect is positive, supporting hypothesis 4. the effects of investors and of researchers are substantial, and the effects of networking in the public sphere and networking in the private sphere are of notable magnitude. hypothesis 5 is that coupling is affected positively by networking with investors together with researchers, as a synergy effect that is in addition to the separate effect of investors and the separate effect of researchers. this is tested by expanding the model by including the interaction term, the product of the dichotomy for networking with investors and the dichotomy of networking with researchers. the effect of the interaction is estimated in the second model in table 3 . the interaction effect is positive, thus supporting hypothesis 5. the effect is actually of a magnitude that is quite substantial. in short, the five hypotheses are all supported. the analyses have addressed the two research questions. what is the coupling between innovation and financing at inception? what is the embeddedness of coupling in the networks around the entrepreneur, specifically the networks with investors and researchers? the questions have been addressed by a survey of a globally representative sample of entrepreneurs at inception of their startup. the representativeness of sampling implies that findings can be generalized to the world's starting entrepreneurs. the next sections discuss our findings concerning, first, coupling as a phenomenon, and, second, embeddedness of coupling in networks. coupling as a phenomenon was found to be infrequent, in that a typical startup does not pursue both financing and innovation. often, a startup is either innovative or well financed. rather few startups are both highly innovative and well financed. across startups, innovativeness and financing are positively correlated, but only weakly, indicating that innovation and financing have a coupling that is loose rather than tight (section 4.2). coupling between innovation and financing is a capability. pursuing such coupling in a startup is building an organizational capability. coupling goes beyond the capability to innovate and goes beyond the capability to finance starting. coupling is a competitive advantage in the competition among startups, a competition to enter the market, survive, expand and grow. coupling in a startup correlates positively with expectation to grow (section 3.2.3), indicating the benefit to be expected from coupling, and thus indicating that coupling is a competitive advantage. it is theoretically surprising to find that coupling is so loose, when coupling is a competitive advantage. but empirically it is less surprising, when we bear in mind that, typically at inception, financing is not invested in innovation, but is invested in production. 
a loose coupling could also be caused by information asymmetry, where the entrepreneurs who have creative idea and innovation capability cannot be identified by the investors. such interpretation can find some evidence in the study by shu et al. (2018) . alternatively, it could also be the entrepreneurs who have financing capability or financial resources lack the incentives, energy or capability to polish their ideas or projects but are eager to start the new ventures. these could be the so-called necessitydriven or desperate entrepreneurial activities, which are in contrast to opportunitydriven actions (mühlböck et al. 2018; fernández-serrano et al. 2018) . the study by mühlböck et al. (2018) , using the data from the global entrepreneurship monitor (gem), has provided some evidence. they observed that many entrepreneurs sprung up during the outbreak of the economic crisis, but these businesses were started even without (or with a negative) perception of business opportunities and entrepreneurial skills. the authors term this phenomenon as "nons-entrepreneurship" driven by necessity, meaning there are no other options for a job but only to start their own businesses. usually in such cases, the institutional environment is favorable. besides, according to their findings, there is a considerable share of such individuals among early stage entrepreneurs. additionally, we suspect that the coupling is very loose at the inception, because the competitive advantage of coupling has not yet taken effect at inception. the coupling will have an effect only later, we expect, namely as the startup competes in the market, for survival, expansion and growth. therefore, coupling is appropriately considered strategic building of capabilities. future research may advance our knowledge regarding such loose coupling. from these findings, we may learn at least two practical lessons. first, financing and innovation do not go hand in hand at the inception, although this is actually important for a new venture to succeed. the failure rate of entrepreneurial firms is high mainly due to the resource scarcity and financial constrains (colombo et al. 2014) . such loose coupling, as discussed above, could be caused by low participation willingness of the capital owners or a lack of effective channels for two sides to identify/know each other. policymakers may give special attention to these, and design some mechanisms, set up the rules, or provide the supports to attract or guide the capital into the inception phase and reduce the potential problem of information asymmetry between two sides. besides, with aforementioned potential reason of the presence of necessity-driven entrepreneurs that causes the loose coupling, both policymakers and investors are suggested to distinguish between the necessity-(especially desperate) and opportunitybased entrepreneurs and take actions. as considered by mühlböck et al. (2018) and confirmed by fernández-serrano et al. (2018) , those desperate or necessity-driven entrepreneurs with a lesser feasibility and skills may be less successful and thus less beneficial for the economy than opportunity-driven entrepreneurs. second, entrepreneurs, and especially nascent entrepreneurs, should pay attention to create such coupling, and networking can be an efficient way as will be discussed below. a recent study by rezaeizadeh et al. 
(2017) found interpersonal skills for networking is one of the top competences that the entrepreneurs should possess, and they suggest such competence development be included in university education. meanwhile, they suggest that continuous training programs with a network of proactive peers, engaged academics, and a wider business community will help sustain and develop entrepreneurial intentions and behaviors, as well as expand the entrepreneurs' networks. below, we discuss the network influence in more detail. coupling is channeled, enabled and constrained by networks around the entrepreneur. on a broader level, we may say networking capability is one of the important organizational capabilities, especially in the increasingly knowledge-intensive and turbulent economic environment, since different networks represent different conduits of information and resources that the organization can constantly access. thereby, the organization can become more flexible and adaptive. as also advocated by windsperger et al. (2018, p.671) , entrepreneurial networks should be used by the firms "to complement their resources and capabilities in order to realize static and dynamic efficiency advantages". networking is typically thought of as inherently beneficialthe more, the merrierbut some networking may be a waste of time and energy, and some networking may even be detrimental, so networking has its "dark side" (klyver et al. 2011) . networking in the private sphere was here found to be detrimental for coupling, as hypothesized. this finding is consistent with earlier studies, showing negative effects of networking in the private sphere upon outputs such as innovation, exporting, and expectation for growth of the business (schøtt and sedaghat 2014; ashourizadeh and schøtt 2015; cheraghi et al. 2014 ). more generally, whereas networking in the private sphere is beneficial for legitimacy and emotional support (liu et al. 2019) , networking in the private sphere seems detrimental for outputs. network research should not presume that a network is homogenous (as presumed in the most common measure of an actor's social capital as number of contacts), but should distinguish between the dark side and the bright side of a network (klyver et al. 2011) . on the bright side, we found that an entrepreneur's networking in the public sphere i.e. in the workplace, professions, market, and international environmentis beneficial for coupling between innovation and financing. drawing advices from a wide spectrum in the public sphere, a wide spectrum of knowledgeable specialists (also apart from researchers and investors), enables the entrepreneur to combine various kinds of knowledge, information, and resources, which is beneficial for the simultaneous pursuits of innovation and financing. an entrepreneur's networking with a potential investor was also found to benefit the coupling between financing and innovation in the startup, as expected. as also expected, networking with a researcher benefits the coupling. over and above these two additive effects, coupling was found to be further enhanced by simultaneously networking with an investor and with a researcher, discerned as an interaction effect in a multivariate model. networking with an investor and networking with a researcher are not substitutable for one another, and their effects do not simply add up. 
rather, there is a synergy effect, a further enhancing effect over and above the two separate effects of networking with an investor and networking with a researcher. the theory of competitive advantage through structural holes in the network around an actor can help us understanding the synergy benefit (burt 1992a, b) . a focal actor has a structural hole in the network of contacts, when two contacts are not interrelated. the hole between the two implies that they cannot combine something from one with another thing from the other. the focal actor, however, can acquire something from one and another thing from the other, and can thereby combine the two things and, following schumpeterian thinking, the combination constitutes a competitive advantage in the competition among actors for new things. the literal meaning of 'entrepreneur' is going in between and taking a benefit, and in our study the entrepreneur is going between an investor and a researcher, and combining advice or investment from the former with advice or new idea from the latter, and thereby promotes a coupling of financing and innovation, a synergy that builds a capability and a competitive advantage. from the resource-based view (barney 1991; grant 1991) and the dynamic capabilities perspective (teece et al. 1997 ), a firm's resources and capabilities will determine its competitive advantage and value creation, and a firm needs to constantly adapt, renew, reconfigure and re-create its resources and capabilities to the volatile and competitive environment, so that a competitive advantage can be developed and maintained. however, the entrepreneurial firms, especially those at formation stage run by nascent entrepreneurs, usually lack the strategic resources and capabilities at the beginning, e.g., financial resources and financing capabilities, innovation resources and capabilities, business management skills, and have lesser competitive disadvantages. furthermore, the emergence and development of a new venture is a dynamic process with many uncertainties, requiring different resources, information, and knowledge at different time points (hayter 2016; steier and greenwood 2000) . different relationship networks, especially the professional ones discussed in this study, can provide new ventures with opportunities for continually accessing needed resources, forming a basis that enables coupling of financing and innovation, synergy creation from integrating various resources, develop and sustain the new venture's competitive advantages, and gain profit (see also davidsson and honig 2003; batjargal and liu 2004) . along the same lines, holding a relational governance view of competitive advantage, dyer and singh (1998) argue for the critical resources that enable the firm's competitive advantage to extend beyond firm boundaries and are embedded in inter-firm resources and routines, including such components as relation-specific assets, knowledge-sharing routines and complementary resources/capabilities. in summary, we may say networking and coupling capabilities are two crucial capabilities for the nascent entrepreneurs, on top of the others, for identifying, pursuing and creating market opportunities, and for attaining and sustaining the new ventures' competitive advantages. joining the discussion of the influence of strong vs. weak ties (or private vs. public networks) on the entrepreneurs, results of this study, falling in line with some of the research (granovetter 1973; davidsson and honig 2003; afandi et al. 
2017) , further remind the entrepreneurs to be aware of potential detrimental effect of being overembedded in the private sphere network that is bringing information and resource redundancy and social obligation. rather, they are well advised to actively and judiciously pursue, develop, and maintain public sphere networking, especially the professional networks with the investors and researchers/inventors, which enable and promote the coupling between innovation and financing, and capability development in these regards. the entrepreneur network capability framework developed by shu et al. (2018) can be a good reference, four dimensions comprising network orientation, network building, network maintenance, and network coordination. network orientation should be in the first place, which means a person should be willing to develop and depend on social networks in own daily socialization, believe, pay special attention to and act on the norms of dependence, cooperation, and reciprocation. in terms of the orientation, as discussed, this study suggests the importance and benefits of widening and diversifying the entrepreneurs' social relations, especially being in and crossing different professional communities. however, most of the entrepreneurs may be not aware of this. for instance, a study of university spin-off by hayter (2016) found that early-stage academic entrepreneurs have their contacts mainly within academic communities that are typically located in their home institutions, and such homophilous ties would further constrain the entrepreneurial development. with clear orientation, the entrepreneur shall monitor surroundings and make effective investment to establish and expand the networks. however, as reminded by semrau and werner (2012) , it is not a good idea to extend the network size without boundaries because there is an opportunity cost of time and the cost can surpass the benefits that the networks can bring. our study further suggests that it is worthwhile to invest in developing the contacts at least in two communities, i.e., with capital holders and knowledgeable and new ideas generators, due to the unique and mutual-reinforced synergistic contributions to founding the new venture, as discussed earlier. it can happen that the nascent entrepreneurs have sufficient personal or family wealth to self-finance the start-up process. however, the entrepreneurs ought to remember that sometimes it is not the "capital" itself that makes the success of a new venture, but the capital-associated resources that help, i.e., from the sources providing capital. . the entrepreneurs ought to think about the other benefits that the investors could bring, such as commercialization competences, business management skills, reputation, more diverse network access, synergistic effect, as shown in several studies (van osnabrugge and robinson 2000; hsu 2004; mason and stark 2004; croce et al. 2017) . further, while network maintenance is to ensure stable and long-term exchange relationships with them, network cooperation is to manage multiple and dynamic relationships, and to mobilize and integrate resources. moreover, these results may also be relevant for well-established organizations that seek to enhance their innovation and financing capabilities and gain a competitive advantage, suggesting that strategically developing, managing and utilizing the bridging social ties may be an efficient way. 
at the individual level, this may encompass designing an incentive scheme and training program to improve the employees' entrepreneurial spirit, networking awareness and capability. at the organizational level, the firms should strategically manage inter-organizational relationships, both formal and informal, and build systems that can monitor the surroundings, and thereby identify and evaluate new business opportunities outside the organizational boundaries. relevant concepts, models and strategies can be, e.g., cooperative entrepreneurship (rezazadeh & nobari, 2018) and open innovation (enkel et al. 2009 ). as concluded by rezazadeh and nobari (2018) , cooperative entrepreneurship is likely to lead to improvement of firms' agility, customer relationship management, learning, innovation, and sensing capabilities. from a public policy perspective, the above results have important policy implications, stemming essentially from the contribution to innovation coming from networking with researchers, inventors and investors. if innovation and entrepreneurial businesses are important for economic development and for people's life, the study clearly suggests that public policy should be designed to encourage, facilitate and support business networking activities, researcher-business collaboration, and investorentrepreneur connections. besides, university education should be another focus by the policy-makers, since it can be an efficient way or a starting point to foster people's entrepreneurial spirits, develop the students' entrepreneurial competences, especially their networking and relationship management capabilities, and even provide some opportunities for them to develop their networks which may enable them to be an entrepreneur in the future. some strategies and models can be, as documented, the university-based incubation programs, entrepreneurship education programs, researchbased spin-off, and building entrepreneurial universities (clarysse and moray 2004; rothaermel et al. 2007; budyldina 2018) . our research design was to investigate coupling at inception of the startup. this design has the advantages of avoiding attrition when startups are abandoned and avoiding retrospection if interviews were to be conducted later. but the cross-sectional focus on inception implies that the fate of a startup and its coupling are unknown. coupling is presumably yielding a competitive advantage, but at inception this is not enacted. another limitation is that the data are from around 2014, so we have observed the same constraints confronted by other scholars of entrepreneur and entrepreneurship (e.g., mühlböck et al. 2018; fernández-serrano et al. 2018) . entrepreneurial behavior has changed since networking was surveyed by gem, and organizing is changing even more with the covid-19 pandemic. the limitations suggest further research on coupling. coupling appears important as a strategy for building capability and competitive advantage. therefore, an important research question is, what is the effect of coupling in a startup upon its ability to compete, survive, expand and grow? an indication of the effect of coupling upon growth was seen in the substantial correlation between coupling and expectation for growth of the business (section 3.2.3). but, of course, effects of coupling are far better ascertained through longitudinal research. the current covid-19 pandemic is an eco-systemic intervention that is changing competition and organizational behavior. 
based on our findings, we hypothesize current exits to be especially prevalent among entrepreneurs without coupling of financing and innovation, and we hypothesize that success is especially likely for entrepreneurs with a tight coupling between innovation and financing. such hypotheses may well be tested with some of the surveys that are underway in the wake of the pandemic. entrepreneurial alertness and new venture performance: facilitating roles of networking capability social capital and entrepreneurial process the role of family members in entrepreneurial networks: beyond the boundaries of the family firm the emerging paradigm of strategic behavior. strategic management journal exporting embedded in culture and transnational networks around entrepreneurs firm resources and sustained competitive advantage entrepreneurs' access to private equity in china: the role of social capital venture capital investments and patenting activity of high-tech start-ups: a micro-econometric firm-level analysis the global entrepreneurship monitor (gem) and its impact on entrepreneurship research the influence of self-efficacy on the development of entrepreneurial intentions and actions working the crowd: improvisational entrepreneurship and equity crowdfunding in nascent entrepreneurial ventures entrepreneurial universities and regional contribution structural holes the social structure of competition structural holes and good ideas innovation networks: from technological development to business model reconfiguration growth-expectations among women entrepreneurs: embedded in networks and culture in algeria, morocco, tunisia and in belgium and france a process study of entrepreneurial team formation: the case of a researchbased spin-off green start-ups and local knowledge spillovers from clean and dirty technologies to be born is not enough: the key role of innovative startups ownership structure, horizontal agency costs and the performance of high-tech entrepreneurial firms the impact of incubator' organizations on opportunity recognition and technology innovation in new, entrepreneurial high-technology ventures how business angel groups work: rejection criteria in investment evaluation the role of social and human capital among nascent entrepreneurs financing entrepreneurship: bank finance versus venture capital the relational view: cooperative strategy and sources of interorganizational competitive advantage entrepreneur behaviors, opportunity recognition, and the origins of innovative ventures networking behavior and contracting relationships among entrepreneurs in business incubators firm-level implications of early stage venture capital investment -an empirical investigation open r&d and open innovation: exploring the phenomenon innovation in innovation: the triple helix of university-industry-government relations efficient entrepreneurial culture: a cross-country analysis of developed countries the strength of weak ties economic action and social structure: the problem of embeddedness the resource-based theory of competitive advantage: implications for strategy formulation networks and entrepreneurship -an analysis of social relations, occupational background, and use of contacts during the establishment process constraining entrepreneurial development: a knowledge-based view of social networks among academic entrepreneurs what do entrepreneurs pay for venture capital affiliation? 
formation of industrial innovation mechanisms through the research institute theory of the firm: managerial behavior, agency costs and ownership structure components of the network around an actor characteristics, contracts, and actions: evidence from venture capitalist analyses social networks and new venture creation: the dark side of networks assessing the contribution of venture capital to innovation measuring emergence in the dynamics of new venture creation development and cross-cultural application of a specific instrument to measure entrepreneurial intentions women's experiences of legitimacy, satisfaction and commitment as entrepreneurs: embedded in gender hierarchy and networks in private and business spheres r&d networks and product innovation patterns-academic and nonacademic new technology-based firms on science parks what do investors look for in a business plan? a comparison of the investment criteria of bankers, venture capitalists and business angels opportunity evaluation and changing beliefs during the nascent entrepreneurial process desperate entrepreneurs: no opportunities, no skills the role of academic inventors in entrepreneurial firms: sharing the laboratory life loosely coupled systems: a reconceptualization what you know or who you know? the role of intellectual and social capital in opportunity recognition core entrepreneurial competencies and their interdependencies: insights from a study of irish and iranian entrepreneurs, university students and academics antecedents and consequences of cooperative entrepreneurship: a conceptual model and empirical investigation university entrepreneurship: a taxonomy of the literature venture capitalist-entrepreneur relationships in technology-based ventures. enterprise and innovation management studies gendering pursuits of innovation: embeddedness in networks and culture innovation embedded in entrepreneurs' networks and national educational systems: a global study the two sides of the story: network investments and new venture creation building networks into discovery: the link between entrepreneur network capability and entrepreneurial opportunity discovery multilevel analysis: an introduction to basic and advanced multilevel modeling entrepreneurship and the evolution of angel financial networks dynamic capabilities and strategic management venture capital's role in financing innovation for economic growth angel investing: matching startup funds with startup companies-the guide for entrepreneurs and individual investors engaging with startups to enhance corporate innovation educational organizations as loosely coupled systems governance and strategy of entrepreneurial networks: an introduction angel finance: the other venture capital. 
strategic change social cognitive theory of organizational management key: cord-027304-a0vva8kb authors: achermann, guillem; de luca, gabriele; simoni, michele title: an information-theoretic and dissipative systems approach to the study of knowledge diffusion and emerging complexity in innovation systems date: 2020-05-23 journal: computational science iccs 2020 doi: 10.1007/978-3-030-50423-6_19 sha: doc_id: 27304 cord_uid: a0vva8kb the paper applies information theory and the theory of dissipative systems to discuss the emergence of complexity in an innovation system, as a result of its adaptation to an uneven distribution of the cognitive distance between its members. by modelling, on one hand, cognitive distance as noise, and, on the other hand, the inefficiencies linked to a bad flow of information as costs, we propose a model of the dynamics by which a horizontal network evolves into a hierarchical network, with some members emerging as intermediaries in the transfer of knowledge between seekers and problem-solvers. our theoretical model contributes to the understanding of the evolution of an innovation system by explaining how the increased complexity of the system can be thermodynamically justified by purely internal factors. complementing previous studies, we demonstrate mathematically that the complexity of an innovation system can increase not only to address the complexity of the problems that the system has to solve, but also to improve the performance of the system in transferring the knowledge needed to find a solution. formalization and organization of a network becomes strategic to accelerate the flow of information and knowledge and the emergence of innovation [2] . several forms of reticular organization (hierarchical, heterarchical, according to the centrality of elements, according to the transitivity of elements, etc.) can be conceptualized within that context. evolutionary economics and technology studies highlight (neo-schumpeterian) models to understand the plurality of evolution cases, depending on the initial forms of organization, but also on the ability of a system to adapt to systemic crises. in this work we study, from an information-theoretical perspective, the relationship between the structure of an innovation network, the noise in its communication channels and the energy costs associated with the network's maintenance. an innovation network is here considered to encompass a variety of organisations who, through their interactions and the resulting relationships, build a system conducive to the emergence of innovation. this system is identified by the literature [3] with different terms, such as innovation ecosystem, [4] problem-solving network, [5] or innovation environment [6] . in this system, the information channels transfer a multitude of information and knowledge which, depending on the structural holes, [7, 8] but also on the absence of predetermined receivers, are subject to information "noise" [9] . the more the information is distorted in the network, the more energy is needed to transfer accurate information, in order to keep performance of the innovation network high. the idea we propose is that the structure of an innovation system evolves to address the heterogeneity in the quality of communication that takes place between its members. 
in particular, we argue that the noise in a network increases the complexity of the network structure required for the accurate transfer of information and knowledge, and thus leads to the emergence of hierarchical structures. these structures, thanks to their fractal configuration, make it possible to combine high levels of efficiency in the transmission of information, with low network maintenance costs. this idea complements previous studies that have analysed the relationship between the structure of an innovation network, on one hand, and the complexity of the problem to be solved and the resulting innovation process, on the other, [10] by focusing on communication noise and cost of network structure maintenance. to the existing understanding of this phenomenon we contribute by identifying a thermodynamically efficient process which the network follows as it decreases in entropy while simultaneously cutting down its costs. this model is based on the analysis of a network composed of two classes or categories of organisations, which operate within the same innovation system [11] . these classes are represented by a central organisation called seeker, which poses a research question to a group of other organisations, called problem-solvers, and from which in turn receives a solution. it has been suggested [12] that one of the problems that the innovation system has to solve, and for which it self-organises, is the problem of effective diffusion of knowledge between problem-solvers and solution-seekers, as this can be considered as a problem sui generis. the theory on the diffusion of knowledge in an innovation system suggests that this problem is solved through the evolution of modular structures in the innovation network, which implies the emergence of organisations that act as intermediary conduits of knowledge between hyperspecialised organisations in the same innovation environment [13] . a modular structure is, in network theory, connected to the idea of a hierarchical or fractal structure of the network, [14] and is also characterised by scale-invariance; [15] the latter is a particularly important property, because if innovation systems have it as an emergent property of their behaviour, this allows them to be considered as complex adaptive systems [16] . it has been suggested that scale-invariance property of an innovation system might emerge as the result of horizontal cooperation between its elements, [17] which try to reach the level of complexity required to solve a complex problem; but it is not yet clear how does a complex structure emerge when the complexity of the problem does not vary, which is a phenomenon observed empirically [18, 19] . in this paper we show how complexity can also vary as a result of a non-uniform distribution of the cognitive distance between organisations of the network, and of the adaptation required to solve the problem of knowledge diffusion among them. our contribution to the theoretical understanding on the self-organising properties of innovation systems is that, by framing the problem of heterogeneous cognitive distance between organisations under the theory of dissipative systems, we can explain in thermodynamically efficient terms the reduction in entropy of an innovation system, as an emergent adaptation aimed at reducing costs of maintenance of the system's structure. the theoretical framework which we use for this paper is comprised by four parts. 
first, we will frame the innovation system as a thermodynamically-open system, which is a property that derives from the fact that social systems also are [20] . second, we will see under what conditions a system can undertake self-organisation and evolution. this will allow us to consider an innovation system as a complex adaptive system, should it be found that there are emergent properties of its behaviour which lead to an increase in complexity. third, we will frame the innovation system as a dissipative system, which is a property also shared by social systems [21] . dissipative systems are characterised by the fact that a variation in the level of their entropy tends to happen as a consequence of their changed ability to process inputs, and we will see how this applies for innovation systems. lastly, we will study cognitive distance as it applies to a network of innovation, in order to show how a spontaneous reduction in it leads to an increase in complexity of the network. systems. an open thermodynamic system is defined as a system which exchanges matter and energy with its surrounding environment, [22] and among them are found all social systems, which are open systems due to their exchanging of energy with the surrounding environment [23] . social systems are also dynamical systems, because their structure changes over time through a process of dynamical evolution [24] . innovation systems are some special classes of social systems, [25] which can thus also be considered as open systems [26] . in addition to this, like all social systems, innovation systems are also capable of selforganisation, [27] which is a property that they inherit from social systems [28] . there is however a property which distinguishes innovation systems from the generic social system: that is, the capacity of the former to act as problem-solving environments [11] . an innovation system possesses the peculiar function of developing knowledge, [29] which is not necessarily possessed by the general social system [30] . it has been theorised that developing and distributing knowledge [31] is the method by which the innovation system implements the function of solving problems, [32, 33] and we will be working within this theoretical assumption. the innovation system, for this paper, is therefore framed as a thermodynamically-open social system which solves problems through the development and diffusion of knowledge. evolution and self-organisation. like all other social systems, [34] an innovation system undertakes evolution [35] and changes in complexity over time [36] . the change in complexity of a system, in absence of any central planning or authority, is called in the literature self-organisation [37] . self-organisation in a system implies that the system's components do not have access to global information, but only to information which is available in their immediate neighbourhood, and that upon that information they then act [28] . innovation systems evolve, with a process that may concern either their members, [38] their relationships and interactions, [39] the technological channels of communication, [40] the policies pursued in them, [41] or all of these factors simultaneously [42] . for the purpose of this work we will limit ourselves to consider as evolution of an innovation system the modification of the existing relationships between its members, and the functions which they perform in their system. 
this process of evolution of the innovation system is characterised by self-organisation, [43] and it occurs along the lines of both information [44] and knowledge flows within the system [45] . the selforganisation of an innovation system is also the result of evolutionary pressures, [46] and we will here argue that one form of such pressures is cognitive distance between organisations within a network of innovation, whose attempt at reduction may lead to modifications in the relationships within the system and to the emergence of complex structures. while it has also been suggested that variations in the complexity of an innovation system might be the consequence of intrinsic complexity of the problems to be solved, [47] it has also been suggested that problems related to the transfer of knowledge within the elements of the system can, by themselves, generate the emergence of complex network structures, through a process which is thermodynamically advantageous. dissipative innovation systems. as the innovation system acquires a more complex structure, its entropy decreases. if one assumes that the decrease in entropy follows the expenditure of some kind of energy by the system, without which its evolution towards a lower-entropy state is not possible, then it follows that the innovation system can be framed as a dissipative system. this is a consequence of the theory which, in more general terms, suggests that all social systems can be considered as dissipative systems; [48] and, among them, innovation systems can thus also be considered as dissipative systems [49] . the application of the theory of dissipative structures [50] to the study of social systems has already been done in the past, [51, 52] and it has also been applied to the study of innovation systems specifically, [53, 54] to understand the process by which new structures evolve in old organisational networks [55] . by framing the problem in this manner the emergence of a hierarchical structure in a dissipative innovation system can be considered as a process through which the innovation system reaches a different level of entropy in its structure, [56] by means of a series of steps which imply sequential minimal variations in the level of entropy of the system, and lead to the emergence of complexity [57] . cognitive distance as noise. the process of transferring knowledge between organisations presumes the existence of knowledge assets that are transferred [58] . companies embedded in an innovation system are therefore characterised by an intellectual or knowledge capital, [59] which is the sum of the knowledge assets which they possess, [60] and which in turn are the result of the individual organisation's path of development, [61] and of the knowledge possessed by the human and technological components of the organisation [62] . any two organisations do not generally share the same intellectual capital, and therefore there are differences in the knowledge assets which they possess, and in the understanding and representation which they create about the world. this difference is called "cognitive distance" in the literature on knowledge management, and it refers to the difficulty in transferring knowledge between any two organisations [63] . 
the theory suggests that an innovation network has to perform a trade-off between increasing cognitive distance between organisations, which means higher novelty value, and increasing mutual understanding between them, which gives higher transfer of knowledge at the expense of novelty [64] . it has been argued that if alliances (that is, network connections) are successfully formed between organisations with high cognitive distance between their members, this leads to a higher production of innovation by that alliance, [65] as a consequence of the relationship between cognitive distance and novelty, as described above. it has also been argued that the measure of centrality of an organisation in an innovation network is a consequence of the organisation's impact on the whole knowledge governance process, with organisations contributing more to it located more centrally in the network [66] . we propose that this known mechanism might play a role in the dynamic evolution of an innovation system, in a manner analogous to that of noise in an information system. the idea is that an organisation generally possessing a lower cognitive distance to multiple components of a network might spontaneously become a preferential intermediary for the transfer of knowledge within the innovation system, and as a consequence of this a hierarchical network structure emerges out of a lower-ordered structure. the structure of the network and its evolution. the modeling of the process of evolution of a network of innovation is conducted as follows. first, we imagine that there are two different structures of the ego-network of an innovation seeker company that are the subject of our analysis. the first is a horizontal network, in which a seeker organisation is positioned in a network of solvers, which are all directly connected with the seeker organisation in question. the second is a hierarchical or fractal network, in which a structure exists that represents the presence of intermediaries in the transfer of knowledge between the seeker organisation and the solving organisations in the same network. all nodes besides the seeker organisation being studied in the first scenario, and all nodes at the periphery of the hierarchical structure of the second scenario, are from here on called solvers. there are n nodes in the ego-network of an innovation seeker company. the n nodes in the horizontal network are all solver nodes, while the n nodes in the hierarchical network are divided into two classes of nodes: the intermediaries, comprised of m nodes, and the solvers, comprised of m² nodes (fig. 1) . in order to make the two network structures comparable we impose the additional condition that the total number of nodes in the two networks is the same, which is satisfied for n = m² + m. we also impose the additional condition that each of the n solver nodes in the periphery of the horizontal network has at least m link neighbours belonging to n, as this allows us to describe a dynamical process which leads from the horizontal network to the hierarchical network without the creation of new links. the hierarchical network always possesses a lower entropy than the horizontal network comprised of the same number of nodes.
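to make the two topologies concrete, the following is a minimal sketch in python (using networkx) of the horizontal and hierarchical ego-networks described above; the specific wiring of the solver-solver links in the horizontal network (a ring in which each solver is linked to the next m solvers) is an illustrative assumption, since the text only requires that each solver has at least m solver neighbours.

```python
# minimal sketch of the two ego-network topologies described above.
# the ring-style solver-solver wiring in the horizontal case is an assumed layout.
import networkx as nx

def horizontal_network(m):
    """seeker 0 connected to all n = m**2 + m solvers; each solver also linked to m solvers."""
    n = m**2 + m
    g = nx.Graph()
    g.add_edges_from((0, i) for i in range(1, n + 1))   # seeker-solver links
    for i in range(1, n + 1):                           # each solver linked to the next m solvers
        for k in range(1, m + 1):
            g.add_edge(i, (i - 1 + k) % n + 1)
    return g

def hierarchical_network(m):
    """seeker 0 -> m intermediaries -> m solvers each (m**2 solvers in total)."""
    g = nx.Graph()
    g.add_edges_from((0, i) for i in range(1, m + 1))   # seeker-intermediary links
    for idx in range(m):
        for s in range(m):
            g.add_edge(idx + 1, m + 1 + idx * m + s)    # intermediary-solver links
    return g

print(horizontal_network(2).number_of_nodes(), hierarchical_network(2).number_of_nodes())  # 7 and 7
```

for m = 2 both constructors return an ego-network of 7 nodes, i.e. one seeker plus n = m² + m = 6 solvers, as required by the comparability condition above.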
this can be demonstrated by using as a measure of entropy shannon's definition, [67] which calculates it as the amount of information required to describe the current status of a system, according to the formula h(x) = − ∑_x p(x) log₂ p(x). this measure of entropy can be applied to a social network by assigning the random variable x to the flattened adjacency matrix of the edges of the network, as done by others [68] . the adjacency matrices of the two classes of networks in relation to the size n + 1 of the same network are indicated in the tables below, for the specific values m = 2 → n = 6 (table 1) . in general, for any value m ≥ 2, if a horizontal network has n solver nodes, one seeker node is connected to all other n nodes, and all solver nodes are additionally connected to m solver nodes each, where m + m² = n. in a hierarchical network with m intermediary nodes and m² solver nodes, the seeker node is connected to m intermediary nodes, and each of the intermediary nodes is connected to m solver nodes. the general formulation of the adjacency matrix is indicated below, in relation to the value of m (table 2) . the adjacency matrices can be flattened by either chaining all rows or all columns together, in order to obtain a vector x which univocally corresponds to a given matrix. this vector has a dimensionality of (n + 1)², having been derived from an n + 1 by n + 1 matrix. the vector x which derives from flattening can then be treated as the probability distribution over a random binary variable, and shannon's measure of entropy can be computed on it. for the horizontal network, the vector x_horizontal has value 1 two times for each of the peripheral nodes because of their connection to the centre, and then again twice for each of the peripheral nodes. this means that the vector x_horizontal corresponds to the probability distribution (2). for the hierarchical network, the vector x_hierarchical has value 1 two times for each of the m intermediary nodes, and then 2 times for each of the m² solver nodes. the probability distribution associated with the vector x_hierarchical is therefore (3). the hierarchical network systematically possesses a lower level of entropy than a horizontal network with the same number of nodes, as shown in the graph below (fig. 2) . since we consider the network as a dissipative system, the lower level of entropy implies an expected higher energetic cost of maintenance for the lower-entropy structure. it follows from this theoretical premise that the hierarchical network should either allow the system to receive a higher input, or emit a lower output, or both simultaneously, lest its structure would decay to a higher entropy form, the horizontal one. an innovation system which starts evolving from a horizontal structure would tend to develop a hierarchical structure as a solution to the problem of transfer of knowledge in a network where cognitive distance is not uniformly distributed, as we will see in this paragraph. this can be shown by considering the hierarchical network as an attractor for the dynamical evolution of a horizontal network, under the condition that the cognitive distance between pairs of nodes is distributed non-uniformly.
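the entropy comparison just described can be sketched as follows; treating the flattened 0/1 adjacency vector as a bernoulli ("random binary") variable whose parameter is the fraction of ones is our reading of the text, since the explicit distributions (2) and (3) are not reproduced here.

```python
# minimal sketch of the entropy comparison above: flatten the adjacency matrix and
# compute the shannon entropy of the resulting binary variable. the bernoulli
# reading of the flattened vector is our assumption about distributions (2)-(3).
import numpy as np
import networkx as nx

def flattened_entropy(g):
    x = nx.to_numpy_array(g).flatten()      # the flattened (n+1) x (n+1) adjacency matrix
    p1 = x.mean()                           # probability of observing a 1 in the vector
    if p1 in (0.0, 1.0):
        return 0.0
    return -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))

# small m = 2 example: 1 seeker plus 6 solvers in both topologies
horizontal = nx.Graph([(0, i) for i in range(1, 7)] + [(i, i % 6 + 1) for i in range(1, 7)])
hierarchical = nx.Graph([(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6)])
print(flattened_entropy(horizontal), flattened_entropy(hierarchical))   # the hierarchical value is lower
```

under this reading the hierarchical topology, having fewer non-zero entries in its adjacency matrix, yields the lower entropy value, consistent with the claim above.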
stationary states. for the context of this paper, we model a finite-state network which operates on discrete time, using the state-space description of a dissipative system which evolves over time [69] . these functions have the form x(k + 1) = f(x(k), u(k)) and y(k) = g(x(k), u(k)), with x(k) being the state of the system at time k, u(k) being the input to the system at k, and y(k) being the output of the system. if the system does not undertake change in its internal structure, having already reached a stationary state, then x(k + 1) = x(k). as we want to study whether the system spontaneously evolves from a horizontal to a hierarchical structure, we can assume that x(k + 1) = f_hierarchical(x(k), u(k)) = x(k), which can only be true if either the input u(k) is 0, which is not the case if the system is active, or if u(k + 1) = u(k). for the innovation system this condition is valid if minor variations in the structure of the network associated with it do not lead to a significant variation of the input to the system, which means that no advantage in the receipt by the seeker of solutions found by the solvers should be found. if this is true, and if the hierarchical structure is an attractor for the corresponding horizontal network, then we expect the input of the horizontal network to increase as it acquires a modular structure and develops into a hierarchical network. input of the system. the input function of the system depends on the receipt by the seeker organisation of a solution to a problem found by one of the peripheral solver organisations, as described above. let us imagine that at each timestep the solver organisations do indeed find a solution, and that thus the input u(k) depends on the number of solver nodes, and for each of them on the probability of correct transmission of knowledge from them to the seeker organisation, which increases as the cognitive distance between two communicating nodes decreases. if this is true, then the input to the horizontal network is a function of the form u_horizontal(n_k, p_k), where n is the number of solver nodes, and p is the cognitive distance in the knowledge transmission channel. similarly, the input to the hierarchical network is a function of the form u_hierarchical(m²_k, q_k), which depends on the m² solver nodes in the hierarchical network, and on the parameter q which describes the cognitive distance. n and m are such that, as they increase, so do, respectively, u_horizontal and u_hierarchical; while p and q are such that, as they decrease, u_horizontal and u_hierarchical respectively increase. it can then be argued that if p < q then u_horizontal > u_hierarchical, which means that the system would not evolve into a hierarchical network. it can also be noted that, if n and m² are sufficiently large, then lim_{n,m→+∞} (n / m²) = 1, and therefore any difference between the number of solvers in the two network structures would not play a role in the input to the innovation system. from this follows that u_hierarchical > u_horizontal → q < p; that is, the input to the innovation system with a hierarchical structure is higher than the input to the innovation system with a horizontal structure, if the cognitive distance between the members of the former is lower than the cognitive distance between the members of the latter.
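a small numerical illustration of the condition u_hierarchical > u_horizontal whenever q < p follows; the functional form used here (each solver's solution reaching the seeker with probability 1 − distance) is purely an assumption that respects the monotonicity stated above, and the condition is only expected to become an equivalence in the large-m regime where n/m² → 1.

```python
# illustrative check of the condition u_hierarchical > u_horizontal when q < p.
# the functional form is an assumption: the text only fixes monotonicity
# (input grows with the number of solvers, shrinks with cognitive distance).
def u(num_solvers, distance):
    # assumed form: each solver finds a solution that reaches the seeker with
    # probability (1 - distance), with distance normalised to [0, 1]
    return num_solvers * (1.0 - distance)

m = 20                      # intermediaries; m**2 solvers in the hierarchical network
n = m**2 + m                # solvers in the horizontal network
p, q = 0.6, 0.3             # cognitive distance on horizontal vs hierarchical channels

print(u(m**2, q) > u(n, p), q < p)   # both true here; the relation tightens as n/m**2 -> 1
```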
output of the system. as per the output of the system, we can imagine that there is a cost to be paid for the maintenance of the communication channels from which the seeker receives solutions from the solvers. if the system is in a stationary state, the condition y(k + 1) = y(k) must be valid, as it follows from the consideration that u(k + 1) = u(k). if the system is not in a stationary state, as the input to the system increases, so should the output, under the hypothesis of dissipative system described above. a graphical representation of the evolution of the system from a higher to a lower entropy state is presented below (fig. 3: evolution of a branch of the innovation network from a higher to a lower entropy structure, from left to right; the letters p and q define respectively a high and a low cognitive distance between peers). the seeker organisation would at each step receive a solution transferred by one of its link neighbours, with the indication of the full path through which the communication has reached it. the seeker would then pay a certain cost, an output in the terminology of dissipative systems, for the maintenance of the channel through which the solution has been transferred to it successfully. such channels increase in intensity or weight, and are more likely to be used in subsequent iterations. on the contrary, channels through which a solution has not been received in a given iteration are decreased in intensity or weight, and are less likely to be used in the future. a process such as the one described would eventually, if enough iterations are performed, lead to the withering of links between nodes with a higher cognitive distance, and to the preservation of links between nodes with a lower cognitive distance. new connections are not formed, because cognitive distance is considered to be an exogenous parameter in this model, which does not vary once the innovation system starts evolving. real-world phenomena are not characterised by this restriction, which should be considered when analysing real-world systems under this model. the originality of this paper consists in the framing of an innovation system under different theoretical approaches, such as that of thermodynamically-open systems, self-organisation and evolution, dissipative systems, and cognitive distance, which, when combined, highlight another way of understanding the overall operation and the evolution of innovation systems. from this perspective, the process which we here describe accounts for an emergent complexity of the innovation system, which can occur without central planning and on the basis of information locally available to its members. this seems to confirm the theory according to which innovation systems can self-organise to solve, among others, the problem of transfer of knowledge among their members. this seems also to suggest that, if the only form of proximity which matters is cognitive, and not geographical, organisational, or other, it might be possible to infer cognitive distance between the members of an innovation system on the basis of the way in which their relationships change over time. the theoretical prediction which this model allows us to make is that, should a connection between members of an innovation system be preserved while others are dropped, this means that the cognitive distance between pairs of nodes with surviving connections is lower than that of other nodes in their ego-networks.
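the channel strengthening and withering process described above can be sketched as an iterative weight update; the multiplicative gain/loss rule, the pruning threshold and the bernoulli success model are illustrative assumptions, since the text describes the mechanism only qualitatively.

```python
# sketch of the reinforcement process described above: channels that deliver a
# solution gain weight, channels that do not are weakened and eventually wither.
# the multiplicative update, pruning threshold and success model are assumptions.
import random

def iterate(weights, distance, steps=200, gain=1.1, loss=0.9, prune=0.05):
    """weights and distance are dicts keyed by channel (edge); distance lies in [0, 1]."""
    for _ in range(steps):
        for e in list(weights):
            delivered = random.random() > distance[e]      # success is more likely at low distance
            weights[e] *= gain if delivered else loss
            if weights[e] < prune:                         # withered channels are dropped
                del weights[e]
    return weights

channels = {("seeker", "a"): 1.0, ("seeker", "b"): 1.0}
dist = {("seeker", "a"): 0.2, ("seeker", "b"): 0.8}        # a is cognitively closer than b
print(iterate(channels, dist))                             # typically only the channel to a survives
```

no new channels are created in this sketch, consistent with the exogenous-cognitive-distance restriction noted above.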
the modelling of the evolution of an innovation system that we propose also shows that, if an innovation system starts its evolution with a centrally, highly-connected organisation in a largely horizontal network of solver, where the cognitive distance between each pair of nodes is not uniformly distributed, then the system would evolve towards a lower-entropy hierarchical structure, in order to solve the problem of transfer of knowledge from the organisations at the periphery of the innovation system to the central organisation. our finding is consistent with the theory on modularity as an emergent property of complex adaptive innovation systems. subsequent research might apply the mathematical model described in this paper to a longitudinal study of the evolution of real-world innovation networks, in order to test whether the theory related to the spontaneous emergence of a hierarchical structure of innovation networks can be empirically supported. on the theoretical plane, further research could expand the understanding of the evolution of an innovation network by adding considerations related to the role which geographical and organisational proximity have in the development of the network, and add these factors to the model proposed. issues related to perturbation of the network, limit cycle of its evolution, and self-organised criticality in connection to our model may also be explored in subsequent works. sociologie et épistémologie evolution and structure of technological systems -an innovation output network leveraging complexity for ecosystemic innovation what is an innovation ecosystem networks for innovation and problem solving and their use for improving education: a comparative overview value creation from the innovation environment: partnership strategies in university spin-outs structural holes: the social structure of competition knowledge management, intellectual capital, structural holes, economic complexity and national prosperity innovation as a nonlinear process, the scientometric perspective, and the specification of an 'innovation opportunities explorer the role of the organization structure in the diffusion of innovations innovation contests, open innovation, and multiagent problem solving network structure and the diffusion of knowledge the limits to specialization: problem solving and coordination in "modular networks hierarchical organization in complex networks scale-free and hierarchical structures in complex networks what is a complex innovation system? 
cooperation, scale-invariance and complex innovation systems: a generalization growing silicon valley on a landscape: an agent-based approach to high-tech industrial clusters developing the art and science of innovation systems enquiry: alternative tools and methods, and applications to sub-saharan african agriculture prigogine's model for self-organization in nonequilibrium systems the evolution of dissipative social systems the meaning of open systems social systems knowledge, complexity and innovation systems institutional complementarity and diversity of social systems of innovation and production thermodynamic properties in the evolution of firms and innovation systems networks, national innovation systems and self-organisation self-organisation and evolution of biological and social systems functions of innovation systems: a new approach for analysing technological change on the sociology of intellectual stagnation: the late twentieth century in perspective disciplinary knowledge production and diffusion in science a knowledge-based theory of the organisation-the problemsolving perspective thinking: a guide to systems engineering problem-solving social complexity: patterns, processes, and evolution the evolution of economic and innovation systems a complexity-theoretic perspective on innovation policy emergence versus self-organisation: different concepts but promising when combined the evolution of innovation systems understanding evolving universityindustry relationships innovation as co-evolution of scientific and technological networks: exploring tissue engineering from technopoles to regional innovation systems: the evolution of localised technology development policy perspectives on cluster evolution: critical review and future research issues innovation, diversity and diffusion: a selforganisation model social information and self-organisation self-organization, knowledge and responsibility the dynamics of innovation: from national systems and "mode 2" to a triple helix of university-industry-government relations the value and costs of modularity: a problem-solving perspective a dissipative network model with neighboring activation the analysis of dissipative structure in the technological innovation system of enterprises modern thermodynamics: from heat engines to dissipative structures lessons from the nonlinear paradigm: applications of the theory of dissipative structures in the social sciences self-organization and dissipative structures: applications in the physical and social sciences understanding organizational transformation using a dissipative structure model technological paradigms, innovative behavior and the formation of dissipative enterprises a dissipative structure model of organization transformation entropy model of dissipative structure on corporate social responsibility revisiting complexity theory to achieve strategic intelligence defining knowledge management: toward an applied compendium enterprise knowledge capital intellectual capital-defining key performance indicators for organizational knowledge assets the dynamics of knowledge assets and their link with firm performance management mechanisms, technological knowledge assets and firm market performance problems and solutions in knowledge transfer empirical tests of optimal cognitive distance industry cognitive distance in alliances and firm innovation performance the impact of focal firm's centrality and knowledge governance on innovation performance the mathematical theory of communication the 
physics of spreading processes in multilayer networks dissipative control for linear discrete-time systems key: cord-259634-ays40jlz authors: marcelino, jose; kaiser, marcus title: critical paths in a metapopulation model of h1n1: efficiently delaying influenza spreading through flight cancellation date: 2012-05-15 journal: plos curr doi: 10.1371/4f8c9a2e1fca8 sha: doc_id: 259634 cord_uid: ays40jlz disease spreading through human travel networks has been a topic of great interest in recent years, as witnessed during outbreaks of influenza a (h1n1) or sars pandemics. one way to stop spreading over the airline network are travel restrictions for major airports or network hubs based on the total number of passengers of an airport. here, we test alternative strategies using edge removal, cancelling targeted flight connections rather than restricting traffic for network hubs, for controlling spreading over the airline network. we employ a seir metapopulation model that takes into account the population of cities, simulates infection within cities and across the network of the top 500 airports, and tests different flight cancellation methods for limiting the course of infection. the time required to spread an infection globally, as simulated by a stochastic global spreading model was used to rank the candidate control strategies. the model includes both local spreading dynamics at the level of populations and long-range connectivity obtained from real global airline travel data. simulated spreading in this network showed that spreading infected 37% less individuals after cancelling a quarter of flight connections between cities, as selected by betweenness centrality. the alternative strategy of closing down whole airports causing the same number of cancelled connections only reduced infections by 18%. in conclusion, selecting highly ranked single connections between cities for cancellation was more effective, resulting in fewer individuals infected with influenza, compared to shutting down whole airports. it is also a more efficient strategy, affecting fewer passengers while producing the same reduction in infections. the network of connections between the top 500 airports is available under the resources link on our website http://www.biological-networks.org. complex networks are pervasive and underlie almost all aspects of life. they appear at different scales and paradigms, from metabolic networks, the structural correlates of brain function, the threads of our social fabric and to the larger scale making cultures and businesses come together through global travel and communication [1] [2] [3] [4] [5] [6] . recently, these systems have been modelled and studied using network science tools giving us new insight in fields such as sociology, epidemics, systems biology and neuroscience. typically components such as persons, cities, proteins or brain regions are represented as nodes and connections between components as edges [6] [7] . many of these networks can be categorised by their common properties. two properties relevant to spreading phenomena are the modular and scale-free organization of real-world networks. modular network consist of several modules with relatively many connections within modules but few connections between modules. scale-free networks with highly connected nodes (hubs) where the probability of a node having k edges follows a power law k −γ [8] [9] . 
it is possible for a network to show both scale-free and modular properties; however, the two features may also appear independently. the worldwide airline network observed in this study was found to be both scale-free and modular [10] . spreading in networks is a general topic ranging from communication over the internet [11] [12], phenomena in biological networks [13] , or the spreading of diseases within populations [14] . scale-free properties of airline networks are of interest in relation to the error and attack tolerance of these networks [5] [15] . for scale-free networks, the selective removal of hubs produced a much greater impact on structural network integrity, as measured through increases in shortest-path lengths, than simply removing randomly selected nodes [15] . structural network integrity can also be influenced by partially inactivating specific connections (edges) between nodes [16] [17] [18] . dynamical processes such as disease spreading over heterogeneous networks were also shown to be impeded by targeting the hubs [19] [20] , with similar findings for highest traffic airports in the case of sars epidemic spreading [5] . in contrast to predictions of scale-free models, recent studies of the airline network [21] demonstrated that the structural cohesiveness of the airline network did not arise from the high degree nodes, but it was in fact due to the particular community structure, which meant some of the lesser connected airports had a more central role (indicated by a higher betweenness centrality, the ratio of all-pairs shortest paths crossing each node). here we expand on this finding further by considering a range of centrality measures for individual connections between cities, show that their targeted removal can improve on existing control strategies [5] for controlling influenza spreading and finally discuss the effect of the community structure on this control. to demonstrate the impact on influenza spreading caused by topological changes to the airline network, we run simulations using a stochastic metapopulation model of influenza [22] [23] where the worldwide network of commercial flights is used as the path for infected individuals traveling between cities (see fig. 1a with mexico city as starting node of an outbreak). for this, we observe individuals within cities that contain one of the 500 most frequently used airports worldwide (based on annual total passenger number). individuals within the model can be susceptible (s), infected (i), or removed (r). the number of infected individuals depends on the population of each city and the volume of traffic over airline connections between cities. note that the time course of disease spreading will also be influenced by seasonality [23] ; however, only spreading in one season was tested here. the simulated epidemic starts on 1 july 2007 from a single city, mexico city in our case, and its evolution over the following year is recorded. we then consider the number of days necessary for the epidemic to reach its peak as well as the maximum number of infected individuals (fig. 2a) . this procedure is then repeated following the removal of a percentage of connections ranked by a range of distinct measures such as edge betweenness centrality, jaccard coefficient or difference and product of node degrees. finally we also test the effect of shutting down the most highly connected airports (hubs) up to the same level of cancelled connections.
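as a rough illustration of the kind of metapopulation dynamics described above, the sketch below couples local s-e-i-r dynamics in each city through daily seat capacities between cities; it is deterministic, and its rates, populations and seat numbers are invented for illustration, so it is not the stochastic simulator of [22] [23].

```python
# simplified, deterministic sketch of a seir metapopulation step: local infection
# dynamics within cities plus coupling through daily seat capacities. all
# parameter values are illustrative assumptions.
import numpy as np

def step(S, E, I, R, seats, beta=0.5, sigma=0.25, gamma=0.25):
    N = S + E + I + R
    new_exp = beta * S * I / N          # new exposures within each city
    new_inf = sigma * E                 # exposed individuals becoming infectious
    new_rec = gamma * I                 # infectious individuals removed
    S, E, I, R = S - new_exp, E + new_exp - new_inf, I + new_inf - new_rec, R + new_rec
    # travel: susceptible, exposed and recovered individuals move in proportion to
    # seat capacity; infectious individuals do not travel, as assumed in the paper
    for X in (S, E, R):
        flow = seats * (X / N)[:, None]                  # expected travellers per route
        X += flow.sum(axis=0) - flow.sum(axis=1)         # arrivals minus departures
    return S, E, I, R

S = np.array([9e6, 5e6, 2e6]); E = np.array([100.0, 0.0, 0.0])
I = np.zeros(3); R = np.zeros(3)
seats = np.array([[0, 5000, 1000], [5000, 0, 800], [1000, 800, 0]], float)
for day in range(150):
    S, E, I, R = step(S, E, I, R, seats)
print(np.round(I))   # the seed in city 0 reaches the other cities via travelling exposed individuals
```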
comparing single edge removal strategies against the previously proposed shutdown of whole nodes (airports) we find that removing selected edges has a greater impact on the spreading of influenza with a significantly smaller loss of connectivity between cities. for the global airline network only a smaller set of flights routes between cities would need to be stopped instead of cancelling all the flights from a set of airports to get the same reduction in spreading. in addition as demonstrated in [21] for structural cohesiveness and in [24] regarding dynamical epidemic spreading, it is the community structure and not the degree distribution that plays a critical role in facilitating spreading. our method of slowing down spreading by removing critical connections is efficient as it targets links between such communities. concerning the computational complexity, whereas some strategies are computationally costly for large or rapidly evolving networks, several edge removal strategies are as fast as hub removal while still offering much better spreading control. note that whereas we observed similar strategies in an earlier study [25] , the current work includes the following changes: first, simulations run at the level of individuals rather than simulating whether the disease has reached ('infected') airports. second, the spreading between cities, over the airline network, now depends on the number of seats in airline connections between cities. this gives a much more realistic estimate of the actual spreading pattern as not only the existence of a flight connection but the specific number of passengers that flow over that link is taken into account. third, the previous study used an si model that is suitable for early stages of epidemic spreading. however, in this study we use an sir model that allows us to observe the time course of influenza spreading up to one year after the initial outbreak. for the network used in the study, the top 500 cities worldwide with the highest traffic airports became the nodes and an edge connects two of such nodes if there is at least one scheduled passenger flight between them. edges are then weighted by the daily average passenger capacity for that route. spreading in this network can then show how a disease outbreak, e.g. h1n1 or sars influenza, can spread around the world [5] [23] . as in previous studies [5] [26], we have used a similar methodology [22] where one city is the starting point for the epidemic and air travel between such cities offers the only transmission path for an infectious disease to spread between them. due to the relevance of the recent h1n1 (influenza a) epidemic we have used mexico city to be the epidemic starting point of our simulations. spreading simulations starting in mexico city with 100 exposed individuals were summarised by ninfectious the greatest number of infected individuals that were infectious at any time during the epidemic. spreading control strategies were evaluated by removing up to 25% of the flight routes and measuring the resulting decrease in ninfectious (see fig. 2a and methods). measures based on edge betweenness and jaccard coefficient were the two best predictors of critical edges (fig. 1a) . among the top intercontinental connections identified by betweenness centrality are flights from sao paulo (brazil) to beijing (china), sapporo (japan) to new york (usa) and montevideo (uruguay) to paris (france). 
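the evaluation procedure described above (rank the connections once on the intact network, remove an increasing fraction of them, re-run the spreading simulation and record the peak infectious count) can be sketched as follows; `simulate_peak` is a placeholder for whichever epidemic simulator is used, and the jaccard ranking of the endpoints' neighbourhoods (lowest similarity first) is shown as one example measure.

```python
# sketch of the evaluation loop described above. `simulate_peak` stands for the
# epidemic simulator (e.g. the metapopulation sketch earlier) and returns the
# peak infectious count on the damaged network.
import networkx as nx

def jaccard_ranking(g):
    scores = {(u, v): p for u, v, p in nx.jaccard_coefficient(g, g.edges())}
    return sorted(g.edges(), key=lambda e: scores[e])        # low-jaccard "shortcuts" first

def evaluate(g, ranked_edges, simulate_peak, fractions=(0.05, 0.10, 0.15, 0.20, 0.25)):
    results = {}
    for f in fractions:
        h = g.copy()
        h.remove_edges_from(ranked_edges[: int(f * g.number_of_edges())])
        results[f] = simulate_peak(h)
    return results

# usage (with any undirected graph g and a simulator function):
# curves = evaluate(g, jaccard_ranking(g), simulate_peak)
```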
after removing a quarter of all edges, both strategies showed a decrease in infected population of 37% for edge betweenness centrality and 23% for the jaccard coefficient, compared to only 18% for the hub removal strategy. (fig. 2 caption: (a) influenza spreading for mexico city as starting node, measured by the number of infected individuals over time on the intact network (blue) and after removing 25% of edges by hub removal (red) or edge betweenness (green); (b) maximum infected population following sequential edge elimination by betweenness centrality, jaccard coefficient, difference and product of degrees, and hub removal (see methods).) whereas in [23] a control strategy based on travel restrictions found that travel would need to be cut by 95% to significantly reduce the number of infected individuals, we observed that, by removing connections ranked by edge betweenness, this reduction appeared after 18% of flight routes were cancelled (see fig. 2b) . to understand the underlying mechanism of these results we produced two rewired versions of the original network: one version preserved the degree distribution alone, while another preserved both the latter and also the original community structure. applying the same spreading simulations on these rewired versions of the network showed that only on networks that preserved the original's community structure did we observe a significant reduction in infections when removing edges connecting nodes ranked by the jaccard coefficient (see fig. 3 ). for the 25% restriction level considered, betweenness centrality was the best measure even when no communities were present, offering a 41% reduction in infected cases in both types of network. this apparent advantage of betweenness even in networks without communities is due to its use of the capacity of each connection (edge weight); at 25% edge removal it will have removed most major high-capacity connections from the network. jaccard is a purely structural measure, without knowledge of capacity. the presence of communities is then critical for its performance. at lower levels of damage we see that jaccard is better than edge betweenness centrality at reducing infected cases in networks with community structure. selecting specific edges for removal efficiently controls spreading in the airline network. although this was not tested directly, cancelling fewer flights might also lead to fewer passengers being affected by these policies compared to the approach of cancelling mostly flights from highly connected nodes (hubs). with the same number of removed connections, edge removal strategies resulted in both a larger slowdown of spreading and a resulting much smaller number of infected individuals compared to hub removal strategies. edge betweenness was best at predicting critical edges that carried the greater traffic weighted by number of passengers traveling, resulting in a large reduction in the infectious population; however, we also observed that removing edges ranked using the purely structural jaccard coefficient (see fig. 2a ) led to the greatest delay in reaching the peak of the epidemic. among the best predictor edge measures, due to a computational complexity of o(n²), the jaccard coefficient is the fastest measure to calculate, making it particularly suitable for large networks or networks where the topology frequently changes. edge betweenness was the computationally most costly measure, with o(n·e) for a network with n nodes and e edges.
whereas hub removal was the worst strategy in this study, node centrality might lead to better results. indeed, previous findings [10] show that the most highly connected cities in the airline system do not necessarily have the highest node centrality. however, node centrality would be computationally as costly as edge betweenness. highly ranked connections predicted by edge measures were critical for the transmission of infections or activity and can be targeted individually with fewer disruptions for the overall network. in the transportation network studied, this means higher ranked individual connections could be cancelled instead of isolating whole cities from the rest of the world. results obtained from simulating the same spreading strategy over differently rewired versions of the airline network demonstrated the mechanism behind the performance of the jaccard predictor in slowing down spreading in networks that display a community structure, as is the case for spatially distributed real-world networks [27] [28] [29] . this is a good measure for these types of networks, given its good computational efficiency and the little information it requires to compute the critical links -it needs nothing else than to know the connections between nodes. the current study was testing different strategies and different percentages of removed edges leading to a large number of scenarios that had to be tested. therefore, several simplifications had to be performed whose role could be investigated in future studies. first, only one starting point, mexico city, for epidemics was tested. while this is in line with earlier studies using 1-3 starting points [5] [23] , it would be interesting to test whether there are exceptions to the outcomes presented here. second, spreading was observed only in one season, summer. previous work [23] has pointed out that the actual spreading pattern differs for different seasons. third, only the 500 airports with the largest traffic volume rather than all 3,968 airports were included in the simulation. while this was done in order to be comparable with the earlier study of hufnagel et al. [5] , tests on the larger dataset would be interesting. including airports with lower traffic volumes might preferable include national and regional airports within network modules. this could lead to a faster infection of regions; however, connections between communities would still remain crucial for the global spreading pattern. compared to our earlier study where the spreading of infection between airports rather than individuals was modelled [25] , edge betweenness could reduce the maximally infected population number more than targeting network hubs. the jaccard coefficient that showed very good performance in the earlier study [25] , however, did not perform better than the hub strategy. the difference and product of node degrees were poor strategies for both spreading models. this indicates that metapopulation models can lead to a different evaluation of flight cancellation strategies for slowing down influenza spreading. in conclusion, our results point to edge-based component removal for efficiently slowing spreading in airline and potentially other real-world networks. the network of connections between the top 500 airports is available under the resources link on our website http://www.biological-networks.org. note that distribution of the complete data set, including all airports and traffic volumes, is not allowed due to copyright restrictions. 
however, the complete dataset can be purchased directly from oag worldwide limited. as in other work [5] [10], we obtained scheduled flight data for one year provided by oag aviation solutions (luton, uk). this listed 1,341,615 records of worldwide flights operating from july 1, 2007 to july 30, 2008, which is estimated by oag to cover 99% of commercial flights. the records include the cities of origin and destination, days of operation, and the type of aircraft in service for that route. airports were uniquely identified by their iata code together with their corresponding cities. these cities became the nodes in the network. short-distance links corresponding to rail, boat, bus or limousine connections were removed from our data set. an edge connecting a pair of cities is present if at least one scheduled flight connected both airports. as in previous studies [5] , we used a sub-graph containing the 500 top airports that was obtained by selecting the airports with greater seat traffic combining incoming and outgoing routes. this subset of airports still represents at least 95% of the global traffic, and as demonstrated in [30] it includes sufficient information to describe the global spread of influenza. we are allowed to make the restricted data set of 500 airports available and you can download it under the resources link at http://www.biological-networks.org/. our analysis is based on the stochastic equation-based (seb) epidemic spreading model as used in [31] , simulating the spreading of influenza both within cities and at a global level through flights connecting the cities' local airports. within cities, a stochastically variable portion of the susceptible population establishes contact with infected individuals. this type of meta-population model accounts for 5 different states of individuals within cities: non-susceptible, susceptible, exposed, infectious, and removed (deceased). as we have not considered vaccination in this model we did not use the non-susceptible class in our study. movement of individuals between cities is determined deterministically from the daily average passenger seats on flights between cities. once infectious, individuals will not travel. we have assumed a moderate level of transmissibility between individuals, where r0 = 1.7, as also used in other influenza studies [31] [32] . note, however, that future epidemics of h5n1 and other viruses might have different transmission characteristics. in [5] a similar model including stochastic local dynamics was used; however, it was focused on a specific outbreak of sars (severe acute respiratory syndrome) and hong kong was considered its starting point. five candidate measures for predicting critical edges in networks were tested. the measures are based on a range of different parameters including node similarity, degree and all-pairs shortest paths. measures are taken only once from the intact network and are not recomputed after each removal step. edge betweenness centrality [33] [34] represents how many times that particular edge is part of the all-pairs shortest paths in the network. edge betweenness can show the impact of a particular edge on the overall characteristic path length of the network; a high value reveals an edge that will quite likely increase the average number of steps needed for spreading.
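a minimal sketch of ranking connections by edge betweenness centrality (computed with brandes' algorithm, as implemented in networkx) is given below; converting seat capacity into a path length as 1/seats, so that high-capacity routes attract shortest paths, is our assumption rather than the exact weighting used in the study, and the airports and seat numbers are purely illustrative.

```python
# sketch of ranking flight connections by edge betweenness centrality.
# the conversion of seat capacity into a distance (1 / seats) is an assumption;
# the routes and seat numbers below are invented for illustration.
import networkx as nx

g = nx.DiGraph()
g.add_edge("MEX", "JFK", seats=4200)
g.add_edge("JFK", "CDG", seats=3900)
g.add_edge("MEX", "GRU", seats=1500)
g.add_edge("GRU", "CDG", seats=900)

for u, v, d in g.edges(data=True):
    d["dist"] = 1.0 / d["seats"]                 # high-capacity routes become "short"

ebc = nx.edge_betweenness_centrality(g, weight="dist")
ranking = sorted(ebc, key=ebc.get, reverse=True)
print(ranking)                                    # connections lying on most shortest paths come first
```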
the jaccard similarity coefficient (or matching index [35] [36] ) shows how similar the neighbourhood connectivity structure of two nodes is, for example two nodes who shared the exact same set of neighbours would have the maximum similarity coefficient of 1. a low coefficient reveals a connection between two different network structures that might represent a "shortcut" between remote regions, making such low jaccard coefficient edges a good target for removal. the absolute difference of degrees for the adjacent nodes is another measure of similarity of two nodes. a large value here indicates a connection between a network hub a more sparsely connected region of the network. the product of the degrees of the nodes connected by the edge is high when both nodes are highly connected (hubs). for testing the absolute difference and product of degrees we also considered the opposite removal strategy (starting with lowest values) but the results showed to be consistently under-performing when compared to all other measures (not shown). finally, highly connected nodes will be detected and the nodes, and therefore all the edges of that node, will be removed from the network. note that this is referred to as 'hub removal strategy' whereas the impact is shown in relation to the number of edges which are removed after each node removal. original simulation code, as used in [23] , was obtained from the midas project (research computing division, rti international). the simulator was developed in java (sun microsystems, usa) programming language using the anylogic tm (version 5.5, xj technologies, usa) simulation framework to implement the dynamical model. network measures were implemented in custom matlab (r2008b, mathworks, inc., natick, usa) code. results were further processed in matlab . simulations were run in parallel on a 16-core hp proliant server, using the sun java 6 virtual machine. edge betweenness centrality was implemented using the algorithm by brandes [34] . links between cities in the network were considered to be directed, the network used included a total of 24,009 edges. mexico city was used as a starting node as observed in in the recent 2009 h1n1 pandemic. the starting date of the epidemic was assumed to be 1 july, and the pandemic evolution is simulated over the following 365 days, covering all the effects of seasonality as seen in both the southern and northern hemispheres. following the removal of each group of edges ranked by each control strategy, the spreading simulations were repeated. to test whether the mechanism of control arose from the particular community structure or degree distribution, we observed two different rewired versions of the original network. in one version only each individual node degree was maintained and the whole network was randomly rewired, destroying the original community structure. for the second, the original community structure was preserved but the sub-network within each community was rewired, so connections within the community were rearranged but the original inter-community links were preserved. both rewiring strategies preserved the original degree structure by the commonly used algorithm [37] in order to maintain the same number of passengers departing from each city and the number of passengers is only shuffled to different destinations. this way both strategies did not change in the number of passengers departing from each city, only the connectivity structure was modified. 
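the two rewiring controls described above can be approximated as sketched below; `double_edge_swap` preserves node degrees, and applying it within each community separately keeps the inter-community links untouched. communities are detected here with greedy modularity maximisation as a stand-in for the algorithm of [38], the swap counts are arbitrary, and edge weights (passenger numbers) are ignored for simplicity.

```python
# sketch of the two rewiring controls: degree-preserving rewiring of the whole
# network versus rewiring only within communities (inter-community links kept).
# greedy modularity communities stand in for the algorithm used in the paper.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def rewire_whole(g, seed=1):
    h = g.copy()
    nx.double_edge_swap(h, nswap=5 * h.number_of_edges(),
                        max_tries=100 * h.number_of_edges(), seed=seed)
    return h

def rewire_within_communities(g, seed=1):
    h = g.copy()
    for c in greedy_modularity_communities(g):
        internal = list(h.subgraph(c).edges())
        sub = nx.Graph(internal)
        sub.add_nodes_from(c)
        if sub.number_of_edges() < 4:                      # too few edges to swap safely
            continue
        nx.double_edge_swap(sub, nswap=5 * sub.number_of_edges(),
                            max_tries=200 * sub.number_of_edges(), seed=seed)
        h.remove_edges_from(internal)
        h.add_edges_from(sub.edges())
    return h

g = nx.planted_partition_graph(4, 60, 0.15, 0.002, seed=7)  # stand-in modular network
print(rewire_whole(g).number_of_edges(),
      rewire_within_communities(g).number_of_edges())       # both keep the original edge count and degrees
```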
the original community structure was identified using an heuristic modularity optimization algorithm [38] which identified four distinct clusters. these are predominantly geographic: one for north and central america, including canada and hawaii, another for south america, a third including the greater part of china (except hong kong, macau and beijing) and finally a fourth including all other airports (fig. 1b) . twenty rewired networks were generated for each version of the rewiring algorithm and the daily average evolution of influenza, using the same spreading algorithm as above, was taken across these 20 networks. this was repeated after the removal of each group of edges. therefore each measure on each of the rewired lots combines 182,500 individual results. classes of small-world networks exploring complex networks statistical mechanics of complex networks the structure and function of complex networks forecast and control of epidemics in a globalized world scale-free networks: complex webs in nature and technology graph theory emergence of scaling in random networks villas boas pr. characterization of complex networks: a survey of measurements the worldwide air transportation network: anomalous centrality, community structure, and cities' global roles breakdown of the internet under intentional attack epidemic spreading in scale-free networks error and attack tolerance of complex networks attack vulnerability of complex networks edge vulnerability in neural and metabolic networks multiple weak hits confuse complex systems: a transcriptional regulatory network as an example infection dynamics on scale-free networks superspreading and the effect of individual variation on disease emergence modeling the world-wide airport network a mathematical model for the global spread of influenza controlling pandemic flu: the value of international air travel restrictions. plos one superspreading and the effect of individual variation on disease emergence reducing in fl uenza spreading over the airline network the role of the airline transportation network in the prediction and predictability of global epidemics modeling the internet's large-scale topology nonoptimal component placement, but short processing paths, due to long-distance projections in neural systems community analysis in social networks sampling for global epidemic models and the topology of an international airport network controlling pandemic flu: the value of international air travel restrictions. plos one strategies for mitigating an influenza pandemic a set of measures of centrality based on betweenness a faster algorithm for betweenness centrality computational methods for the analysis of brain connectivity graph theory methods for the analysis of neural connectivity patterns. neuroscience databases. a practical guide specificity and stability in topology of protein networks watts dj, strogatz sh. collective dynamics of 'small-world' networks on the evolution of random graphs we thank http://www.flightstats.com for providing location information for all airports and oag worldwide limited for providing the worldwide flight data for one year. supported by wcu program through the national research foundation of korea funded by the ministry of education, science and technology (r32-10142). marcus kaiser was also supported by the royal society (rg/2006/r2), the carmen e-science project (http://www.carmen.org.uk) funded by epsrc (ep/e002331/1), and (ep/g03950x/1). 
jose marcelino was supported by an epsrc phd studentship (case/cna/06/25) with a contribution from e-therapeutics plc. the authors have declared that no competing interests exist.
key: cord-163462-s4kotii8 authors: chaoub, abdelaali; giordani, marco; lall, brejesh; bhatia, vimal; kliks, adrian; mendes, luciano; rabie, khaled; saarnisaari, harri; singhal, amit; zhang, nan; dixit, sudhir title: 6g for bridging the digital divide: wireless connectivity to remote areas date: 2020-09-09 journal: nan doi: nan sha: doc_id: 163462 cord_uid: s4kotii8 in telecommunications, network sustainability as a requirement is closely related to equitably serving the population residing at locations that can most appropriately be described as remote. the first four generations of mobile communication ignored the remote connectivity requirements, and the fifth generation is addressing it as an afterthought. however, sustainability and its social impact are being positioned as key drivers of the sixth generation's (6g) standardization activities. in particular, there has been a conscious attempt to understand the demands of remote wireless connectivity, which has led to a better understanding of the challenges that lie ahead. in this perspective, this article overviews the key challenges associated with constraints on network design and deployment to be addressed for providing broadband connectivity to rural areas, and proposes novel approaches and solutions for bridging the digital divide in those regions. in 2018, 55% of the global population lived in urban areas. further, 67% of the total world's population had a mobile subscription, but only 3.9 billion people were using the internet, leaving 3.7 billion unconnected, with many of those living in remote or rural areas [1]. people in these regions are not part of the information era, and this digital segregation imposes several restrictions on their daily lives. children growing up without access to the latest communication technologies and online learning tools are unlikely to be competitive in the job and commercial markets. unreliable internet connections also hinder people in remote areas from benefiting from online commerce and engaging in the digital world, thereby compounding already existing social and economic inequalities. abdelaali chaoub is with the national institute of posts and telecommunications (inpt), morocco (email: chaoub.abdelaali@gmail.com). marco giordani is with the department of information engineering, university of padova, padova, italy (email: giordani@dei.unipd.it). brejesh lall is with the indian institute of technology delhi, india (email: brejesh@ee.iitd.ac.in). vimal bhatia is with the indian institute of technology indore, india (email: vbhatia@iiti.ac.in). adrian kliks is with the poznan university of technology's institute of radiocommunications, poland (email: adrian.kliks@put.poznan.pl). luciano mendes is with the national institute of telecommunications (inatel), brazil (email: luciano@inatel.br). khaled rabie is with the manchester metropolitan university, uk (email: k.rabie@mmu.ac.uk). harri saarnisaari is with the university of oulu, finland (email: harri.saarnisaari@oulu.fi). amit singhal is with bennett university, india (email: singhalamit.iitd@gmail.com). nan zhang is with the department of algorithms, zte corporation (email: zhang.nan152@zte.com.cn). sudhir dixit is with the basic internet foundation and university of oulu (email: sudhir.dixit@ieee.org).
however, rural areas are now becoming more and more attractive, as the new coronavirus (covid-19) pandemic has shown, since it has reshaped our living preferences and pushed many people to work remotely from wherever makes them most comfortable [2]. such agglomerations where people live and work are referred to as "oases" in this paper. wireless connectivity in rural areas is expected to have a significant economic impact too. hence, the use of technology in farms and mines will increase productivity and open new opportunities for local communities. technology will also provide better education, higher quality entertainment, increased digital social engagement, enhanced business opportunities, higher income, and efficient health systems to those living in the most remote zones. despite these premises, advances in the communication standards towards provisioning of wireless broadband connectivity to remote regions have been, so far, relegated to the very bottom, if not entirely ignored. the fundamental challenges are a low return on investment, inaccessibility that hinders deployment and regular maintenance of network infrastructures, and lack of favorable spectrum and of critical infrastructure such as backhaul and the power grid. in this regard, despite being in its initial stages, the 6th generation (6g) of wireless networks is building upon the leftover from the previous generations [3], and will be developed by taking into account the peculiarities of the remote and rural sector, with the objective of providing connectivity for all and reaching digital inclusion [4]. specifically, the research community should ensure that this critical market segment is not overlooked in favor of the more appealing research areas such as artificial intelligence (ai), machine learning (ml), terahertz communications, 3d augmented reality (ar)/virtual reality (vr), and haptics. boosting remote connectivity can start by addressing spectrum availability issues. licensed spectrum in sub-1 ghz, in fact, is a cumbersome and costly resource, and may require a new frequency reuse strategy in remote regions because of their unique requirements. judicious utilization of locally unexploited frequencies and unlicensed bands may help in reducing the overall cost, thereby making remote connectivity a viable business opportunity. advanced horizontal and vertical spectrum sharing models, along with enhanced co-existence schemes, are two other powerful solutions to improve signal reach in these areas. innovative business and regulatory models may be suitable to encourage new players, such as community-based micro-operators, to build and operate the local networks. local, flexible and pluralistic spectrum licensing could be the way forward to boost the remote market. another issue is that remote areas may not have ample connectivity to power sources. hence, it is imperative that 6g solutions for remote areas are designed to be self-reliant in terms of their power/energy requirements, and/or with the capability to scavenge from the surrounding, possibly scarce, resources. governments can alleviate this situation to an extent by making it attractive for profit-wary service providers to deploy solutions in remote areas. revised government policies and appropriate business models should be explored in parallel, as they have direct implications on the technology requirements. environmentally-friendly thinking should also be included throughout the chain of energy consumed, from mining to manufacturing and recycling.
moreover, abundant renewable sources need to be integrated into power systems at all scales for sustainable energy provisioning. remote maintenance of network infrastructures and the incorporation of some degree of self-healing capability are also very important, since it might be difficult to access remote areas due to difficult terrain, harsh weather, or lack of transport connectivity. suitable specifications for fault tolerance and fallback mechanisms therefore need to be incorporated. based on the above introduction, the objective of this article is two-fold: (i) highlight the challenges that hinder progress in the development and deployment of solutions for catering to remote areas, and (ii) suggest novel approaches to address those challenges. in particular, the paper targets the 6g mobile standard, such that these important issues are considered in the design process from the very beginning. we deliberately skip a detailed literature survey, because a clear and comprehensive review is provided in [4]. we focus, rather, on discussing the requirements and the corresponding challenges, and on proposing novel approaches to address some of those issues. a summary of these challenges and possible solutions is shown in fig. 1. the rest of the article is organized as follows. sec. ii discusses the question of how future 6g can deliver affordable connectivity to remote users. sec. iii provides a range of promising technical solutions capable of facilitating access to broadband connectivity in remote locations. sec. iv promotes the use of a variety of dynamic spectrum access schemes and suggests how they can evolve to meet the surging needs in the unconnected areas. sec. v presents approaches for integrating infrastructure sharing, renewable sources, and emerging energy-efficient technologies to boost optimal and environmentally friendly power provision. sec. vi presents innovative ways to simplify maintenance operations in hard-to-reach zones. finally, the conclusions are summarized in sec. vii. one of the biggest impediments to connecting the unconnected part of the world is the high cost involved and the prevailing low income of the target population. fortunately, there are many affordable emerging alternatives in 6g which may bring new possibilities, as enumerated in this section. dedicated remote-centred connectivity layer. besides 5g's typical service pillars (i.e., embb, urllc, and mmtc), 6g should introduce a fourth service grade with basic connectivity target key performance indicators (kpis). however, this remote mode cannot be just a plain version of the urban 6g, since it has to be tailored to the specificities of the remote sector. some kpis relevant to remote connectivity scenarios, like coverage and cost-effectiveness, need to be expanded, whereas the new service class needs more relaxed constraints in terms of some conventional 5g performance metrics like throughput and latency. this novel service class should have its own dedicated slice and be endowed with specific and moderate levels of edge and caching capabilities: the involved data can then be processed in edge, local or central data centers for better scalability, as illustrated in fig. 2. accordingly, such connectivity services can be charged at reduced prices. local access in remote areas can be designed to aggregate multiple and heterogeneous rats. remote streams can then be split over one or more rats, thus allowing flexibility and providing the highest performance possible at minimal cost in everyday life and work.
at the same time, digitalization in remote areas calls for large-coverage solutions (e.g., tv or gsm white spaces (wss)) to increase the number of users within a base station and help reduce network deployment and management costs, albeit at some performance trade-offs. radio frequency (rf) solutions can be complemented by emerging optical wireless communications (owcs). in particular, the short-range visible light communications (vlcs) category, operating over the visible spectrum, can boost throughput in indoor, fronthaul and underwater environments (see fig. 3) while serving the intuitive goal of illumination, making it a cost-efficient technology. low-cost networking and end-user devices. one way to reduce cost is the exploitation of legacy infrastructure. tv stations can be shared with mobile network operators (mnos) to provide both tower and electricity. the latest developments in wireless communications can be applied in outdoor power line communication (plc) to provide high data rate connectivity over high- and medium-voltage power lines, increasing the capability of backhaul networks in remote areas. existing base stations and the already-installed fibers alongside roads or embedded inside electrical cables can also serve as a backhaul solution for connectivity in rural regions. end-user devices and modems should also be affordable and usable everywhere, i.e., when people move or travel to different places under harsh conditions. therefore, the possibility to use off-the-shelf equipment at both the user's and the network's side is important, and integration with appropriate software stacks is welcome to reduce capital and operational expenditures (capex and opex). the remote infrastructure is likely to be deployed by small internet service providers (isps), and the cost of specialized hardware equipment is an issue to be overcome. open source approaches allow mnos to choose common hardware from any vendor and implement the radio access network (ran) and core functionalities using software defined radio (sdr) and software defined networking (sdn) frameworks. moreover, virtualized and cloudified network functions may reduce infrastructure, maintenance and upgrade costs [5]. these solutions are especially interesting for new players building the remote network from scratch, to foster the inter-operability and cost-effectiveness of hardware and software. however, this field still requires further research and development work before commercial deployment. in remote areas, in order to provide long-lived broadband connectivity, a minimum service quality must be continuously guaranteed. in this perspective, this section reviews potential solutions to promote resilient service accessibility in rural areas. multi-hop network elasticity. the access network has, over generations, become multi-hop to provide flexibility in the architecture design, despite some increase in complexity. given the typical geographic, topographic, and demographic constraints of the present scenarios, the performance levels (e.g., coverage, latency, and bandwidth) of individual hops can be made adaptive. the idea is to extend performance elasticity beyond the air interface to include other hops in the ran. the same approach can be brought to backhaul connections (see fig. 2). similarly, rural cell boundaries experiencing poor coverage can reap the elasticity benefits through the use of device-to-device (d2d) communications, as depicted in fig. 3.
network protocols should be extended to include static (e.g., location-based) quality adaptation in addition to temporal quality adaptation, to handle variations in channel quality over time. wireless backhaul solutions. service accessibility in rural areas involves prohibitive deployment expenditures for network operators and requires high-capacity backhaul connections for several different use cases. fig. 2 provides a comprehensive overview of potential backhaul solutions envisioned in this paper to promote remote connectivity. on one side, laying more fiber links substantially boosts broadband access in those areas, but at the expense of increased costs. plc connections, on the other side, provide ease of reach at lower costs, making use of ubiquitous wired infrastructures as a physical medium for data transmission, but some inherent challenges related to harsh channel conditions and connected loads are still to be overcome. fig. 2 also illustrates how, even though the use of conventional microwave and satellite links can fulfill the performance requirements of hard-to-reach zones, emerging long-range wireless technologies, such as tv and gsm ws systems, are capable of delivering the intended service over longer distances with less power while penetrating through difficult terrain like mountains and lakes. another recent trend is building efficient, cost-effective backhaul links using software-defined technology embedded into off-the-shelf multi-vendor hardware to connect the unconnected remote communities (e.g., oasis 1 in fig. 2). recently, the research community has also investigated integrated access and backhaul (iab) as a solution to replace fiber-like infrastructures with self-configuring, easier-to-deploy relays operating through wireless backhaul using part of the access link radio resources [6]. for example, the tv ws tower in fig. 2 may use the tv spectrum holes to provide both access to oasis 3 and connection to the backhaul link for oasis 4. iab has lower complexity compared to fiber-like networks and facilitates site installation in rural areas where cable buildout is difficult and costly. the potential of the iab paradigm is magnified when wireless backhaul is realized at millimeter waves (mmwaves), thus exploiting a much larger bandwidth than in sub-6-ghz systems. moreover, mmwave iab enables multiplexing the access and backhaul data within the same bands, thereby removing the need for additional hardware and/or spectrum license costs. nowadays, free space optical (fso) links are being considered as a powerful full-duplex and license-free alternative to increase network footprint in isolated areas with challenging terrains. however, fso units are very sensitive to optical misalignment. for instance, the hop1 fso unit depicted in fig. 2 should be permanently and perfectly aligned with the fso unit installed at the hop3 location. in-depth research on spherical receivers and beam scanning is hence needed to improve the capability of intercepting laser light arriving from multiple angles. physical-layer solutions for front/mid/backhaul. even though wireless backhauling can reduce deployment costs, service accessibility in rural regions still requires a minimum number of fiber infrastructures to be already deployed.
fiber capacity can hence be increased if existing wavelength division multiplexing networks are migrated to elastic optical networks (eons) by technology upgrades at the nodes; the outdated technology of urban regions may then be reused to establish connectivity in under-served rural regions without significant investment. besides backhaul, midhaul and fronthaul should also be improved by ai/ml-based solutions providing cognitive capabilities for prudent use of the available licensed and unlicensed spectrum [7]. this is especially useful in remote areas where the sparse distribution of users may result in spectrum holes. the unlicensed spectrum, in particular, can provide significant cost savings for service delivery and improve network elasticity. new possibilities, including evolved multiple access schemes and waveforms, like non-orthogonal multiple access (noma) for mmtc, should be investigated; this technology is particularly interesting for internet of things (iot) services where some sensors are close to and some far away from a base station [8]. ai/ml can also be exploited to control the physical and link layers for smooth and context-aware modulation and coding scheme (mcs) transitions, even though this approach would need to be lightweight to reduce cost and maintenance, and optimized for the intended market segment. non-terrestrial network solutions. network densification in rural areas is complicated by the heterogeneous terrain that may be encountered when installing fibers between cellular stations. to solve this issue, 6g envisions the deployment of non-terrestrial networks (ntns) where air/spaceborne platforms like unmanned aerial vehicles (uavs), high altitude platform stations (hapss), and satellites provide ubiquitous global connectivity when terrestrial infrastructures are unavailable [9]. potential beneficiaries of this trend are shown in fig. 3, including inter-regional transport, farmlands, ships, mountainous areas, and remote maintenance facilities. the evolution towards ntns will be favored by architectural advancements in the aerial/space industry (e.g., through solid-state lithium batteries and gallium nitride technologies), new spectrum developments (e.g., by transitioning to mmwave and optical bands), and novel antenna designs (e.g., through reconfigurable phased/inflatable/fractal antennas realized with metasurface materials). despite these premises, however, there are still various challenges that need to be addressed, including those related to latency and coverage constraints. ntns can also provide remote-ready, low-cost (yet robust), and long-range backhaul solutions for terrestrial devices with no wired backhaul. self-organizing networks (sons). to explicitly address the problem of network outages (e.g., due to backhaul failure), which are very common in remote locations, 6g should transition towards sons implementing network slicing, dynamic spectrum management, edge computing, and zero-touch automation functionalities. this approach provides extra degrees of freedom for combating service interruptions and improves network robustness. in this context, ai/ml can help both the radio access and backhaul networks to self-organize and self-configure themselves, e.g., to discover each other, coordinate, and manage the signaling and data traffic. we now present some promising solutions to address spectrum availability issues, which currently pose a serious impediment to broadband connectivity in remote areas.
leveraging cognitive radio networks. one of the major barriers to network deployment in rural areas is spectrum licensing, since participation in spectrum auctions is typically difficult, from an economic point of view, for small isps. in this perspective, new licensing schemes can foster the cognitive radio approach, allowing local isps to deploy networks in areas where large operators are not interested in providing their service [10]. spectrum awareness mechanisms, e.g., geolocation databases and spectrum sensing, can be used to inform network providers about vacant spectrum in a given area, as well as to provide protection against unauthorized transmissions and unpredictable propagation conditions. for instance, fig. 3 shows how tv and gsm ws towers can expand connectivity beyond the rural households to reach more distant locations like farms and wilderness areas. spectrum co-existence. sub-6 ghz frequencies remain critical for remote connectivity thanks to their favourable propagation properties and wide reach. in these crowded bands, spectrum re-farming and inter/intra-operator spectrum sharing can considerably increase spectrum availability [10]. nevertheless, coverage gaps and low throughput in the legacy bands call for advanced multi-connectivity schemes to combine frequencies above and below 6 ghz. using advanced carrier aggregation techniques in 6g systems, the resource scheduling unit can choose the optimal frequency combination(s) according to service requirements, device capabilities, and network conditions. the proposed model offers a scalable bandwidth that maintains service continuity in case of connectivity loss in those spectrum bands that are more sensitive to the surrounding relief, atmospheric effects, and water absorption: for example, fig. 3 illustrates a scenario in which vital facilities in rural communities enjoy permanent connectivity using the lower bands in case of communication failure on the higher bands. likewise, multi-connectivity provides diversity, improved system resilience, and situation awareness by establishing multiple links from separate sources to one destination. this aggregation can be achieved at various protocol and/or architecture levels, ranging from the radio link up to the core network, allowing effortless deployment of elastic networks in areas that are difficult to access. utilizing unlicensed bands. a combination of licensed and unlicensed bands has been acknowledged by many standardization organizations to improve network throughput and capacity in unserved/under-served rural areas, as depicted in fig. 3. while the fcc has recently released 1.2 ghz in the precious 6 ghz bands to expand the unlicensed spectrum, the huge bandwidth available at millimeter- and terahertz-wave bands will further support uplink and downlink split, in addition to hybrid spectrum sharing solutions that can adaptively orchestrate network operations in the licensed and unlicensed bands. high frequencies require line of sight (los) for proper communication, complicating harmonious operation with the lower bands. accordingly, time-frequency synchronization, as well as control procedures and listening mechanisms, like listen-before-talk (lbt), need to evolve towards more cooperative and distributed protocols to avoid misleading spectrum occupancy. the management of uncoordinated competing users in unlicensed bands will emerge as an important issue, and it needs to be addressed in 6g networks.
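the band-selection logic discussed above can be made concrete with a purely illustrative sketch: keep a robust sub-6 ghz anchor for coverage and service continuity, and aggregate any higher band whose link is currently usable. the band names, capacities and data structures below are assumptions chosen for illustration, not part of any 6g specification or of the architecture proposed in this paper.

```python
from dataclasses import dataclass

@dataclass
class Band:
    name: str
    capacity_mbps: float   # nominal capacity when the link is usable
    link_up: bool          # e.g., a mmwave link lost under rain or blockage

def select_bands(bands, anchor="sub6"):
    """Toy carrier-aggregation rule: always keep the sub-6 GHz anchor when it is
    up, and aggregate every other band whose link is currently healthy."""
    selected = [b for b in bands if b.name == anchor and b.link_up]
    selected += [b for b in bands if b.name != anchor and b.link_up]
    total = sum(b.capacity_mbps for b in selected)
    return selected, total

bands = [Band("sub6", 50.0, True), Band("mmwave", 800.0, False), Band("tvws", 10.0, True)]
chosen, capacity = select_bands(bands)
print([b.name for b in chosen], capacity)   # mmwave dropped -> service continues on sub6 + tvws
```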
regional licenses and micro-operators. the deployment of terrestrial networks for remote areas is challenging due to terrain, lack of infrastructure, and lack of personnel. network operators would then rather roam their services from telecommunication providers already operating in those areas than build their own infrastructure. however, such an approach may entail the need for advanced horizontal (between operators of the same priority) and vertical (when stakeholders of various priorities coexist) spectrum/infrastructure sharing frameworks. solutions like licensed shared access (lsa, in europe) and the spectrum access system (sas, in the us) are mature examples of such an approach, with two tiers and three tiers of users, respectively. this can evolve to include n tiers of users belonging to m different mnos. an example of four-tiered access is provided in fig. 3. from the top, we find the e-safety services with the highest priority, a tier-2 layer devoted to e-learning sessions and e-government transactions, a middle-priority tier-3 layer for iot use cases that generate sporadic traffic, and a final lower-priority tier-4 layer that uses the remainder of the available spectrum (e.g., for e-commerce services). such solutions, however, need to be supported by innovative business and regulatory models to motivate new market entrants (e.g., micro-operators, which are responsible for last-mile service delivery and infrastructure management) to offer competitive and affordable services in remote zones [11]. power supply is among the highest expenses of mnos and a major bottleneck for ensuring reliable connectivity in remote areas. mnos' profitability and reliable powering can be improved following (a combination of) these solutions, as summarized in fig. 4. infrastructure sharing. local communication/power operators, as well as various stakeholders such as companies, manufacturers, governmental authorities and standardization bodies, should pursue an integrated design which entails a joint network development process right from the installation phase. in particular, the different players should cooperate to avoid deploying several power plants for different use cases, thus saving precious (already limited) economic resources for other types of expenses. efficient and optimal energy usage. the 6g remote area solutions should be energy efficient and allow base and relay stations to minimize power consumption while guaranteeing affordable yet sufficient service for residents [12]. in particular, energy efficiency should target the design and deployment of iot sensors, since the use of a massive number of iot devices, e.g., to boost farming and other activities such as environmental monitoring, is expected to increase significantly in the near future. so far, these efforts have been made after the standardization work was completed, but 6g should include efficient use of energy during the standardization process itself. techniques like cell zooming, relying on power control and adaptive coverage, can be reused at various network levels for flexible, energy-saving front/mid/backhaul layouts. ai/ml techniques can be very helpful in these scenarios. for example, the traffic load statistics on each node can be monitored to choose the optimal cell sleeping and on/off switching strategies to deliver increased power efficiency in all the involved steps of communication.
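as a toy illustration of the load-driven cell sleeping mentioned above, the sketch below switches a base station off when a moving average of its traffic load stays below a threshold and wakes it up when demand returns. the window, thresholds and interface are hypothetical placeholders; a real deployment would replace this hand-tuned rule with a learned (ai/ml) policy.

```python
from collections import deque

class SleepController:
    """Illustrative threshold rule on a moving average of traffic load."""

    def __init__(self, window=12, sleep_below=0.10, wake_above=0.25):
        self.history = deque(maxlen=window)   # e.g., the last 12 five-minute samples
        self.sleep_below = sleep_below
        self.wake_above = wake_above
        self.sleeping = False

    def update(self, load):
        """load: fraction of cell capacity currently in use (0..1)."""
        self.history.append(load)
        avg = sum(self.history) / len(self.history)
        if not self.sleeping and avg < self.sleep_below:
            self.sleeping = True      # hand users over to the umbrella/macro layer
        elif self.sleeping and avg > self.wake_above:
            self.sleeping = False     # demand is back, switch the cell on again
        return self.sleeping

ctrl = SleepController()
for load in [0.05, 0.04, 0.03, 0.02, 0.30, 0.40]:
    print(ctrl.update(load))
```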
technological breakthroughs. in addition to obvious energy sources such as solar, wind, and hydraulics, energy harvesting from ambient resources (e.g., electromagnetic signals, vibration, movement, sound, and heat) could provide a viable and efficient solution by enabling energy-constrained nodes to scavenge energy while performing simultaneous wireless information and power transfer [13]. another recent advancement promoting energy-efficient wireless operations is the use of intelligent reflecting surfaces (irss), equipped with a large number of passive elements smartly coordinated to reflect any incident signal to its intended destination without necessitating any rf chain. although still in its infancy, this technology offers significant advantages in making the propagation conditions in harsh remote areas more favorable, with substantial energy savings [14]. vi. intelligent and affordable maintenance. operations, administration and management (oam) functionalities and dedicated maintenance for each network component are of paramount importance to overall system performance and user experience in traditional commercial 4g/5g networks. this comes at the expense of complicated and costly tasks, especially in hard-to-reach areas. in this section we present innovative ideas to enable intelligent and cost-effective maintenance in 6g networks deployed in rural regions. network status sensing and diagnosing. traditionally, the oam system is adopted for network status monitoring with a major drawback, i.e., manual post-processing and reporting time delays due to the huge amounts of gathered data. to enable intelligent and predictive maintenance, network diagnostics relying on ai-based techniques is advised [15]. with the development of edge computing technologies, multi-level sensing can be employed to achieve near-real-time processing and multi-dimensional information collection within a tolerable reporting interval. for instance, processing operations related to short-term network status could be mostly done at the edge node to ensure fast access to this vital information in rural zones. network layout planning and maintenance. as mentioned in the previous sections, the network in remote areas is mainly composed of cost-effective nodes along the path from the access to the core parts (e.g., radio, centralized and distributed units, iab-donors and relays) that need to be organized in either single or multiple hops. in this situation, the whole system will be harmed if one of these nodes experiences an accidental failure. to enhance the resilience of such networks, more flexible and intelligent network layout maintenance is required. more precisely, using evolved techniques such as sons (see sec. iii), the link between each pair of nodes within the network can be permanently controlled and dynamically substituted or restored in case of an outage (see fig. 5). additionally, since a big part of the next generation mobile network is virtualized, appropriate tools or even a dedicated server may be needed for automatic software update monitoring, periodic backups and scheduled maintenance to avoid, or at least minimize, the need for on-site intervention in those remote facilities. automatic fallback mechanisms can also be scheduled to downgrade the connectivity to another technology under bad network conditions, e.g., by implementing appropriate multi-connectivity schemes, as described in sec. iv.
network performance optimization. network optimization in rural areas should take into account remote-specific requirements and constraints. for example, access to the edge resources, which are finite and costly and can be rapidly exhausted, should be optimized taking into consideration the intended services, terminal capabilities, and the charging policy of the network and its operator(s). a summary of the maintenance life cycle in remote and rural areas is shown in fig. 5. in particular, after intelligently building and processing relevant system information data sets, maintenance and repair activities (e.g., system updates or operational parameter optimization) can be performed remotely and safely using 3d virtual environments such as ar and vr. the problem of providing connectivity to rural areas will be a pillar of future 6g standardization activities. in this article we discuss the challenges and possible approaches to addressing the needs of remote areas. it is argued that such a service should be optimized for providing a minimum fallback capability, while still providing full support for spatiotemporal service scalability and graceful quality degradation. we also give insights on the constraints on network design and deployment for rural connectivity solutions. we claim that optimally integrating ntn and fso technologies along the path from the end-point to the core element, using open software built on top of off-the-shelf hardware, can provide low-cost broadband solutions in extremely harsh and inaccessible environments, and can be the next disruptive technology for 6g remote connectivity. integration of outdated technologies should also be provisioned so that they may be innovatively used to serve remote areas. such provisions should extend to integrating open and off-the-shelf solutions to fully benefit from cost advantage gains. spectrum, regulatory, and standardization issues are also discussed because of their importance to achieving the goal of remote area connectivity. it is fair to say that including remote connectivity requirements in the 6g standardization process will lead to more balanced and universal social as well as digital equality.
references:
- 6g white paper on connectivity for remote areas
- the covid-19 pandemic and its implications for rural economies
- toward 6g networks: use cases and technologies
- a key 6g challenge and opportunity - connecting the base of the pyramid: a survey on rural connectivity (wireless personal communications)
- integrated access and backhaul in 5g mmwave networks: potential and challenges
- the roadmap to 6g: ai empowered wireless networks
- application of non-orthogonal multiple access in wireless sensor networks for smart agriculture
- a comprehensive simulation platform for space-air-ground integrated network
- 5g technology: towards dynamic spectrum sharing using cognitive radio networks
- business models for local 5g micro operators
- closing the coverage gap - how innovation can drive rural connectivity
- a critical review of roadway energy harvesting technologies
- towards smart and reconfigurable environment: intelligent reflecting surface aided wireless network
- mechanical fault diagnosis and prediction in iot based on multi-source sensing data fusion
he has been with inpt in rabat (morocco) since 2015. his research interests are related to spectrum sharing for 5g/b5g networks, cognitive radio networks, smart grids, cooperative communications in wireless networks, and multimedia content delivery. he is a paper reviewer for several leading international journals and conferences.
he has accumulated intersectoral skills through work experience both in academia and industry, as a senior voip solutions consultant at alcatel-lucent.
he received his ph.d. in information engineering in 2020 from the university of padova, italy, where he is now a postdoctoral researcher and adjunct professor. he visited nyu and the toyota infotechnology center.
he is currently a professor in the department of electrical engineering at the indian institute of technology delhi. previously, he served in the digital signal processing group of hughes software systems for 8 years. his research interests lie in the areas of signal processing and machine learning. he has extensively applied signal processing / machine learning techniques.
he received his ph.d. degree from the institute for digital communications at the university of edinburgh (uoe), uk, in 2005. during his ph.d. studies he also received the ieee fellowship for collaborative research at carleton university, ottawa, canada. he has authored/co-authored more than 230 peer-reviewed journal and conference papers.
[sm] is an assistant professor at poznan university of technology's institute of radiocommunications, poland, and he is a co-founder and board member of rimedo labs company. his research interests include new waveforms for wireless systems applying either non-orthogonal or non-contiguous multicarrier schemes, cognitive radio, advanced spectrum management, and deployment and resource management in small cells.
since 2001 he has been a professor at the national institute of telecommunications (inatel), brazil, where he acts as research coordinator of the radiocommunications reference center.
a fellow of the higher education academy, he received his ph.d. degree from the university of manchester, uk. he is currently an assistant professor at the manchester metropolitan university, uk. his primary research focuses on various aspects of next-generation wireless communication systems. he received the best student paper award at the ieee isplc (tx, usa, 2015) and the ieee access editor of the month award.
[sm] received his ph.d. degree from the university of oulu in 2000, where he has been with the centre for wireless communications since 1994. he is currently a university researcher and his current research interests include remote area connectivity.
amit singhal [m] received his phd degree in electrical engineering from the indian institute of technology delhi in 2016. he is currently working as an assistant professor at bennett university, greater noida, india. his research interests include next generation communication systems, the fourier decomposition method, image retrieval and molecular communications.
he received the bachelor degree in communication engineering and the master degree in integrated circuit engineering from tongji university.
he was a distinguished chief technologist and cto of the communications and media services for the americas region of hewlett-packard enterprise services in palo alto, ca, and the director of hewlett-packard labs india in palo alto and bangalore. before joining hp, he worked for blackberry, nsn, nokia and verizon. he has published 8 books, over 200 papers and holds 21 us patents.
key: cord-125979-2c2agvex authors: mata, angélica s. title: an overview of epidemic models with phase transitions to absorbing states running on top of complex networks date: 2020-10-05 journal: nan doi: nan sha: doc_id: 125979 cord_uid: 2c2agvex dynamical systems running on the top of complex networks have been extensively investigated for decades.
but this topic still remains among the most relevant issues in complex network theory due to its range of applicability. the contact process (cp) and the susceptible-infected-susceptible (sis) model are used quite often to describe epidemic dynamics. despite their simplicity, these models are robust enough to predict the kernel of real situations. in this work, we review concisely both processes, which are well-known and widely applied examples of models that exhibit absorbing-state phase transitions. in the epidemic scenario, individuals can be infected or susceptible. the disease-free (absorbing) state and an active stationary phase (where a fraction of the population is infected) are separated by an epidemic threshold. for the sis model, the central issue is to determine this epidemic threshold on heterogeneous networks. for the cp model, the main interest is to relate critical exponents with statistical properties of the network. the present paper briefly reviews the modeling and theory of non-equilibrium dynamical systems on networks. a key class of non-equilibrium processes are those that exhibit absorbing states, i.e., states from which the dynamics cannot escape once it has fallen onto them. a relevant feature of systems that present absorbing states is a non-equilibrium phase transition between an active state, in which the activity lasts forever in the thermodynamic limit, and an absorbing state, in which activity is absent [1, 2]. the same type of transition occurs in epidemic spreading processes [3], since a fully healthy state is absorbing in the above sense, provided that spontaneous birth of infected individuals is not allowed. the susceptible-infected-susceptible (sis) [4] model and the contact process (cp) [5] represent some of the simplest epidemic models possessing an absorbing-state phase transition. lattice systems that exhibit such absorbing-state phase transitions have universal features, determined by conservation laws and symmetries, which allow grouping them into the same universality class [6]. the most robust class of absorbing-state phase transitions is directed percolation, which was originally introduced as a model for directed random connectivity [7]. both the cp and sis models are interacting particle systems involving self-annihilation and catalytic creation of particles that present an absorbing-state phase transition and thus belong to the directed percolation class. the sis dynamics is indeed the most studied model to describe epidemic spreading on networks. although the cp was initially thought of as a toy model for epidemics, lately it has been widely used as a generic reaction-diffusion model to study phase transitions with absorbing states. another epidemic model that also presents an absorbing phase transition is the susceptible-infected-recovered-susceptible (sirs) model [4]. it is an extension of the standard sis model, allowing a temporary immunity of nodes. both the sis and sirs models are equivalent from the mean-field theory perspective, but the mechanism of immunization changes the behavior of the epidemic dynamics depending on the heterogeneity of the network structure. the susceptible-infected-recovered (sir) model is another example of an epidemic model, with permanent immunity, meaning that a recovered node can no longer return to the susceptible compartment; the system therefore presents many absorbing states, since each configuration that has only susceptible and recovered nodes is absorbing [8, 9].
in the face of this context, we review the sis and cp models as examples to investigate absorbing-state phase transitions in complex networks. we first explain, in section ii, some basic concepts related to complex networks that are required to understand the main idea of this paper. then we describe both epidemic models in section iii and, in section iv, we present distinct theoretical approaches devised for them. in section v, we describe some commonly used simulation techniques to analyze both models numerically. for the sis model, the central issue is to determine an epidemic threshold separating an absorbing, disease-free state from an active phase on heterogeneous networks [10-18]. for the cp model, most of the interest is to relate the critical exponents with statistical properties of the network, in particular the degree distribution [19-24]. in sections vi and vii we present a discussion of these points related to the cp and sis models, respectively. finally, in section viii we draw our final comments. network analysis is a powerful tool that provides a fruitful framework to describe phenomena related to real-world complex systems. here we describe just some features of complex networks that will be used throughout the paper. we also present the uncorrelated configuration model (ucm) [25], the substrate that will be used to model the dynamics of the epidemic process on networks. we can represent a network by means of an adjacency matrix a. a graph of n vertices has an n × n adjacency matrix. the edges can be represented by the elements a_{ij} of this matrix such that [26]

a_{ij} = \begin{cases} 1, & \text{if the vertices } i \text{ and } j \text{ are connected} \\ 0, & \text{otherwise,} \end{cases} \qquad (1)

for an undirected and unweighted graph. in this case, the adjacency matrix is symmetric, i.e., a_{ij} = a_{ji}. a relevant piece of information obtained from the adjacency matrix is the degree k_i of a vertex i, defined as the number of links that the vertex i has, i.e., the number of nearest neighbors of the vertex i. the degree of the vertices can be written as [27]

k_i = \sum_j a_{ij}. \qquad (2)

when it comes to very large systems, a suitable description can be given by means of statistical measures such as the degree distribution p(k). the degree distribution provides the probability that a vertex chosen at random has k edges [28, 29]. the average degree is a piece of information that can be extracted from p(k), given by the average value of k over the network,

\langle k \rangle = \sum_k k \, p(k). \qquad (3)

similarly, it can be useful to generalize and calculate the n-th moment of the degree distribution [27],

\langle k^n \rangle = \sum_k k^n p(k). \qquad (4)

(these quantities are illustrated in the short code sketch below.) we can classify networks according to their degree distribution. the basic classes are homogeneous and heterogeneous networks. the former exhibit a fast-decaying tail, as for example a poisson distribution; here the average degree corresponds to the typical value in the system. heterogeneous networks exhibit a heavy tail that can be approximated by a power-law decay, p(k) ∼ k^{−γ}. in this kind of network, the vertices often have a small degree, but there is a non-negligible probability of finding nodes with very large degree values; thus, depending on γ, the average degree does not represent any characteristic value of the distribution [27]. many network models have been created in order to describe real systems. the advantage of using a model is to reduce the complexity of the real world to a level that can be treated, for example, from the perspective of a mean-field approach.
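before moving on to specific network models, the quantities defined in eqs. (1)-(4) are straightforward to compute from any adjacency matrix; the sketch below does so with numpy as a minimal illustration (the small random matrix used here is only a placeholder, not one of the networks studied in the cited works).

```python
import numpy as np

def degree_moments(a, n_max=2):
    """Degrees, empirical degree distribution P(k) and moments <k^n> from an adjacency matrix."""
    a = np.asarray(a)
    k = a.sum(axis=1)                          # eq. (2): k_i = sum_j a_ij
    kvals, counts = np.unique(k, return_counts=True)
    pk = counts / counts.sum()                 # empirical P(k)
    moments = {n: float(np.sum(kvals ** n * pk)) for n in range(1, n_max + 1)}
    return k, dict(zip(kvals.tolist(), pk.tolist())), moments

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=(100, 100))
a = np.triu(a, 1); a = a + a.T                 # symmetric, no self-loops (undirected graph)
k, pk, moments = degree_moments(a, n_max=2)
print(moments[1], moments[2])                  # <k> and <k^2>, eqs. (3) and (4)
```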
in this context, uncorrelated random graphs are important from a numerical point of view, since we can test the behavior of dynamical systems whose theoretical solution is available only in the absence of correlations. for this purpose, catanzaro and collaborators [25] presented an algorithm to generate uncorrelated random networks with power-law degree distributions, called the uncorrelated configuration model (ucm), as described below. to construct these networks we start with a set of n disconnected vertices. each node i is assigned a number k_i of stubs, where k_i is a random variable with distribution p(k) ∼ k^{−γ} under the restrictions k_0 ≤ k_i ≤ n^{1/2} and \sum_i k_i even. this means that no vertex can have either a degree smaller than the minimum degree k_0 or larger than the cutoff k_c = n^{1/2}. the network is constructed by randomly choosing two stubs and connecting them to form links, avoiding both self- and multiple connections [25] (a code sketch of this construction is given below). it is possible to show [30] that, to avoid correlations in the absence of multiple and self-connections, the random network must have a structural cutoff scaling at most as k_c(n) ∼ n^{1/2}. as said previously, this algorithm is very useful in order to check the accuracy of many analytical solutions of dynamical processes on networks. because of that, it was chosen as the substrate for implementing the dynamics of the sis and cp models in this review. in the sis epidemic model, each vertex i of the network can be in only one of two states: infected or susceptible. let us assume the most general case where a vertex i becomes spontaneously healthy at rate µ_i and transmits the infection to each one of its k_i neighbors at rate λ_i. for the classical sis, one has µ_i = µ and λ_i = λ for every vertex [9]. as in the sis model, vertices in the cp model can be infected or susceptible, which in the reaction-diffusion systems' jargon are called occupied and empty, respectively. the spontaneous cure process is exactly the same as in the sis model: infected vertices become susceptible at rate µ_i = µ. however, the infection is different. an infected vertex tries to transmit the infection to a randomly chosen neighbor at rate λ, implying that the transmission rate of vertex i is λ_i = λ/k_i, where k_i is the number of neighbors of the i-th node. this drastically reduces the infective power of very connected vertices in comparison with the sis dynamics. both the sis and cp dynamics exhibit a phase transition between a disease-free (absorbing) state and an active stationary phase where a fraction of the population is infected. these regimes are separated by an epidemic threshold λ_c [8, 9]. the density of infected nodes ρ is the standard order parameter that describes this phase transition, as shown in figure 1. however, for a finite system the unique true stationary state is the absorbing state, even above the critical point, due to dynamical fluctuations. to overcome the difficulty of studying the active state of finite systems, some simulation strategies were proposed in the literature [2, 18, 31, 32], as we will show in section v c.
fig. 1: the usual behavior of the density of infected nodes ρ as a function of the control parameter λ in an epidemic model such as sis or cp, in the thermodynamic limit. the value λ_c is the epidemic threshold that separates the absorbing state from the active phase with ρ > 0.
when we study epidemic processes running on the top of heterogeneous networks, a more complex behavior can emerge.
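a minimal sketch of the ucm construction described above (power-law degrees with k_0 ≤ k_i ≤ n^{1/2}, random stub matching, self- and multi-edges discarded) is given below. it follows the recipe of [25] only in spirit: it uses networkx's configuration model and then drops multi-edges and self-loops as a shortcut, so it is an illustration rather than a faithful reimplementation of the original rejection scheme.

```python
import networkx as nx
import numpy as np

def ucm_network(n, gamma, k0=3, seed=0):
    """UCM-like network: P(k) ~ k^-gamma with structural cutoff k_c = sqrt(n)."""
    rng = np.random.default_rng(seed)
    kc = int(np.sqrt(n))
    ks = np.arange(k0, kc + 1)
    pk = ks.astype(float) ** (-gamma)
    pk /= pk.sum()
    degrees = rng.choice(ks, size=n, p=pk)
    if degrees.sum() % 2 == 1:                      # total number of stubs must be even
        degrees[0] += 1
    g = nx.configuration_model(degrees.tolist(), seed=seed)
    g = nx.Graph(g)                                 # collapse multi-edges ...
    g.remove_edges_from(list(nx.selfloop_edges(g))) # ... and remove self-loops
    return g

g = ucm_network(10_000, gamma=2.5)
print(g.number_of_nodes(), g.number_of_edges())
```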
indeed, the accurate theoretical understanding of epidemic models running on the top of complex networks ranks among the hottest issues in the physics community [10, 11, 14, 16-20, 33-42]. much effort has been devoted to understanding the criticality of the absorbing-state phase transitions observed in the cp [19-22, 36] and sis [10, 11, 14, 16-18, 35, 37-42] models, mainly based on perturbative approaches (first order in ρ) around the transition point [10, 11, 16, 17, 19, 34]. in the following we present the basic mathematical approaches for the epidemic dynamics. the prediction of disease evolution can be conceptualized within a variety of mathematical approaches. all theories aim at understanding the properties of epidemics in the equilibrium or long-term steady state, the existence of a non-zero density of infected individuals, the presence or absence of a threshold, etc. we will start with the simplest case, namely the homogeneous mean-field theory, and thereafter we will review other more sophisticated mathematical approaches. the simplest theory of epidemic spreading assumes that the population can be divided into different compartments according to the stage of the disease (for example, susceptible and infected in both the sis and cp models) and that, within each compartment, individuals (vertices in the complex networks' jargon) are identical and have approximately the same number of neighbors (edges), k ≈ ⟨k⟩. the idea is to write a time evolution equation for the number of infected individuals i(t), or equivalently the corresponding density ρ(t) = i(t)/n, where n is the total number of individuals. for example, the equation describing the evolution of the sis model is [43]

\frac{d\rho(t)}{dt} = -\mu\rho(t) + \lambda \langle k \rangle \rho(t)\left[1-\rho(t)\right], \qquad (5)

considering uniform infection and cure rates (λ_i = λ and µ_i = µ, ∀ i = 1, 2, ..., n). the first term on the right-hand side of eq. (5) refers to the spontaneous healing and the second one to the infection process, which is proportional to the spreading rate λ⟨k⟩, the density 1 − ρ(t) of susceptible vertices that may become infected, and the density ρ(t) of infected nodes in contact with any susceptible individual. note that the evolution of the sis model is completely described by eq. (5), since the density of susceptible individuals is s(t)/n = 1 − ρ(t). the mean-field character of this equation comes from the fact that correlations among different nodes are neglected: the probability that one infected vertex is connected to a susceptible one is approximated as ρ(t)[1 − ρ(t)]. near the phase transition between the absorbing state and the active stationary phase we can assume that the number of infected nodes is small, ρ(t) ≪ 1. in this regime, we can use a linear approximation, neglecting all ρ² terms and setting µ = 1 (i.e., measuring time in units of the healing rate), so that eq. (5) becomes

\frac{d\rho(t)}{dt} = -\left(1 - \lambda \langle k \rangle\right)\rho(t). \qquad (6)

the solution is ρ(t) ∼ e^{−(1−λ⟨k⟩)t}, implying that ρ = 0 is a stable fixed point for 1 − λ⟨k⟩ > 0. thus, one obtains

\lambda_c = \frac{1}{\langle k \rangle}; \qquad (7)

here, λ_c is the epidemic threshold such that for any infection rate above this value the epidemic lasts forever [43]. a similar analysis can be done for the cp model. in this case, the homogeneous mean-field equation reads

\frac{d\rho(t)}{dt} = -\rho(t) + \lambda \rho(t)\left[1-\rho(t)\right], \qquad (8)

since the transmission rate of each node is λ/⟨k⟩. performing the same linear stability analysis in the steady state, one obtains λ_c = 1. in this framework, one considers that the connectivity patterns among individuals are homogeneous, neglecting the highly heterogeneous structure of the contact network inherent to real systems [28].
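although not written out above, the stationary state of eq. (5) follows in one line and makes the threshold explicit (with µ = 1, as assumed in the linearization):

0 = -\rho^{*} + \lambda\langle k\rangle\,\rho^{*}\left(1-\rho^{*}\right)
\;\;\Longrightarrow\;\;
\rho^{*}=0 \quad\text{or}\quad \rho^{*}=1-\frac{1}{\lambda\langle k\rangle},

so the endemic solution exists only for λ > λ_c = 1/⟨k⟩ and vanishes linearly, ρ* ≈ ⟨k⟩(λ − λ_c), as the threshold is approached from above.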
many biological, social and technological systems are characterized by heavy-tailed distributions of the number of contacts k of an individual (the vertex degree), well described by a power-law degree distribution p(k) ∼ k^{−γ}. for such systems the homogeneity hypothesis is severely violated [27, 28, 44]. complex networks are, in fact, a framework where the heterogeneity of the contacts can be naturally taken into account [28]. indeed, this heterogeneity plays the main role in determining the epidemic threshold. to take it into account, other approaches have been proposed, as we show in the next sections. the major aim is to understand how epidemic spreading can be strongly influenced by the topology of networks. an important and frequently used mean-field approach to describe epidemic dynamics on heterogeneous networks is the quenched mean-field (qmf) theory [10], which explicitly takes into account the actual connectivity of the network through its adjacency matrix. the central idea is to write the evolution equation for the probability ρ_i(t) that a certain node i is infected. for the sis model, the dynamical equation for this probability takes the form [10]

\frac{d\rho_i(t)}{dt} = -\rho_i(t) + \lambda\left[1-\rho_i(t)\right]\sum_j a_{ij}\rho_j(t), \qquad (9)

where a_{ij} is the adjacency matrix. the first term on the right-hand side considers nodes becoming healthy spontaneously, while the second one considers the event that node i is healthy and gets the infection via a neighbor node. performing a linear stability analysis around the trivial fixed point ρ_i = 0, one has

\frac{d\rho_i(t)}{dt} \simeq \sum_j l_{ij}\rho_j(t), \qquad (10)

where the jacobian matrix is

l_{ij} = -\delta_{ij} + \lambda a_{ij}, \qquad (11)

δ_{ij} being the kronecker delta symbol. the transition occurs when the fixed point loses stability or, equivalently, when the largest eigenvalue of the jacobian matrix is υ_m = 0 [45]. the largest eigenvalue of l_{ij} is given by υ_m = -1 + λΛ_m, where Λ_m is the largest eigenvalue of a_{ij}. since a_{ij} is a real non-negative symmetric matrix, the perron-frobenius theorem states that one of its eigenvalues is positive and greater, in absolute value, than all other eigenvalues, and that its corresponding eigenvector has positive components. so, one obtains the epidemic threshold of the sis model in the qmf approach [10]:

\lambda_c^{qmf} = \frac{1}{\Lambda_m}, \qquad (12)

where Λ_m is the largest eigenvalue of the adjacency matrix. for the cp dynamics, eq. (9) becomes

\frac{d\rho_i(t)}{dt} = -\rho_i(t) + \lambda\left[1-\rho_i(t)\right]\sum_j \frac{a_{ij}}{k_j}\rho_j(t). \qquad (13)

performing the same linear stability analysis around the trivial fixed point, as was done for the sis model, one obtains

\frac{d\rho_i(t)}{dt} \simeq \sum_j l_{ij}\rho_j(t), \qquad (14)

where the jacobian matrix is given by

l_{ij} = -\delta_{ij} + \lambda c_{ij}, \qquad c_{ij} = \frac{a_{ij}}{k_j}. \qquad (15)

once again the transition point is defined when the absorbing phase becomes unstable or, equivalently, when the largest eigenvalue of l_{ij} is null [45]. the largest eigenvalue of l_{ij} is given by υ_m = -1 + λΛ_m, with Λ_m now the largest eigenvalue of c_{ij}. supported by the perron-frobenius theorem [44], we conclude that the largest eigenvalue of c_{ij} is Λ_m = 1, resulting in the transition point λ_c = 1, as obtained in the homogeneous approximation. returning to the sis model, equation (12) can be complemented with the results of chung et al. [46], who calculated the largest eigenvalue of the adjacency matrix of networks with a power-law degree distribution as

\Lambda_m \simeq \begin{cases} \sqrt{k_c}, & \text{if } \sqrt{k_c} > \langle k^2\rangle/\langle k\rangle \\ \langle k^2\rangle/\langle k\rangle, & \text{if } \sqrt{k_c} < \langle k^2\rangle/\langle k\rangle, \end{cases} \qquad (16)

where k_c is the degree of the most connected node and ⟨k^n⟩ is the n-th moment of the degree distribution. since k_c grows as a function of the network size for any γ, and ⟨k²⟩ diverges for 2 < γ < 3, the central result of equation (16) is that Λ_m diverges for enlarging networks with power-law degree distributions even when ⟨k²⟩ remains finite [46]. therefore, the epidemic threshold scales as [47]

\lambda_c \simeq \begin{cases} 1/\sqrt{k_c}, & \gamma > 5/2 \\ \langle k\rangle/\langle k^2\rangle, & 2 < \gamma < 5/2, \end{cases} \qquad (17)

which vanishes for any power-law degree distribution. the reasons for this difference in the λ_c predicted by the qmf approach, for γ larger or smaller than 5/2, are explained in ref. [47].
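as a quick numerical companion to eq. (12), the sketch below computes the qmf threshold 1/Λ_m from the largest eigenvalue of the adjacency matrix of an arbitrary undirected graph; the scale-free example generated with networkx is only illustrative, and for the ucm networks above the same two lines apply.

```python
import networkx as nx
import numpy as np

def qmf_threshold(g):
    """SIS epidemic threshold in the quenched mean-field approximation, eq. (12)."""
    a = nx.to_numpy_array(g)
    lam_max = np.linalg.eigvalsh(a).max()   # largest eigenvalue of the adjacency matrix
    return 1.0 / lam_max

g = nx.barabasi_albert_graph(2000, m=3, seed=1)
print(qmf_threshold(g))   # decreases as the network (and its largest hub) grows
```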
in processes allowing endemic steady states, the activation mechanisms depend on the degree of heterogeneity of the network: for γ > 5/2 the hub sustains activity and propagates it to the rest of the system, while for γ < 5/2 the innermost network core collectively turns into the active state, maintaining it globally [47-49]. however, the behavior of the sis model on random networks with power-law degree distribution can be much more complex than previously thought. we will discuss these different possible scenarios in section vii. although mean-field theories are a simplified description of the models, it is expected that they correctly predict the behavior of dynamical processes on networks, due to their small-world property. however, dynamical correlations are not taken into account, since the states of a node and its neighbors are considered independent. one can consider dynamical correlations by means of a pair approximation [2], in which the dynamics of an individual is explicitly influenced by its nearest neighbors, as we show in section iv d 1. in the degree-based theories, called heterogeneous mean-field (hmf) theories, dynamical quantities, such as the density of infected individuals, depend only on the vertex degree and not on the specific location of the vertices in the network. actually, the hmf theory can be obtained from the qmf one by performing a coarse-graining where vertices are grouped according to their degrees. to take into account the effect of the degree heterogeneity, we have to consider the relative density ρ_k(t) of infected nodes with a given degree k, i.e., the probability that a node with k links is infected. again using the sis model as an example, the dynamical mean-field rate equation describing the system can thus be written as [9]

\frac{d\rho_k(t)}{dt} = -\rho_k(t) + \lambda k\left[1-\rho_k(t)\right]\sum_{k'} p(k'|k)\rho_{k'}(t). \qquad (18)

the first term on the right-hand side considers nodes becoming healthy at unitary rate. the second term considers the event that a node with k links is healthy and gets the infection via a nearest neighbor. the probability of this event is proportional to the infection rate λ, the number of connections k, and the probability that any neighbor vertex is infected, p(k'|k)ρ_{k'}. the linearization of eq. (18) gives

\frac{d\rho_k(t)}{dt} \simeq \sum_{k'} l_{kk'}\rho_{k'}(t), \qquad (19)

where l_{kk'} = -δ_{kk'} + λ k p(k'|k). therefore, the epidemic threshold is

\lambda_c = \frac{1}{\Lambda_m}, \qquad (20)

where Λ_m is the largest eigenvalue of c_{kk'} = k p(k'|k). it is difficult to find the exact solution for Λ_m for a general form of p(k'|k), but it is possible to extract the value of the epidemic threshold in the case of uncorrelated networks, for which p(k'|k) = k'p(k')/⟨k⟩ and c_{kk'} = k k' p(k')/⟨k⟩. it is then easy to check that v_k = k is an eigenvector with eigenvalue ⟨k²⟩/⟨k⟩ which, according to the perron-frobenius theorem, is the largest. thus, we obtain the epidemic threshold

\lambda_c^{hmf} = \frac{\langle k\rangle}{\langle k^2\rangle}. \qquad (21)

equation (21) has strong implications, since several real networks have a power-law degree distribution p(k) ∼ k^{−γ} with exponents in the range 2 < γ < 3 [28]. for these distributions, the second moment ⟨k²⟩ diverges in the limit of infinite sizes, implying a vanishing threshold for the sis model or, equivalently, epidemic prevalence for any finite infection rate. both theories, hmf and qmf, predict vanishing thresholds for γ < 3, despite different scaling for 5/2 < γ < 3. however, hmf predicts a finite threshold for networks with γ > 3, unlike the qmf theory, which still predicts a vanishing threshold [47].
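to make the contrast between eqs. (17) and (21) concrete, a brief back-of-the-envelope comparison for γ > 3, assuming the natural degree cutoff k_c ∼ n^{1/(γ−1)} of power-law networks (an assumption not stated explicitly above):

\lambda_c^{hmf} = \frac{\langle k\rangle}{\langle k^2\rangle} \;\longrightarrow\; \text{const} > 0,
\qquad
\lambda_c^{qmf} \simeq \frac{1}{\sqrt{k_c}} \sim n^{-\frac{1}{2(\gamma-1)}} \;\longrightarrow\; 0 \quad (n\to\infty),

so for γ > 3 the hmf theory predicts a finite threshold while the qmf threshold still vanishes, in line with the discussion above; for 2 < γ < 3 both vanish, only with different scaling in the window 5/2 < γ < 3.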
for the cp model, the heterogeneous mean-field equation, analogous to eq. (18), can be written as

dρ_k/dt = −ρ_k + λ k (1 − ρ_k) Σ_{k'} P(k'|k) ρ_{k'}/k' .   (22)

again we assume degree-uncorrelated networks, and a simple linear stability analysis shows the presence of a phase transition, located at the value λ_c = 1, as found in the homogeneous mean-field theory, independent of the degree distribution and degree correlations. according to simulations [19, [21] [22] [23] [24] 36], the transition point does not quantitatively reproduce the predictions of either approach, hmf or qmf. however, the advantage of hmf over the qmf theory is that we can analytically obtain the critical exponents for the dynamical model and compare them with numerical results. indeed, some works have shown that the contact process running on top of highly heterogeneous random networks is well described by the heterogeneous mean-field theory [22, 23]. however, some important aspects, such as the threshold and the strong corrections to the finite-size scaling observed in simulations, are not clarified by this theory. we summarize the intense scientific discussion [19, 20, 22-24, 50, 51] about this subject in section vi. improvements of both the hmf and qmf theories including dynamical correlations by means of a pair approximation do not change the results qualitatively, but they promote a quantitative refinement, as we show hereinafter. in the paper [16], the authors investigated the role of dynamical correlations on the dynamics of the sis epidemic model on different substrates. we can start rewriting eq. (9) as

dρ_i/dt = −ρ_i + λ Σ_j a_ij φ_ij ,   (23)

where a_ij is the adjacency matrix and φ_ij represents the probability that a pair of nodes i and j is in the state φ_ij = [S_i I_j], meaning that node i is susceptible and its neighbor j is infected. when a simple one-vertex approximation is used, this joint probability is factorized as φ_ij ≈ (1 − ρ_i)ρ_j. in the corresponding pair equation for φ_ij, the first three terms represent the processes related to the pair of neighbors i and j (spontaneous annihilation reactions), while the other terms represent processes related to the interaction with the other neighbors of i and j, that is, with another vertex l that can infect i or j. equations (23) and (24) cannot be solved exactly due to the triplets. in turn, the dynamical equations for the triplets will depend on quadruplets, and so on. we therefore have to break these correlations at some point. to obtain a pair-approximation solution, we can apply the standard pair approximation [1, 52]. after a few steps (details are in reference [16]), similar to what we did in the one-vertex mean-field approximation, we perform a linear stability analysis around the fixed point ρ_i = [S_i I_j] = [I_i I_j] = 0 and a quasi-static approximation for t → ∞, dρ_i/dt ≈ 0 and dφ_ij/dt ≈ 0, in eqs. (23) and (24), and we can find the jacobian matrix. as in the case of the one-vertex qmf theory, the critical point is obtained when the largest eigenvalue of l_ij is null. analytical solutions for simple networks, such as random regular networks, star and wheel graphs, can be directly obtained from eq. (26). for arbitrary random networks with power-law degree distribution, such as the ucm model, the largest eigenvalue of eq. (26) can be determined numerically. in reference [16], mata and ferreira showed that the thresholds obtained in the pair and one-vertex qmf theories have the same scaling with the system size, but the pair qmf theory is quantitatively much more accurate than the one-vertex theory when compared with simulations. in reference [38] the authors also studied the impact of dynamical correlations on the sis dynamics on static networks. performing the same analysis for the contact process, one obtains the corresponding jacobian matrix.
the critical point is also obtained when its largest eigenvalue is null. a general analytical expression is not available for large random power-law degree networks but, in principle, it can be obtained numerically for any kind of network. however, for simple graphs such as a random regular network, in which all nodes have the same degree m, the transition point can be obtained after some algebraic manipulations; it is the same value yielded by the simple homogeneous pair approximation [2]. for the sis model, the equation for the probability that a vertex with degree k is occupied takes the form of eq. (29), where the conditional probability P(k'|k), which gives the probability that a vertex of degree k is connected to a vertex of degree k', weighs the connectivity between the compartments of degrees k and k', and φ_kk' represents a pair of nodes with degrees k and k', respectively, in the state [S_k I_k']. the dynamical equation for φ_kk' is eq. (30). the one-vertex mean-field equation [eq. (18)] is obtained by factoring the joint probability φ_kk' ≈ (1 − ρ_k)ρ_k' in eq. (29). the factor k − 1 preceding the first summation in eq. (30) is due to the k neighbors of the middle vertex except the link of the pair [0_k 0_k'] (similarly for the factor k' − 1 preceding the second summation). following the same line of reasoning as in the previous calculations, we now approximate the triplets in eq. (30) using the standard pair approximation, perform a linearization around the fixed point ρ_k ≈ 0 and φ_kk' ≈ 0, and perform a quasi-static approximation for t → ∞, in which dρ_k/dt ≈ 0 and dφ_kk'/dt ≈ 0. considering uncorrelated random networks, we obtain the jacobian l_kk', with δ_kk' being the kronecker delta symbol. again, the absorbing state is unstable when the largest eigenvalue of l_kk' is positive. therefore, the critical point is obtained when the largest eigenvalue of the jacobian matrix is null, thus obtaining

λ_c = ⟨k⟩/(⟨k²⟩ − ⟨k⟩) .   (32)

this threshold coincides with that of the susceptible-infected-recovered (sir) model in a one-vertex hmf theory [27]. this result was also proposed in ref. [18] using heuristic arguments. they argued that dynamical correlations, which are neglected in a one-vertex hmf theory, account for the fact that an infected neighbor has a higher probability of still being infected; this means that, in the next step, this node can be considered immunized (recovered for a while). so, a better upper bound for the spreading of the disease is given by the hmf theory of the sir model, which is exactly the threshold given by eq. (32). in a similar way, we can perform the same analysis for the cp model and obtain the corresponding jacobian. assuming that the network does not present degree correlations, we have P(k'|k) = k' P(k')/⟨k⟩, so that u_k = k is an eigenvector of c_kk' with some eigenvalue Λ; since c_kk' > 0 is irreducible and u_k > 0, the perron-frobenius theorem [44] guarantees that Λ is the largest eigenvalue of c_kk'. the critical point is given by −1 + λΛ = 0, which means that we have to solve this transcendental equation numerically to obtain the transition point for any kind of degree-uncorrelated network. using a random regular network as an example, that is, P(k) = δ_{k,m}, where m is the degree of all nodes of the network, we again obtain the same threshold as in the homogeneous and quenched pair mean-field theories. in reference [53] the authors have also determined the critical exponents in the pair hmf approach for the cp model. in the infinite-size limit the exponents are the same as in the one-vertex theory, as expected.
however, the finite-size corrections to the scaling obtained in the pair hmf theory allowed a remarkable agreement with simulations for all degree exponents investigated (2.0 ≤ γ ≤ 3.5), suppressing a deviation observed for low degree exponents in the one-vertex hmf theory [23]. numerical simulations are an essential tool to check the accuracy of mean-field approaches in the study of epidemic processes on complex networks. although this tool is widely used, strict implementations of epidemic processes on networks with highly heterogeneous degree distributions are not simple. in reference [54], the authors showed that, depending on the network properties, the threshold of the sis model can be altered when modifications are made to the sis dynamics, even when the basic properties of spontaneous healing and of an infection potential of each vertex growing unboundedly with its degree are preserved. the classical algorithm to model continuous-time markov processes is known as the gillespie algorithm [55]. in this recipe, each dynamical transition (infection and healing for the sis or cp model, for example) is associated with a poisson process, that is, with independent spontaneous processes. at each change of state, we have to update a list containing all possible spontaneous processes. however, for very large networks, this is computationally unfeasible. so, we use an optimized gillespie algorithm proposed by cota and ferreira [56]. in the following sections, we summarize these optimized algorithms for the sis and cp dynamics. for details of how to implement optimized algorithms for continuous-time markovian processes based on the gillespie algorithm, see reference [56]. the sis dynamics in a network of size N can be simulated in a very simple way: select a vertex at random with equal chance. if the selected vertex i is infected, we turn it susceptible with a probability proportional to µ; if the selected vertex i is susceptible, it becomes infected with a probability proportional to λ n_i. here k_max is the maximal number of connections in the network, n_i = Σ_j a_ij σ_j is the number of infected nearest neighbors of vertex i, and σ_j = 1 corresponds to an infected node while σ_j = 0 otherwise. here, we are considering the simplest case of λ_i = λ and µ_i = µ for all vertices. this algorithm is accurate and can be used for any generic sis dynamics. however, if one is interested in regions close to the threshold, where the great majority of the vertices are susceptible, the algorithm is very inefficient, since changes happen only in the neighborhood of infected vertices. therefore, we can use a more efficient strategy based on the previous algorithm. this strategy requires keeping and constantly updating a list p with the positions of all infected vertices, where changes will take place. the list update is simple: the position of a newly infected vertex is added at the end of the list and, when an infected vertex becomes susceptible, the last entry of the list is moved to the position of the cured vertex. the total rate at which infected vertices become susceptible in the whole network is R = µN_I, where N_I is the number of infected vertices. analogously, the total rate at which susceptible vertices are infected is given by J = λN_e, where N_e is the number of edges emanating from infected nodes. so, the sis dynamics can be simulated according to the algorithm proposed by ferreira et al. [12] as follows: the time is incremented by ∆t = 1/(R + J). with probability p = R/(R + J) an infected vertex i is selected at random and turned susceptible.
with complementary probability q = J/(R + J) an infected vertex is selected at random and accepted with probability proportional to its degree. in the infection attempt, a neighbor of the selected vertex is randomly chosen and, if susceptible, it is infected; otherwise nothing happens and the simulation runs to the next time step. the cp dynamics can also be efficiently simulated if a list p of occupied vertices is used, analogously to the sis algorithm. the total rate of cure is also given by R = µN_I. the total creation rate is J = λN_I [2]. an infected vertex i is selected with equal chance. with probability p = R/(J + R) = µ/(µ + λ) it is cured. with probability q = J/(J + R) = λ/(µ + λ) one of the k_i neighbors of i is selected and, if susceptible, it is infected. the time is incremented by ∆t = 1/[N_I(λ + µ)]. notice that infected vertices are selected independently of their degrees and that with probability n_i/k_i the infection attempt reaches an already infected neighbor. an illustrative implementation of these update steps is sketched below. although equivalent for strictly homogeneous graphs (k_i ≡ k ∀ i), the sis and the cp models are very different on heterogeneous substrates. the universality class of the cp and sis models is the same in homogeneous lattices: both models belong to the directed percolation universality class [6]. nevertheless, in complex networks, heterogeneity affects both models and, in the heterogeneous mean-field approach, they have different critical exponents (see the discussion in ref. [57]). in both the sis and cp models, the control parameter of the dynamics is the infection rate λ. in the thermodynamic limit, above a critical value λ_c (the epidemic threshold), a finite fraction of the population is infected. however, for λ < λ_c, the epidemic cannot survive and the dynamics goes to an absorbing state where everyone is susceptible. nevertheless, for a finite system the only achievable stationary state is the absorbing state, even above the critical point, because of dynamical fluctuations. some simulation methods were proposed in the literature to address the issue of studying the active state in finite systems, as we will see in the next section. mean-field approaches and field-theory renormalization are the key analytical tools to investigate dynamical processes with absorbing-state phase transitions [58]. while the former is only valid above the upper critical dimension, application of the latter in physical dimensions is hindered by large technical difficulties. for this reason, most of our knowledge about the properties of absorbing-state phase transitions is based on the computer simulation of different representative models. the numerical analysis of these computer data also represents a challenge, mainly because of finite-size effects. in finite systems, any realization of the dynamics reaches the absorbing state sooner or later, even above the critical point, due to dynamic fluctuations. this difficulty was traditionally overcome by starting with a finite initial density of active sites and averaging only over surviving samples, i.e., realizations which have not yet fallen into the absorbing state [2, 59]. analyzing the quasistationary state defined by surviving averages, the critical point and various critical exponents can be determined by studying the decay of the survival averages of different observables as a function of the system size. however, averaging over surviving runs is computationally inefficient, since surviving configurations become increasingly rare at long times.
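the optimized sis update step summarized above can be sketched as follows; this is our own illustrative python implementation of the scheme, not the reference code of [12, 56], and the data structures (the infected list with an index map, rejection sampling by degree) are choices of ours. the cp version is analogous, with total creation rate λN_I, uniform selection of the infected vertex and time increment 1/[N_I(λ + µ)].

```python
import random

def sis_optimized_gillespie(adj, lam, mu=1.0, t_max=1e3, seed=1):
    """Optimized Gillespie-type simulation of the SIS model on a network.

    adj     : list of neighbor lists (adj[i] = neighbors of vertex i)
    lam, mu : infection and healing rates
    Returns the time series of (time, number of infected vertices).
    """
    rng = random.Random(seed)
    n = len(adj)
    k = [len(adj[i]) for i in range(n)]
    k_max = max(k)

    infected = list(range(n))                          # start fully infected
    pos = {i: idx for idx, i in enumerate(infected)}   # vertex -> index in list
    n_e = sum(k)                                       # edges emanating from infected

    t, series = 0.0, []
    while infected and t < t_max:
        n_i = len(infected)
        r, j = mu * n_i, lam * n_e
        t += 1.0 / (r + j)
        series.append((t, n_i))

        if rng.random() < r / (r + j):
            # healing: remove a uniformly chosen infected vertex from the list
            i = rng.choice(infected)
            last = infected[-1]
            infected[pos[i]] = last
            pos[last] = pos[i]
            infected.pop()
            del pos[i]
            n_e -= k[i]
        else:
            # infection attempt: infected vertex chosen with prob. ~ its degree
            # (rejection sampling), then a random neighbor is tested
            while True:
                i = rng.choice(infected)
                if rng.random() < k[i] / k_max:
                    break
            nb = rng.choice(adj[i])
            if nb not in pos:                          # susceptible -> infect
                pos[nb] = len(infected)
                infected.append(nb)
                n_e += k[nb]
            # otherwise nothing happens (phantom process); time still advances
    return series
```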
a much more efficient strategy is provided by the quasistationary (qs) method, proposed by de oliveira and dickman [32, 60], in which every time the system is on the verge of falling into an absorbing state, it jumps to an active configuration previously stored during the simulation. this can be computationally implemented by saving, and constantly updating, a sample of the states already visited. the update is done by periodically replacing one of these configurations by the current one. the characteristic relaxation time is always short for epidemics on random networks due to the very small average shortest path [44]. the averaging time, on the other hand, must be large enough to guarantee that the epidemics over the whole network is suitably averaged. this means that very long times are required for very low qs densities (sub-critical phase), whereas relatively short times are sufficient for high-density states [31, 32]. both equilibrium and non-equilibrium critical phenomena are hallmarked by simultaneously diverging correlation length and time, which microscopically reflect the divergence of the spatial and temporal fluctuations [1], respectively. even though a diverging correlation length makes little sense on complex networks due to the small-world property [61], the diverging-fluctuation concept is still applicable. we used different criteria to determine the thresholds, relying on the fluctuations or singularities of the order parameters. the qs probability p̄_n, which does not depend on the initial condition, is defined as the probability that the system has n occupied vertices in the qs regime; it is computed during the averaging time, and basic qs quantities, such as the lifespan and the density of infected vertices, are derived from p̄_n. indeed, we have that ρ̄ = (1/N) Σ_n n p̄_n and τ = 1/p̄_1 [32], where τ is the lifespan of the epidemic. thus, thresholds for finite networks can be determined using the modified susceptibility [12]

χ = N (⟨ρ²⟩ − ⟨ρ⟩²)/⟨ρ⟩ ,   (39)

which exhibits a pronounced divergence at the transition point for the sis [12, 14, 16] and contact process [53, 62] models on networks. the choice of the alternative definition, eq. (39), instead of the standard susceptibility χ̃ = N (⟨ρ²⟩ − ⟨ρ⟩²) [6], is due to the peculiarities of dynamical processes on complex networks. in a finite system of size N, χ shows a diverging peak at λ = λ_p^qs(N), providing a finite-size approximation of the critical point. in the thermodynamic limit, λ_p^qs(N) approaches the true critical point with a known scaling form [63], as we can see for the sis model in figure 2(a). the network was generated with the uncorrelated configuration model [25], where the vertex degree is selected from a power-law distribution P(k) ∼ k^(−γ) with a lower bound k_0 = 3. in the context of epidemic modeling on complex networks [64], boguñá et al. [18] proposed another strategy, which considers the lifespan of spreading simulations starting from a single infected site as a tool to determine the position of the critical point. each realization of the dynamical process is characterized by its lifespan and its coverage c, where the latter is defined as the fraction of distinct sites which have been occupied at least once during the realization. in the thermodynamic limit, realizations can be either finite or infinite, according to whether they proceed below or above the critical point. endemic realizations have an infinite lifetime and their coverage is equal to 1. finite realizations have instead a finite lifetime and a coverage that is vanishingly small in the limit of diverging size.
in finite systems this distinction is blurred, since any realization is bound to end, reaching the absorbing state, although this can occur over long temporal scales. in practice, a realization is assumed to be active whenever its coverage reaches a predefined threshold value c_th, which is generally taken equal to c_th = 0.5. realizations ending before the value c = c_th is reached are considered to be finite. in this method the role of the order parameter is played by the probability Prob(λ − λ_c, N) that a run is long-term, while the role of the susceptibility is played by the average lifetime τ of finite realizations. for small values of λ all realizations are finite and have a very short duration τ. as λ grows, the average duration of finite realizations increases, but for very large λ almost all realizations are long-term, with only very short realizations remaining finite. for this reason τ exhibits a peak at a value λ_p(N) depending on N and converging to λ_c in the thermodynamic limit. we can then use the average lifespan to determine the critical point numerically, as shown in figure 2(b). as we observe in figure 3, the critical points as a function of the network size obtained by both methods are in very good agreement. in reference [67], sander and collaborators summarize alternative methods to work around the problem related to the absorbing phase in simulations. they mention the reflecting boundary condition [68], which consists basically in avoiding the absorbing state by reverting the system to the configuration it was in immediately before visiting the absorbing state. another strategy is to use a uniform external field that creates particles spontaneously at a given rate which disappears in the thermodynamic limit [69]. in addition, one can use the hub reactivation method on heterogeneous networks: if the system reaches the absorbing state, the dynamics starts again with the most connected vertex of the network infected. in ref. [24], castellano and pastor-satorras derived the hmf theory for the cp dynamics in the limit of infinite network size. they obtained the scaling of the stationary density and, at the transition point λ = λ_c, also the scaling of the relaxation time; these exponents are also obtained using a pair hmf approximation, as shown in reference [53]. it is not possible to check these predictions directly with numerical analysis because of finite-size effects. a comparison became possible using the finite-size scaling (fss) ansatz [59], adapted to the network topology, and it was initially concluded that the cp dynamics on networks was not described by the hmf approximation. however, it was assumed in ref. [24] that heterogeneous networks follow the same fss known for regular lattices [2]. indeed, the fss on networks is more complicated than previously assumed. the behavior of the cp on networks of finite size depends not only on the number of vertices N but also on the moments of the degree distribution [19]. this implies that, for scale-free networks, the scaling around the critical point depends explicitly on how the largest degree k_c diverges with the system size N. such a dependence introduces very strong corrections to scaling. however, when such corrections are suitably taken into account, the cp on heterogeneous networks was shown to agree with the predictions of hmf theory with good accuracy [22]. ferreira and collaborators [23] started from eq.
(22) and considered, in addition, uncorrelated networks with P(k'|k) = k' P(k')/⟨k⟩, to obtain the equation for the overall density ρ̄ = Σ_k ρ_k P(k). a mean-field theory for the fss can be obtained using the strategy proposed by castellano and pastor-satorras [19], in which the equation of motion is mapped, in the limit of very low densities, onto a one-step process with transition rates w(n, m) representing the transitions from a state with m infected vertices to a state with n infected vertices. in the stationary state, where the time derivatives vanish, eq. (44) reads [22]

ρ_k = (λ k ρ̄/⟨k⟩)/(1 + λ k ρ̄/⟨k⟩) .   (46)

close to criticality, when the density at long times is sufficiently small such that ρ̄ k_c ≪ 1, eq. (46) becomes ρ_k ≈ λ k ρ̄/⟨k⟩. substituting this result in eq. (45), one finds that the first-order approximation for the one-step process has the rates

w(n − 1, n) = n  and  w(n + 1, n) = λ n (1 − n g) ,   (47)

where g = ⟨k²⟩/⟨k⟩². the master equation for a standard one-step process is [70]

dp_n/dt = Σ_m w(n, m) p_m(t) − Σ_m w(m, n) p_n(t) .   (48)

substituting the rates (47), we find

dp_n/dt = (n + 1) p_{n+1} + u_{n−1} p_{n−1} − (n + u_n) p_n ,   (49)

with u_n = λ n (1 − n g). since the probability for the process not to have ended up in the absorbing state up to time t, named the survival probability, is given by p_s(t) = Σ_{n≥1} p_n(t), we can define the quasistationary (qs) distribution p̄_n as [2]

p̄_n = lim_{t→∞} p_n(t)/p_s(t) ,

with p̄_0 ≡ 0 and the normalization condition Σ_{n≥1} p̄_n = 1 (see more details in section v c). the solutions of equation (49) have already been exhaustively investigated, so we merely report the results of ref. [22], where it was found that the critical qs distribution for large systems has a scaling form in terms of a scaling function f(x) with suitable properties; from it, the scaling of the critical quasistationary density and of the characteristic time with the system size follows. for a network with degree exponent γ and a cutoff scaling with the system size as k_c ∼ N^(1/ω), where ω = max[2, γ − 1] for uncorrelated networks with power-law degree distribution [25], the factor g scales for asymptotically large systems as g ∼ k_c^(3−γ) for γ < 3 and g ∼ const for γ > 3. the result is a scaling law ρ̄ ∼ N^(−ν̂) and τ ∼ N^(α̂), where the exponents ν̂ and α̂ are determined by γ and ω. in ref. [23], ferreira et al. investigated the cp on heterogeneous networks with power-law degree distribution by performing quasistationary simulations, and concluded that the heterogeneous mean-field theory correctly describes the critical behavior of the contact process on quenched networks. however, some important questions remained unanswered. the transition point λ_c = 1 predicted by this theory does not capture the dependence on the degree distribution observed in simulations. subleading corrections to the finite-size scaling, undetected by the one-vertex hmf theory, are quantitatively relevant for the analysis of highly heterogeneous networks (γ → 2), for which deviations from the theoretical finite-size scaling exponents were reported [23]. the hmf theory assumes that the number of connections of a vertex is the quantity relevant to determine its state and neglects all dynamical correlations. in reference [53], however, the authors present a pair hmf approximation, the simplest way to explicitly consider dynamical correlations, for the cp on heterogeneous networks. although they found the same critical exponents obtained in the simple hmf approximation, the finite-size scaling corrections were better, supporting the conclusion that degree-based theories correctly estimate the scaling exponents of the contact process on scale-free networks.
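in practice, the qs quantities used throughout this section and the peak-based threshold estimates discussed earlier are obtained from the measured qs distribution. the sketch below is an illustration of ours, not code from the cited works: it derives ρ̄ = (1/N) Σ_n n p̄_n, τ = 1/p̄_1 and the modified susceptibility of eq. (39) from a histogram of the number of infected vertices accumulated during the qs averaging time, and refines the position of the maximum of a measured curve, such as χ(λ) or the mean lifetime of finite realizations, with a parabolic fit (a common but by no means unique choice).

```python
import numpy as np

def qs_observables(hist, N):
    """Quasistationary observables from a histogram of the number of infected
    vertices sampled during the QS averaging time.

    hist[n] = number of samples with n infected vertices (n = 0..N);
    by construction of the QS method, hist[0] should be zero.
    """
    p = np.asarray(hist, dtype=float)
    p = p / p.sum()                       # QS distribution p_bar_n
    n = np.arange(len(p))
    rho1 = (n * p).sum() / N              # <rho>   = (1/N) sum_n n p_bar_n
    rho2 = (n ** 2 * p).sum() / N ** 2    # <rho^2>
    tau = 1.0 / p[1]                      # lifespan, tau = 1 / p_bar_1
    chi = N * (rho2 - rho1 ** 2) / rho1   # modified susceptibility, eq. (39)
    return rho1, tau, chi

def peak_position(lams, obs):
    """Finite-size threshold estimate lambda_p(N): position of the maximum of
    a measured curve (e.g. chi(lambda) or the mean lifetime of finite runs),
    refined with a parabola through the grid maximum and its neighbors."""
    lams = np.asarray(lams, dtype=float)
    obs = np.asarray(obs, dtype=float)
    i = int(np.argmax(obs))
    if i == 0 or i == len(lams) - 1:
        return lams[i]                    # peak at the border: no refinement
    a, b, _ = np.polyfit(lams[i - 1:i + 2], obs[i - 1:i + 2], 2)
    return -b / (2.0 * a)
```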
as we saw in the previous sections, distinct theoretical approaches were devised for the sis and cp models to determine an epidemic threshold λ_c separating an absorbing, disease-free state from an active phase [10-12, 14-16, 18, 71]. the quenched mean-field (qmf) theory [71] explicitly includes the entire structure of the network through its adjacency matrix, while the heterogeneous mean-field (hmf) theory [9, 72] performs a coarse-graining of the network, grouping vertices according to their degree. however, for the sis model, the two theories predict different thresholds. the hmf theory predicts a vanishing threshold for the range 2 < γ ≤ 3, while a finite threshold is expected for γ > 3. conversely, the qmf theory states that the threshold is inversely proportional to the largest eigenvalue of the adjacency matrix, implying that the threshold vanishes for any value of γ [10]. regardless, goltsev et al. [11] proposed that the qmf theory predicts the threshold for an endemic phase, in which a finite fraction of the network is infected, only if the principal eigenvector of the adjacency matrix is delocalized. in the case of a localized principal eigenvector, which usually happens for large random networks with γ > 3 [73], the epidemic threshold is associated with the eigenvalue of the first delocalized eigenvector. for γ < 3, there exists a consensus for sis thresholds: both hmf and qmf are equivalent and accurate for γ < 2.5, while qmf works better for 2.5 < γ < 3 [12, 16]. lee et al. [14] proposed that, in a range λ_c^qmf < λ < λ_c with a nonzero λ_c, the hubs in a random network become infected, generating epidemic activity in their neighborhoods. this activity has a characteristic lifespan τ(k, λ) depending on the degree k and the infection rate λ. on networks where almost all hubs are directly connected, the activity can be spread among them if the lifespan τ(k, λ) is large enough. then, above λ_c^qmf, the network is able to sustain an endemic state due to the mutual reinfection of connected hubs. however, when hubs are not directly connected, the reinfection mechanism does not work and high-degree vertices produce independent active domains. these independent domains were classified as rare regions, in which activity can last for very long periods, increasing exponentially with the domain size [74]. this means that usually we have two distinct states: λ > λ_c corresponds to a supercritical phase where the system is globally active, and λ < λ_c corresponds to an absorbing inactive state. however, the sis dynamics running on top of power-law networks presents a region in which λ is smaller than the epidemic threshold, but greater than a certain value below which the epidemic actually ends, where the activity survives for very long times. this results in a slow dynamics known as a griffiths phase [36, 75, 76]. the sizes of these active domains increase with increasing λ, leading to an overlap among them and, finally, to an endemic phase for λ > λ_c. in the thermodynamic limit these regions vanish, because they shrink as the network size increases [17, 73, 77]. this anomalous behavior in the subcritical phase was also investigated in reference [37]. the authors used extensive simulations to show that the sis model running on top of power-law networks with γ > 3 can exhibit multiple peaks in the susceptibility curve that are associated with large gaps in the degree distribution among the few most highly connected nodes, which permit the formation of these independent domains of activity.
however, if the number of hubs is large, as occurs for networks with γ < 3, the domains are directly connected and the activation of the hubs implies the activation of the whole network. the arguments presented by the authors of reference [37] are in agreement with the scenario investigated in refs. [11, 14], which leads to the conclusion that the threshold to an endemic phase is finite in random networks with a power-law degree distribution for γ > 3. inspired by the appealing arguments of lee et al. [14], boguñá, castellano and pastor-satorras [18] reconsidered the problem, proposed a semi-analytical approach taking into account a long-range reinfection mechanism, and found a vanishing epidemic threshold for γ > 3. as reported by lee et al. [14], when the hubs of a network are directly connected, the activity can be spread throughout the network even in the limit λ → 0. however, when high-degree nodes are distant from each other, these hubs are able to sustain local active domains around them, and the endemic state is reached only with a nonzero λ_c. nevertheless, boguñá, castellano and pastor-satorras [18] revisited the problem taking into account long-range dynamic correlations in a coarse-grained time scale. as explained in ref. [18], their approach states that a direct connection between hubs is not a necessary condition for them to be reinfected: there is the possibility of "long-range" reinfection, since the network has the small-world property. in this approach, it was concluded that the epidemic threshold vanishes for random networks with P(k) decaying more slowly than exponentially, in particular for P(k) ∼ k^(−γ) with any γ. this was rigorously proved by chatterjee and durrett [42] in the thermodynamic limit, for networks with degree distribution P(k) ∼ k^(−γ) with γ > 3. afterward, mountford and collaborators [40] expanded the result found in reference [42] to include the range 2 < γ ≤ 3 of the degree exponent. they also analysed the behavior of the density of infected nodes as a function of λ close to the epidemic threshold and predicted analytical exponents which were also found in the numerical results of reference [41]. recently, castellano and pastor-satorras [39] deepened the understanding of the sis dynamics by elaborating a mathematical formulation of the mutual reinfection process, using the cumulative merging percolation (cmp) process proposed by menard and singh [78]. in this process, each node can be considered active with a certain probability; inactive nodes act just as bridges between active ones. an initial cluster of size 1 contains only an active node but, in an iterative process, two clusters can collapse into one if a criterion on the topological distance between them is satisfied. so the cmp process creates a cluster composed of a set of active nodes that were aggregated through the iteration of merging events; such nodes are part of the same connected component of the underlying network. the insight of castellano and pastor-satorras [39] was to identify the hubs able to sustain the epidemic with the active nodes of the cmp process. therefore, they could relate this process to the reactivation of hubs that are not directly connected. consequently, they observed that the presence of a cmp giant component is related to an endemic stationary state. in their paper, they showed that the epidemic threshold does not behave as predicted by the qmf theory but vanishes more slowly, with an exponent that decreases as γ increases.
the dependence of the epidemic threshold on the network size that they found is in agreement with the asymptotic scaling obtained analytically by mountford and collaborators [40] and, recently, by huang and durrett [79]. in this paper, we have reviewed the main features of the sis and cp models, which have been widely used to describe epidemic dynamics on complex networks. throughout this review we described both models, presented the distinct theoretical approaches devised for them, their advantages and disadvantages, and the main differences among them. we also presented simulation techniques to analyze both models numerically. since both models are examples used to investigate absorbing phase transitions in complex networks, we presented the main simulation strategies to overcome the difficulty of studying the active state of finite networks. finally, we reported the central issue for each model and summarized the difficulties and discussions that came up in the literature related to these issues: for the sis model, problems related to determining the epidemic threshold on heterogeneous networks; for the cp model, concerns related to the critical exponents and the degree distribution of the network. although there are some books and articles reviewing such models, the main idea of this manuscript is to provide, in summary, an overview of the main points of this subject: the theories and simulation techniques, as well as the main concerns about the investigation of epidemic models with phase transitions to absorbing states running on top of complex networks. the progress in this area grows incredibly fast and it is not possible to discuss all recent results, but we try to mention just a few of the many research lines. in the last decades, epidemic models have also been studied on hypergraphs [80], temporal networks [81] [82] [83], metapopulations [84, 85] and also multiplex substrates [86, 87]. these works analyzed, among other issues, epidemic spreading with awareness, social contagion, measures of epidemic control and how patterns of mobility affect the transmission of the disease. there are also studies about the impact of the infectious period or recovery rates on epidemic spreading [88] [89] [90], spectral properties of epidemics in correlated networks [41], and the speed of disease spreading [91]. it is also important to mention the challenges in modelling spreading diseases related to public health, global transmission [92, 93] as well as livestock and vector-borne diseases [94, 95]. the relevance of studying epidemic models is also evident when faced with alarming situations such as the recent pandemic of covid-19 caused by the new coronavirus [96] [97] [98]. these references and the others cited throughout the manuscript provide accurate studies for readers who wish to go deeper into the subject. non-equilibrium phase transitions; nonequilibrium phase transitions in lattice models; mathematical epidemiology of infectious diseases: model building, analysis and interpretation; infectious diseases in humans; nonequilibrium phase transition: absorbing phase transitions; complex webs in nature and technology; dynamical processes on complex networks; the mathematical theory of infectious diseases and its applications; networks: an introduction; chaos and nonlinear dynamics: an introduction for scientists and engineers; proc. natl. acad. sci. usa;
critical dynamics of; current physics-sources and comments; monte carlo simulation in statistical physics: an introduction; stochastic processes in physics and chemistry. this work was partially supported by capes, fapemig and cnpq. the author thanks fapemig (grant no. apq-02482-18) and cnpq (grant no. 423185/2018-7) for financial support. the author also wishes to express her deepest gratitude to silvio c. ferreira for reading this overview and providing relevant suggestions. key: cord-312817-gskbu0oh title: spatiotemporal network structure among "friends of friends" reveals contagious disease process authors: witte, carmel; hungerford, laura l.; rideout, bruce a.; papendick, rebecca; fowler, james h. date: 2020-08-06 journal: plos one doi: 10.1371/journal.pone.0237168 sha: doc_id: 312817 cord_uid: gskbu0oh disease transmission can be identified in a social network from the structural patterns of contact. however, it is difficult to separate contagious processes from those driven by homophily, and multiple pathways of transmission or inexact information on the timing of infection can obscure the detection of true transmission events. here, we analyze the dynamic social network of a large and near-complete population of 16,430 zoo birds tracked daily over 22 years to test a novel "friends-of-friends" strategy for detecting contagion in a social network. the results show that cases of avian mycobacteriosis were significantly clustered among pairs of birds that had been in direct contact. however, since these clusters might result from correlated traits or a shared environment, we also analyzed pairs of birds that had never been in direct contact but were indirectly connected in the network via other birds. the disease was also significantly clustered among these friends of friends, and a reverse-time placebo test shows that homophily could not be causing the clustering. these results provide empirical evidence that at least some avian mycobacteriosis infections are transmitted between birds, and provide new methods for detecting contagious processes in large-scale global network structures with indirect contacts, even when transmission pathways, timing of cases, or etiologic agents are unknown. avian mycobacteriosis is a bacterial disease that has long been considered contagious, passing indirectly between birds through the fecal-oral route [1, 2]. however, recent long-term studies in well-characterized cohorts have found low probabilities of disease acquisition among exposed birds [3, 4] and multiple strains and species of mycobacteria associated with single outbreaks [5] [6] [7] [8]. these findings suggest pre-existing environmental reservoirs of potentially
all birds in this population were under close keeper observation and veterinary care during the entire study period and received complete post-mortem exams if they died. birds in this population had documented dates of hatch, acquisition, removal, and death. we excluded a small number of birds (n = 437) because they had incomplete information on enclosure histories. the 16,430 remaining birds had near-complete enclosure tracking over time with move-in and move-out dates for each occupied enclosure. all management data were stored in an electronic database. thus, the population represents a group of birds for which 1) a near-complete social network could be assembled from housing records that tracked dynamic movement over time, and 2) avian mycobacteriosis disease status could be determined for any bird that died. all historic data in these retrospective analyses were originally collected for medical activities and animal management purposes unrelated to the present study. for these reasons, the san diego zoo global institutional animal care and use committee exempted our study from review for the ethical use of animals in research. if a bird in the source population died, a board-certified veterinary pathologist conducted a thorough post-mortem exam that included histopathology on complete sets of tissues, unless advanced autolysis precluded evaluation. if lesions suggestive of avian mycobacteriosis were observed, then special stains (ziehl-neelsen or fite-faraco) were used to confirm presence of acid-fast bacilli. occasionally, clinical presentation permitted antemortem diagnosis based on tissue biopsy. for this study, any bird with acid-fast bacilli present in tissues was considered positive for avian mycobacteriosis at the date of diagnosis. birds were classified as 'infected' on their date of diagnosis or 'uninfected' on their date of death if the post-mortem examination showed no evidence of disease. birds were also classified as 'uninfected' on their date of export if they were still apparently healthy. birds that were still alive on the study end date of 6/1/2014, were followed for up to the assumed minimum incubation period (further described below; e.g., six months or through 11/28/2014) to determine final disease status. the network was defined based on the subset of birds that qualified as subjects and their friends (network nodes), and the connections between them (network edges). study subjects included all birds from the source population with complete information on history of exposure to other birds. this included both birds that hatched in the population, as well as birds imported from elsewhere. if a bird was imported, then it must have been present for a duration equal to or greater than the maximum incubation period (further defined below); those that were present for less time were not included as a study subject because they could have been infected prior to importation. any bird that directly shared an enclosure with a subject for at least seven days was considered a "friend". thus, the same bird could serve as both a subject as well as a friend for other birds, as illustrated in fig 1. spatial connections between subjects and friends were determined through cross-referencing enclosure move-in and move-out dates of all birds. contact occurring in a few enclosures, including hospital and quarantine enclosures, could not be determined and was therefore excluded. 
exposures that could lead to potential transmission of mycobacteriosis would be those which occurred within the incubation period of the subject (fig 1). however, the distribution of the true incubation period for avian mycobacteriosis is unknown. as a starting point, the minimum incubation period, i.e., the minimum time for an exposure to result in detectable disease, was set to six months. this was based on early literature from experimental studies that mimicked natural transmission [27, 28]. this is also consistent with our own data, where the earliest case in the population occurred at 182 days of age [3]. the maximum incubation period was set to two years. early studies reported deaths occurring up to 12-14 months after infection [27] [28] [29]; however, some authors reviewed by feldman [1] considered it possible that the disease progression could take years. for subjects that were classified as non-infected, this same interval (two years to six months prior to death or censoring) was used to identify contact with friends. for example, if a subject died on january 1, 2005, it would be connected to all friends with which it shared an enclosure for at least seven days within the time window of two years until six months prior to the subject's death, or between january 1, 2003 and july 1, 2004. exposures of subjects to friends that could lead to potential disease transmission would also be those which occurred within the friends' infectious periods, when the bacteria could spread to other birds (fig 1). the period of shedding during which a bird is infectious for other birds is unknown, and no estimates were available for a naturally occurring disease course. therefore, friends were assumed to be infectious for the maximum incubation time, or two years, as a starting point. exposure of the subject to friends that were not infected was considered for the same two-year period prior to the friend's final date in the study.
fig 1. diagram of potential transmission relationships and connectivity of birds in the network. the figure represents three example birds, assessed for the potential for each to have acquired infection from the other. each bird, or "subject", was defined to have an incubation period, initially set to the period occurring six to 24 months before the bird's final date in the study. any other bird that shared an enclosure with the subject during its incubation period was defined as a "friend" if the two birds shared the space during the second bird's infectious period. a friend's infectious period was initially set to the period occurring two years prior to its final date in the study. thus, the figure shows the incubation and infectious periods for each bird in the larger bars, while the smaller bars show the overlapping periods when the other two birds would be defined as its infectious "friends". the network edges were created from identifying the spatial and temporal overlap of potential incubation and infectious periods of subjects and friends in the study population. https://doi.org/10.1371/journal.pone.0237168.g001
fig 2a and 2b illustrate network assembly over time for an example subject and its friends. the transmission network was graphed using the kamada-kawai [30] algorithm and all visualizations and analyses were performed using r software, package: igraph [31].
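the edge-construction rule just described (a friend-to-subject connection whenever enclosure sharing of at least seven days falls inside the subject's assumed incubation window and the friend's assumed infectious window) can be sketched as below. this is our own illustration, not the authors' r/igraph code: the field names, the naive quadratic scan over bird pairs and the exact window arithmetic are assumptions, with the six-month and two-year windows taken from the text.

```python
from datetime import timedelta

MIN_INCUBATION = timedelta(days=182)    # assumed ~6 months
MAX_INCUBATION = timedelta(days=730)    # assumed ~2 years (also infectious window)
MIN_OVERLAP = timedelta(days=7)

def shared_days(s1, e1, s2, e2):
    """Length of the overlap between two closed date intervals."""
    start, end = max(s1, s2), min(e1, e2)
    return end - start if end > start else timedelta(0)

def transmission_edges(stays, end_date):
    """Directed edges friend -> subject for potential transmission.

    stays    : dict bird_id -> list of (enclosure, move_in, move_out)
    end_date : dict bird_id -> final date in study (death/export/censoring)
    """
    edges = set()
    for subject, s_stays in stays.items():
        inc_lo = end_date[subject] - MAX_INCUBATION   # subject incubation window
        inc_hi = end_date[subject] - MIN_INCUBATION
        for friend, f_stays in stays.items():
            if friend == subject:
                continue
            inf_lo = end_date[friend] - MAX_INCUBATION  # friend infectious window
            lo, hi = max(inc_lo, inf_lo), min(inc_hi, end_date[friend])
            if hi - lo < MIN_OVERLAP:
                continue
            overlap = timedelta(0)
            for enc_s, in_s, out_s in s_stays:
                for enc_f, in_f, out_f in f_stays:
                    if enc_s != enc_f:
                        continue
                    # overlap of the two stays, clipped to the joint window
                    overlap += shared_days(max(in_s, lo), min(out_s, hi),
                                           max(in_f, lo), min(out_f, hi))
            if overlap >= MIN_OVERLAP:
                edges.add((friend, subject))
    return edges
```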
fig 2. clustering of disease associated with all indirect contacts that could influence the subject's disease status (based on timing of contact). friends-of-friends: same environment. clustering of disease associated with influential indirect contacts that were exposed to the same enclosure/environment. friends-of-friends: contagion. clustering of disease associated with influential indirect contacts that were never exposed to the same environment. this evaluation is key for removing the confounding effects of the environment and testing for a contagious process. friends-of-friends: homophily. clustering of disease associated with friends of friends that were never exposed to the same environment and could not have transmitted disease to the subject based on the timing of the connection. this reverse-time placebo test evaluates our data for the presence of homophily, or whether disease clustering can be explained by similarities among connected birds. https://doi.org/10.1371/journal.pone.0237168.g002
an initial network was structured to include all connections of seven days or more between birds that occurred during their lifetimes. from this, the transmission network used in the analyses was constructed by refining connectivity based on the subjects' incubation periods and friends' infectious periods as described above. network topology was characterized by size (number of nodes and edges), average path length, and transitivity (probability that two connected birds both share a connection with another bird). to evaluate statistically whether or not the disease status of a subject is predicted by the disease status of its friend, we calculated the probability of mycobacteriosis in a subject given exposure to an additional infected friend relative to the probability of mycobacteriosis in a subject exposed to an additional non-infected friend, i.e., the relative risk (rr). to determine the significance of the rr, the observed rr was compared to the distribution of the same rr calculation on 1000 randomly generated null networks where the network topology and disease prevalence were preserved, but the disease status was randomly shuffled to different nodes [15, 32]. if the observed rr fell outside the range of permuted values between the 2.5th and 97.5th percentiles, i.e., the null 95% confidence interval (ci), then we rejected the null hypothesis that the observed relationship was due to chance alone. reported p-values were estimated from the null 95% ci. we evaluated the relative risk of disease transmission through five types of shared relationships between subjects and their friends (fig 2b). each evaluation targeted different groups of subject-friend pairs that varied in degrees of separation as well as in the spatial and temporal characteristics of the network edges. the first is the risk (also referred to as "clustering") of disease associated with directly connected birds, or "friends". this analysis examined all pairs of birds where the subject was in direct contact with its friend during the subject's defined incubation period and the friend's infectious period (illustrated in fig 1). the rr estimate includes the combined risk from direct exposure to both other infected birds and a common environmental source. the second analysis examined whether associations persisted among the indirectly connected friends, as observed in other contagious processes [15]. to identify these friends of friends, we constructed a matrix of shortest paths between all subject-friend pairs that never directly shared an enclosure but were indirectly connected through an intermediary bird.
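the permutation test described above, in which the observed rr is compared with rrs computed after shuffling disease status across birds while keeping the network and prevalence fixed, can be sketched as follows. this is an illustrative python version of ours rather than the authors' r code, and it operationalizes the rr simply as the ratio of subject infection probabilities over subject-friend pairs with infected versus uninfected friends.

```python
import random

def relative_risk(edges, infected):
    """RR = P(subject infected | friend infected) /
            P(subject infected | friend uninfected), over subject-friend pairs."""
    hit = {True: 0, False: 0}   # friend status -> infected subjects among pairs
    tot = {True: 0, False: 0}   # friend status -> number of pairs
    for friend, subject in edges:
        f = bool(infected[friend])
        tot[f] += 1
        hit[f] += 1 if infected[subject] else 0
    return (hit[True] / tot[True]) / (hit[False] / tot[False])

def rr_permutation_test(edges, infected, n_perm=1000, seed=42):
    """Null distribution of the RR obtained by shuffling disease status across
    birds while preserving network topology and overall prevalence."""
    rng = random.Random(seed)
    birds = list(infected)
    labels = [infected[b] for b in birds]
    observed = relative_risk(edges, infected)
    null = []
    for _ in range(n_perm):
        rng.shuffle(labels)
        null.append(relative_risk(edges, dict(zip(birds, labels))))
    null.sort()
    lo = null[int(0.025 * n_perm)]
    hi = null[min(int(0.975 * n_perm), n_perm - 1)]
    return observed, (lo, hi)
```

the observed rr would then be called significant when it falls outside the returned null 95% interval, mirroring the criterion used in the paper.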
before estimating the rr and conducting the random permutation tests, the data were limited to each subject's set of "influential" nodes, or the friends of friends connected by pathways that respect time ordering along which disease could propagate [33] . in other words, the friend of friend shared an enclosure with an intermediary bird before the intermediary bird contacted the subject. the estimated rr includes the indirect risk of disease from both contagion and exposure to a common environmental source. risk of disease transmission associated with influential friends of friends sharing an environment with their subject. this analysis examined associations with the subset of all influential friends of friends, where both birds were in the same enclosure but not at the same time. for example, bird a shares an enclosure with c. if a moves out and b subsequently moves in, then b is exposed to a via c. importantly, both a and b also were exposed to the same environment. associations in this group would reflect a combination of risk due to common environmental exposure and contagion. for contagion, we evaluated associations with the influential friends of friends that were never in the same enclosure as their subject. from the earlier example, if bird c is moved from an enclosure with bird a to an enclosure with bird b, then b is exposed to a via c. that is, a can transmit infection to b, even though they never shared an enclosure. case clustering could not be attributed to exposure to the same environment because the subject and its friends of friends were never housed in the same enclosure. this evaluation also ensured correct temporal alignment between exposure to an infectious agent and disease outcome in the subject. this comparison was key for removing confounding effects of environmental exposure and testing for a contagious process. although disease clustering among friends of friends could represent a contagious process, there is a possibility that some of the association could be explained by homophily, i.e., that connected birds could be more alike than the general bird population in terms of species, behavior, susceptibility, enclosure characteristics, etc. [19] . this could make both birds more likely to acquire infection from any source and manifest as clustering on a network at degrees of separation. we tested the network for the presence of homophily using a reverse-time placebo test. for this test, we evaluated disease clustering between a subject and its friends of friends from different enclosures that could not have transmitted infection based on the timing of the contact. for the tests of contagion, we described how b could be exposed to a via c; however, in that same example, the reverse would not be true. b could not transmit infection to a because disease transmission is time-dependent. for our reverse-time placebo test, we evaluated whether the infection status of b predicted the infection status in a. if so, then it would suggest homophily is present and driving disease clustering. sensitivity analyses were performed to compare differences in rr estimates while varying model assumptions. we varied subjects' incubation time (testing a minimum of three months and a maximum of one, three, four and five years) and friends' infectious time (two years, one year, and six months). we also refined network edges to evaluate associations in subsets of data where biases were minimized. 
this included limiting the friends to those whose exposure to the subject was exclusively outside of the two-year infectious window. it also included refining network edges to contact between subjects and friends that occurred only in small enclosures where enclosure sharing may be a better proxy of true exposure. finally, we limited analyses to subjects and friends that died and received a post-mortem examination. the 16,430 birds in the source population consisted of 950 species and subspecies housed across 848 enclosures. mycobacteriosis was diagnosed in 275 of these birds (1.7%). the subset that qualified as study subjects included 13,409 of the birds, which represented 810 species and subspecies. subjects were housed across 837 different enclosures that varied in size, housing anywhere from one to over 200 birds at any given time. in total, 203 (1.5%) subjects developed mycobacteriosis. subjects were present in the study population for variable amounts of time with the median follow-up being 3.4 years (iqr: 1.4-7 years). on average, subjects moved between enclosures 4.4 times (sd: 4.1; range: 0-71), and were housed in three separate enclosures (sd: 2.5; range 1-26). the average time a subject spent with each friend was about ten months (314 days; sd: 201 days). the full network that included all subject-friend connections contained 2,492,438 edges, but we focused on the transmission network limited to plausible fecal-oral transmission routes based on sharing an enclosure for at least 7 days during the subjects' incubation periods and its friends' infectious periods. this transmission network included all 16,430 birds with 905,499 connections linking their temporal and spatial location. the median number of friends each subject contacted (network degree centrality), was 105 (iqr: 21-303; range: 0-1435). the network exhibited small world properties [34] with short paths (average path length = 3.8) and many cliques where groups of birds were all connected to each other (transitivity = 0.63). a portion of the network diagram that includes subjects infected with avian mycobacteriosis and their directly connected friends is shown (fig 3) . results from all five associations are shown in fig 4 and rr estimates with p-values are reported in table 1 . when we performed our test between the subject and its directly connected friends we found significant clustering of cases based on social network ties; the risk of mycobacteriosis given exposure to an infected friend was 7.0 times greater than the risk of mycobacteriosis given exposure to an uninfected friend (p<0.001). significant associations persisted among the friends of friends. the rr of disease given exposure to any influential, infected friend of friend, compared to exposure to an uninfected friend of friend, was 1.35 (p<0.001). when subset to just the influential friends of friends that shared the same environment, the rr was 1.47 (p = 0.004). importantly, the friends-of-friends contagion model identified a significant 31% increase in risk of infection among subjects that were exposed to an infected friend of friend compared to those exposed to an uninfected friend of friend (rr: 1.31, p = 0.004). we found no evidence of homophily with our reverse-time placebo test; i.e., there was no significant association when the friends of friends were limited to those who may have correlated traits, but could not have influenced the subject's disease status based on location and timing of their indirect connection (rr: 0.95; p = 0.586). 
results of the sensitivity analyses for all five evaluated relationships are shown in table 1. the sensitivity analyses did not yield drastically different findings from the analyses of the main network, and the significance of most associations remained. generally, as the subjects' incubation periods increased, the magnitude of the rrs with the friends and friends of friends decreased. this same pattern was observed when connectivity was limited to that occurring two years prior to the friends' removal dates (i.e., outside of the friends' incubation windows). patterns of significance were mostly unchanged when the network was limited to just animals with post-mortem exams, and to just birds housed in small enclosures. importantly, significant disease clustering in the test for contagion persisted in most examined network variations. the exception was when the subjects' maximum incubation periods or the friends' infectious periods became more narrowly defined. our friends-of-friends network analysis suggests that avian mycobacteriosis can spread through bird social networks. although connected birds may acquire infection from exposure to common environmental sources and may share features that make them more likely to acquire disease through the environment, our friends-of-friends method detected statistically significant bird-to-bird transmission. one of the biggest challenges in determining if bird-to-bird contagion is present for infectious agents that are present in the environment, such as mycobacteria, is distinguishing the role of the environment. in one scenario, the environment serves as an intermediate collection place for mycobacteria being passed via (mostly) fecal contamination from an infected bird to one or more other birds, leading to infection spread in chain- or web-like patterns across a network [35]. alternatively, the environment may serve as the natural, independent reservoir of mycobacteria (e.g., biofilms in the water [36]), giving rise to opportunistic infection among birds that share the location. spatial and temporal disease clustering could represent either or both of these two infection routes. homophily, where connected individuals tend to be more alike in species or habitat needs than the general population and, therefore, may share the same disease susceptibility, could occur in both scenarios. for the directly connected birds in our study, the significantly elevated rr represented a combination of these three effects. examining the friends of friends rather than directly connected birds provided a means to disassociate exposure to another bird from exposure to that bird's environment. at two degrees of separation, the characteristics of network edges were more distinct, with temporal separation in potential transmission pathways and spatial separation in location. we exploited these pathways in a stepwise approach to calculate the rr of disease given exposure to friends of friends with different types of network ties. the subset of all influential friends of friends was temporally aligned to pass infection, but this group again represented a combined effect of multiple transmission pathways. because there was no evidence of significant homophily (further discussed below), we could use the network structure to test for the presence of contagion. among subjects who were connected to infected friends of friends in a different enclosure, the significant increase in risk for mycobacteriosis represents contagion. while this very specific subset of network edges allowed us to disentangle environmental and contagious transmission, it required two consecutive infections among a chain of related birds. this ignored most subjects and their friends of friends that shared enclosures, where both processes were possible and completely confounded. while our extensive, long-term set of connections in this network allowed detection of disease transmission using just this subset, the relative risks likely underestimate the true magnitude of bird-to-bird contagion.
fig 4. (n = 16,430). the estimated relative risk (rr) for each of five different relationships between subjects and friends that were directly and indirectly connected. evaluated relationships are described in the methods and fig 2b. significance of the estimate was determined by comparing the conditional probability of mycobacteriosis in the observed network with 1000 permutations of an identical network (with the topology and incidence of mycobacteriosis preserved) in which the same number of infected birds were randomly distributed. error bars show the null 95% confidence intervals generated from the random permutations. rrs that were outside of the null and significant are indicated with *. https://doi.org/10.1371/journal.pone.0237168.g004
among subjects who were connected to infected friends of friends in a different enclosure, the significant increase in risk for mycobacteriosis represents contagion. while this very specific subset of network edges allowed us to disentangle environmental and contagious transmission, it required two consecutive infections among a chain of related birds. this ignored most subjects and their friends of friends that shared enclosures where both processes were possible and completely confounded. while our extensive, long-term set of connections = 16,430) . the estimated relative risk (rr) for each of five different relationships between subjects and friends that were directly and indirectly connected. evaluated relationships are described in the methods and fig 2b. significance of the estimate was determined by comparing conditional probability of mycobacteriosis in the observed network with 1000 permutations of an identical network (with the topology and incidence of mycobacteriosis preserved) in which the same number of infected birds were randomly distributed. error bars show the null 95% confidence intervals generated from the random permutations. rrs that were outside of the null and significant are indicated with � . https://doi.org/10.1371/journal.pone.0237168.g004 in this network allowed detection of disease transmission using just this subset, the relative risks likely underestimate the true magnitude of bird-to-bird contagion. our data show significant, directional clustering along the pathways on which disease could propagate; however we did not find clustering when we reversed these pathways-where birds were connected, but disease could not be transmitted because passing an infection cannot move backwards through time. we applied our test of directionality, which is similar to those used by others [15] , to evaluate whether homophily could be driving the observed associations. in this bird population, similar species with comparable habitat needs have always tended to be housed together. therefore, we would expect biases due to homophily would exert similar effects along all pathways of connectivity, regardless of time. it is well documented that homophily and contagion are confounded in social networks [37, 38] and we could not specifically adjust the rrs for unobserved homophily; however many of the psychosocial factors that lead to homophily in human networks [19, 38] are not directly applicable to birds. while homophily table 1 , 1992-2014 (n = 16,430) . subjects network edges the five evaluated relationships are described in detail in the methods and fig 2b. the calculated statistic is the probability that a subject has disease, given that its friend has disease, compared to the probability that a subject has disease given that its friend does not (i.e., rr). to determine whether the observed rr falls within the 2.5 th and 97.5 th percentile of the null distribution, the disease status was randomly shuffled in 1000 network permutations where the network structure and prevalence of mycobacteriosis was preserved. significant p values indicate that the observed rr fell outside of the null 95% ci and we reject the null hypothesis that the observed rr is due to chance alone. https://doi.org/10.1371/journal.pone.0237168.t001 might still be present, our data strongly suggest that it is not driving the observed clustering of disease between a subject and its friends of friends. 
historically, in experimental infection studies, birds have been shown to be susceptible to the infectious bacilli when directly administered, i.e., introduced intravenously, intramuscularly, intraperitoneally, subcutaneously, or orally [39] [40] [41] [42] . yet, the relevance of direct inoculation to natural transmission has always been tenuous. some studies have shown little to no transmission when healthy chickens were placed in contact with either diseased birds or their contaminated environments [43] . therefore, our study provides new evidence, which supports bird-to-bird transmission in natural settings. our results also suggest that avian mycobacteriosis is not highly contagious, which is consistent with early experimental studies that conclude the bacteria must be given repeatedly over long periods of time to ensure infection [1] . the small world network structure that we identified for birds in the study population would predict epidemic-style outbreaks for diseases with facile and rapid transmission [34, 44] ; however, most birds did not acquire infection even when directly linked to other positive birds for long periods. over time, we have not seen epidemics and the incidence of disease in this population is consistently low (1%) [3] . our network approach was elucidating in this particular scenario, enabling us to uncover subtle patterns of a contagious process. environmental mycobacteria are recognized as the cause for ntm infections in humans and other animals [9] [10] [11] . limited genetic and speciation data from managed avian populations have found multiple strains and species of mycobacteria attributed to single outbreaks [6] [7] [8] . in our bird population, several different species and genotypes of mycobacteria have also been identified [5, 45] . consequently, we know that some birds could not have passed the infection to each other. genetic data from mycobacterial isolates would be a more definitive method of identifying the transmission of infection within a shared environment. for the present study, our approach was to isolate and test for contagion when there is missing information on the specific etiologic agents and transmission pathways. additional studies using genetic data could refine relevant transmission pathways or highlight important environmental sources within the network. we took care in assembling our network to ensure that the edge construction between subjects and friends adhered to general recommendations for disease networks [26, 46, 47] . this included incorporating biologically meaningful time-periods relevant to mycobacterial disease ecology and the type of exposure needed for transmission. generally, mycobacteriosis is considered a chronic disease, with an incubation period that can last for months and possibly years [1, 2] . it is also thought that animals can insidiously shed the organisms for long periods of time and those organisms can potential stay viable in the environment for years [48, 49] . we know there is misclassification of exposure in this network, because the true extents of incubation and infectious periods are wide, variable, and unknown. in sensitivity analyses, our rr estimates were generally similar when we varied incubation and infectious periods (table 1) . we did find a significant rr when limiting network edges to those occurring before the friends' 2-year incubation period, which suggests that some contagious processes may occur before the 2-year window. 
we also found that evidence for contagion was lost when either the subject incubation period or friend infectious period was short (less than six months and less than one year, respectively). it is likely that the shorter incubation times did not allow sufficient overlap of risk periods between subjects and friends. the duration of exposure needed for transmission is also unknown, but birds can be housed together for a year or more and not acquire infection [1, 4] . generally, mathematical models show that increasing the intensity or duration of contact between individuals with an infectious disease increases the probability of a transmission event and this can be reflected in weighted networks [35, 50] . in the present study, we required a minimum of seven days together to establish a network link that could capture relevant, short-duration exposure; however, the majority of birds were together for longer, with the mean contact-days being about 10.5 months (314 days). further exploration of contact heterogeneity on network associations may provide additional insight into clinically relevant exposure, infectious periods, and incubation times. inferring contagion by testing for disease clustering in subsets of the network requires quite complete network ascertainment, very good information on location over time, knowledge of disease outcomes, and a large number of subjects and their connected friends over time. our zoo data were unique in this respect and represent an example of how network substructures can inform global disease processes. many of the issues that cause bias in network measures, such as node censoring [51] , network boundary specification [52] , or unfriending [53] are unlikely to have affected our findings due to the completeness of our data. while such data may currently be rare, large datasets with similar network resolution may become widely available in the future as the world becomes increasingly connected by technology. for example, many new public and private contact-tracing initiatives are taking advantage of mobile phone technology to digitally track covid-19. eventually, these may allow near-complete human disease transmission networks to be assembled. this makes our friends-of-friends social network approach using network substructures a viable option for informing indirect covid-19 transmission pathways and public policy. most epidemiologic studies that use a network approach focus on directly transmitted, infectious diseases [47] . social networks to investigate diseases transmitted through the environment are assembled less often because defining contact in the presence of environmental persistence or other important transmission routes, such as fomites or insects, can be challenging [26] . to our knowledge, this is the first application of a friends-of-friends method to determine whether global patterns of connectivity support a contagious process. similar approaches could be useful to investigate diseases of humans or animals when the network is complete and mobility patterns are known, but the disease etiology or transmission pathways are unknown. avian tuberculosis infections. baltimore: the williams & wilkins company diseases of poultry investigation of characteristics and factors associated with avian mycobacteriosis in zoo birds investigation of factors predicting disease among zoo birds exposed to avian mycobacteriosis molecular epidemiology of mycobacterium avium subsp. 
avium and mycobacterium intracellulare in captive birds pcr-based typing of mycobacterium avium isolates in an epidemic among farmed lesser white-fronted geese (anser erythropus) mycobacterium avium subsp. avium distribution studied in a naturally infected hen flock and in the environment by culture, serotyping and is901 rflp methods avian tuberculosis in naturally infected captive water birds of the ardeideae and threskiornithidae families studied by serotyping, is901 rflp typing, and virulence for poultry current epidemiologic trends of the nontuberculous mycobacteria (ntm) primary mycobacterium avium complex infections correlate with lowered cellular immune reactivity in matschie's tree kangaroos (dendrolagus matschiei) simian immunodeficiency virus-inoculated macaques acquire mycobacterium avium from potable water during aids the spread of obesity in a large social network over 32 years dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the framingham heart study social network sensors for early detection of contagious outbreaks social contagion theory: examining dynamic social networks and human behavior prolonged outbreak of mycobacterium chimaera infection after open-chest heart surgery geographic prediction of human onset of west nile virus using dead crow clusters: an evaluation of year 2002 data in new york state cancer cluster investigations: review of the past and proposals for the future birds of a feather: homophily in social networks relationships between mycobacterium isolates from patients with pulmonary mycobacterial infection and potting soils mycobacterium chimaera outbreak associated with heater-cooler devices: piecing the puzzle together contact networks in a wild tasmanian devil (sarcophilus harrisii) population: using social network analysis to reveal seasonal variability in social behaviour and its implications for transmission of devil facial tumour disease badger social networks correlate with tuberculosis infection integrating association data and disease dynamics in a social ungulate: bovine tuberculosis in african buffalo in the kruger national park influence of contact heterogeneity on tb reproduction ratio r 0 in a free-living brushtail possum trichosurus vulpecula population infectious disease transmission and contact networks in wildlife and livestock untersuchungen uber die tuberkulinkehllappenprobe beim huhn die empfanglichkeit des huhnes fur tuberkulose unter normalen haltungsbedingungen seasonal distribution as an aid to diagnosis of poultry diseases an algorithm for drawing general undirected graphs the igraph software package for complex network research a guide to null models for animal social network analysis network reachability of real-world contact sequences dynamics, and the small-world phenomenon mathematical models of infectious disease transmission surrounded by mycobacteria: nontuberculous mycobacteria in the human environment homophily and contagion are generically confoudnded in observational social network studies origins of homophily in an evolving social network contribution to the experimental infection of young chickens with mycobacterium avium morphological changes in geese after experimental and natural infection with mycobacterium avium serotype 2 a model of avian mycobacteriosis: clinical and histopathologic findings in japanese quail (coturnix coturnix japonica) intravenously inoculated with mycobacterium avium experimental infection of budgerigars (melopsittacus undulatus) with five 
mycobacterium species bulletin north dakota agricultural experimental station mathematical studies on human disease dynamics: emerging paradigms and challanges whole-genome analysis of mycobacteria from birds at the san diego zoo network transmission inference : host behavior and parasite life cycle make social networks meaningful in disease ecology networks and the ecology of parasite transmission: a framework for wildlife parasitology avian tuberculoisis: collected studies. technical bulletin of the north dakota agricultural experimental station field manual of wildlife diseases: general field procedures and diseases of birds. usgs-national wildlife health center epidemic processes in complex networks censoring outdegree compromises inferences of social network peer effects and autocorrelation social networks and health: models, methods, and applications the "unfriending" problem: the consequences of homophily in friendship retention for causal estimates of social influence we thank the many people from sdzg that made this work possible, including the disease investigations team and veterinary clinical staff for ongoing disease surveillance, and the animal management staff for tracking housing histories of birds. we thank caroline baratz, dave rimlinger, michael mace, and the animal care staff for assistance with historical enclosure data. we thank richard shaffer, florin vaida, and christina sigurdson for thoughtful comments on manuscript preparation. key: cord-143847-vtwn5mmd authors: ryffel, th'eo; pointcheval, david; bach, francis title: ariann: low-interaction privacy-preserving deep learning via function secret sharing date: 2020-06-08 journal: nan doi: nan sha: doc_id: 143847 cord_uid: vtwn5mmd we propose ariann, a low-interaction framework to perform private training and inference of standard deep neural networks on sensitive data. this framework implements semi-honest 2-party computation and leverages function secret sharing, a recent cryptographic protocol that only uses lightweight primitives to achieve an efficient online phase with a single message of the size of the inputs, for operations like comparison and multiplication which are building blocks of neural networks. built on top of pytorch, it offers a wide range of functions including relu, maxpool and batchnorm, and allows to use models like alexnet or resnet18. we report experimental results for inference and training over distant servers. last, we propose an extension to support n-party private federated learning. the massive improvements of cryptography techniques for secure computation over sensitive data [15, 13, 28] have spurred the development of the field of privacy-preserving machine learning [45, 1] . privacy-preserving techniques have become practical for concrete use cases, thus encouraging public authorities to use them to protect citizens' data, for example in covid-19 apps [27, 17, 38, 39] . however, tools are lacking to provide end-to-end solutions for institutions that have little expertise in cryptography while facing critical data privacy challenges. a striking example is hospitals which handle large amounts of data while having relatively constrained technical teams. secure multiparty computation (smpc) is a promising technique that can efficiently be integrated into machine learning workflows to ensure data and model privacy, while allowing multiple parties or institutions to participate in a joint project. 
in particular, smpc provides intrinsic shared governance: because data are shared, none of the parties can decide alone to reconstruct it. this is particularly suited for collaborations between institutions willing to share ownership on a trained model. use case. the main use case driving our work is the collaboration between healthcare institutions such as hospitals or clinical research laboratories. such collaboration involves a model owner and possibly several data owners like hospitals. as the model can be a sensitive asset (in terms of intellectual property, strategic asset or regulatory and privacy issues), standard federated learning [29, 7] that does not protect against model theft or model retro-engineering [24, 18] is not suitable. to data centers, but are likely to remain online for long periods of time. last, parties are honestbut-curious, [20, chapter 7.2.2] and care about their reputation. hence, they have little incentive to deviate from the original protocol, but they will use any information available in their own interest. contributions. by leveraging function secret sharing (fss) [9, 10] , we propose the first lowinteraction framework for private deep learning which drastically reduces communication to a single round for basic machine learning operations, and achieves the first private evaluation benchmark on resnet18. • we build on existing work on function secret sharing to design compact and efficient algorithms for comparison and multiplication, which are building blocks of neural networks. they are highly modular and can be assembled to build complex workflows. • we show how these blocks can be used in machine learning to implement operations for secure evaluation and training of arbitrary models on private data, including maxpool and batchnorm. we achieve single round communication for comparison, convolutional or linear layers. • last, we provide an implementation 1 and demonstrate its practicality both in lan (local area network) and wan settings by running secure training and inference on cifar-10 and tiny imagenet with models such as alexnet [31] and resnet18 [22] . related work. related work in privacy-preserving machine learning encompasses smpc and homomorphic encryption (he) techniques. he only needs a single round of interaction but does not support efficiently non-linearities. for example, ngraph-he [5] and its extensions [4] build on the seal library [44] and provide a framework for secure evaluation that greatly improves on the cryptonet seminal work [19] , but it resorts to polynomials (like the square) for activation functions. smpc frameworks usually provide faster implementations using lightweight cryptography. minionn and deepsecure [34, 41] use optimized garbled circuits [50] that allow very few communication rounds, but they do not support training and alter the neural network structure to speed up execution. other frameworks such as sharemind [6] , secureml [36] , securenn [47] or more recently falcon [48] rely on additive secret sharing and allow secure model evaluation and training. they use simpler and more efficient primitives, but require a large number of rounds of communication, such as 11 in [47] or 5 + log 2 (l) in [48] (typically 10 with l = 32) for relu. aby [16] , chameleon [40] and more recently aby 3 [35] mix garbled circuits, additive or binary secret sharing based on what is most efficient for the operations considered. however, conversion between those can be expensive and they do not support training except aby 3 . 
last, works like gazelle [26] combine he and smpc to make the most of both, but conversion can also be costly. works on trusted execution environment are left out of the scope of this article as they require access to dedicated hardware [25] . data owners which cannot afford these secure enclaves might be reluctant to use a cloud service and to send their data. notations. all values are encoded on n bits and live in z 2 n . note that for a perfect comparison, y + α should not wrap around and become negative. because y is in practice small compared to the n-bit encoding amplitude, the failure rate is less than one comparison in a million, as detailed in appendix c.1. security model. we consider security against honest-but-curious adversaries, i.e., parties following the protocol but trying to infer as much information as possible about others' input or function share. this is a standard security model in many smpc frameworks [6, 3, 40, 47] and is aligned with our main use case: parties that would not follow the protocol would face major backlash for their reputation if they got caught. the security of our protocols relies on indistinguishability of the function shares, which informally means that the shares received by each party are computationally indistinguishable from random strings. a formal definition of the security is given in [10] . about malicious adversaries, i.e., parties who would not follow the protocol, as all the data available are random, they cannot get any information about the inputs of the other parties, including the parameters of the evaluated functions, unless the parties reconstruct some shared values. the later and the fewer values are reconstructed, the better it is. as mentioned by [11] , our protocols could be extended to guarantee security with abort against malicious adversaries using mac authentication [15] , which means that the protocol would abort if parties deviated from it. our algorithms for private equality and comparison are built on top of the work of [10] , so the security assumptions are the same as in this article. however, our protocols achieve higher efficiency by specializing on the operations needed for neural network evaluation or training. we start by describing private equality which is slightly simpler and gives useful hints about how comparison works. the equality test consists in comparing a public input x to a private value α. evaluating the input using the function keys can be viewed as walking a binary tree of depth n, where n is the number of bits of the input (typically 32). among all the possible paths, the path from the root down to α is called the special path. figure 1 illustrates this tree and provides a compact representation which is used by our protocol, where we do not detail branches for which all leaves are 0. evaluation goes as follows: two evaluators are each given a function key which includes a distinct initial random state (s, t) ∈ {0, 1} λ × {0, 1}. each evaluator starts from the root, at each step i goes down one node in the tree and updates his state depending on the bit x[i] using a common correction word cw (i) ∈ {0, 1} 2(λ+1) from the function key. at the end of the computation, each evaluation outputs t. as long as x[i] = α[i], the evaluators stay on the special path and because the input x is public and common, they both follow the same path. if a bit x[i] = α[i] is met, they leave the special path and should output 0 ; else, they stay on it all the way down, which means that x = α and they should output 1. 
the main idea is that while they are on the special path, evaluators should have states (s 0 , t 0 ) and (s 1 , t 1 ) respectively, such that s 0 and s 1 are i.i.d. and t 0 ⊕ t 1 = 1. when they leave it, the correction word should act to have s 0 = s 1 but still indistinguishable from random and t 0 = t 1 , which ensures t 0 ⊕ t 1 = 0. each evaluator should output its t j and the result will be given by t 0 ⊕ t 1 . the formal description of the protocol is given below and is composed of two parts: first, in algorithm 1, the keygen algorithm consists of a preprocessing step to generate the functions keys, and then, in algorithm 2, eval is run by two evaluators to perform the equality test. it takes as input the private share held by each evaluator and the function key that they have received. they use g : {0, 1} λ → {0, 1} 2(λ+1) , a pseudorandom generator, where the output set is {0, 1} λ+1 ×{0, 1} λ+1 , and operations modulo 2 n implicitly convert back and forth n-bit strings into integers. intuitively, the correction words cw (i) are built from the expected state of each evaluator on the special path, i.e., the state that each should have at each node i if it is on the special path given some initial state. during evaluation, a correction word is applied by an evaluator only when it has t = 1. hence, on the special path, the correction is applied only by one evaluator at each bit. algorithm 1: keygen: key generation for equality to α if at step i, the evaluator stays on the special path, the correction word compensates the current states of both evaluators by xor-ing them with themselves and re-introduces a pseudorandom value s (either s r 0 ⊕ s r 1 or s l 0 ⊕ s l 1 ), which means the xor of their states is now (s, 1) but those states are still indistinguishable from random. on the other hand, if x[i] = α[i], the new state takes the other half of the correction word, so that the xor of the two evaluators states is (0, 0). from there, they have the same states and both have either t = 0 or t = 1. they will continue to apply the same corrections at each step and their states will remain the same with t 0 ⊕ t 1 = 0. a final computation is performed to obtain shared [[t ]] modulo 2 n of the result bit t = t 0 ⊕ t 1 ∈ {0, 1} shared modulo 2. from the privacy point of view, when the seed s is (truly) random, g(s) also looks like a random bit-string (this is a pseudorandom bit-string). each half is used either in the cw or in the next state, but not both. therefore, the correction words cw (i) do not contain information about the expected states and for j = 0, 1, the output k j is independently uniformly distributed with respect to α and 1−j , in a computational way. as a consequence, at the end of the evaluation, for j = 0, 1, t j also follows a distribution independent of α. until the shared values are reconstructed, even a malicious adversary cannot learn anything about α nor the inputs of the other player. function keys should be sent to the evaluators in advance, which requires one extra communication of the size of the keys. we use the trick of [10] to reduce the size of each correction word in the keys, from 2(1 + λ) to (2 + λ) by reusing the pseudo-random λ-bit string dedicated to the state used when leaving the special path for the state used for staying onto it, since for the latter state the only constraint is the pseudo-randomness of the bitstring. 
our major contribution to the function secret sharing scheme is regarding comparison (which allows to tackle non-polynomial activation functions for neural networks): we build on the idea of the equality test to provide a synthetic and efficient protocol whose structure is very close from the previous one. instead of seeing the special path as a simple path, it can be seen as a frontier for the zone in the tree where x ≤ α. to evaluate x ≤ α, we could evaluate all the paths on the left of the special path and then sum up the results, but this is highly inefficient as it requires exponentially many evaluations. our key idea here is to evaluate all these paths at the same time, noting that each time one leaves the special path, it either falls on the left side (i.e., x < α) or on the right side (i.e., x > α). hence, we only need to add an extra step at each node of the evaluation, where depending on the bit value x[i], we output a leaf label which is 1 only if x[i] < α[i] and all previous bits are identical. only one label between the final label (which corresponds to x = α) and the leaf labels can be equal to one, because only a single path can be taken. therefore, evaluators will return the sum of all the labels to get the final output. the full description of the comparison protocol is detailed in appendix a, together with a detailed explanation of how it works. we now apply these primitives to a private deep learning setup in which a model owner interacts with a data owner. the data and the model parameters are sensitive and are secret shared to be kept private. the shape of the input and the architecture of the model are however public, which is a standard assumption in secure deep learning [34, 36] . all our operations are modular and follow this additive sharing workflow: inputs are provided secret shared and are masked with random values before being revealed. this disclosed value is then consumed with preprocessed function keys to produce a secret shared output. each operation is independent of all surrounding operations, which is known as circuit-independent preprocessing [11] and implies that key generation can be fully outsourced without having to know the model architecture. this results in a fast runtime execution with a very efficient online communication, with a single round of communication and a message size equal to the input size for comparison. preprocessing is performed by a trusted third party to build the function keys. this is a valid assumption in our use case as such third party would typically be an institution concerned about its image, and it is very easy to check that preprocessed material is correct using a cut-and-choose technique [51] . matrix multiplication (matmul). as mentioned by [11] , multiplication fit in this additive sharing workflow. we use beaver triples [2] ]. matrix multiplication is identical but uses matrix beaver triples [14] . relu activation function is supported as a direct application of our comparison protocol, which we combine with a point wise multiplication. convolution can be computed as a single matrix multiplication using an unrolling technique as described in [12] and illustrated in figure 3 in appendix c.2. argmax operator used in classification to determine the predicted label can also be computed in a constant number of rounds using pairwise comparisons as shown by [21] . the main idea here is, given a vector (x 0 , . . . , x m−1 ), to compute the matrix m ∈ r m−1×m where each row m i = (x i+1 mod m , . . . , x i+m+1 mod m ). 
then, each element of column j is compared to x j , which requires m(m − 1) parallel comparisons. a column j where all elements are lower than x j indicates that j is a valid result for the argmax. maxpool can be implemented by combining these two methods: the matrix is first unrolled like in figure 3 and the maximum of each row in then computed using parallel pairwise comparisons. more details and an optimization when the kernel size k equals 2 are given in appendix c.3. batchnorm is implemented using a approximate division with newton's method as in [48] : given an input x = (x 0 , . . . , x m−1 ) with mean µ and variance σ 2 , we return γ ·θ · ( x − µ) + β. variables γ and β are learnable parameters andθ is the estimate inverse of √ σ 2 + with 1 and is computed iteratively using: θ i+1 = θ i · (3 − (σ 2 + ) · θ 2 i )/2. more details can be found in appendix c.4. more generally, for more complex activation functions such as softmax, we can use polynomial approximations methods, which achieve acceptable accuracy despite involving a higher number of rounds [37, 23, 21] . table 1 summarizes the online communication cost of each operation, and shows that basic operations such as comparison have a very efficient online communication. we also report results from [48] which achieve good experimental performance. these operations are sufficient to evaluate real world models in a fully private way. to also support private training of these models, we need to perform a private backward pass. as we overload operations such as convolutions or activation functions, we cannot use the built-in autograd functionality of pytorch. therefore, we have developed a custom autograd functionality, where we specify how to compute the derivatives of the operations that we have overloaded. backpropagation also uses the same basic blocks than those used in the forward pass. this 2-party protocol between a model owner and a data owner can be extended to an n-party federated learning protocol where several clients contribute their data to a model owned by an orchestrator server. this approach is inspired by secure aggregation [8] but we do not consider here clients being phones which means we are less concerned with parties dropping before the end of the protocol. in addition, we do not reveal the updated model at each aggregation or at any stage, hence providing better privacy than secure aggregation. at the beginning of the interaction, the server and model owner initializes its model and builds n pairs of additive shares of the model parameters. for each pair i, it keeps one of the shares and sends the other one to the corresponding client i. then, the server runs in parallel the training procedure with all the clients until the aggregation phase starts. aggregation for the server shares is straightforward, as the n shares it holds can be simply locally averaged. but the clients have to average their shares together to get a client share of the aggregated model. one possibility is that clients broadcast their shares and compute the average locally. however, to prevent a client colluding with the server from reconstructing the model contributed by a given client, they hide their shares using masking. this can be done using correlated random masks: client i generates a seed, sends it to client i + 1 while receiving one from client i − 1. client i then generates a random mask m i using its seed and another m i−1 using the one of client i − 1 and publishes its share masked with m i − m i−1 . 
as the masks cancel each other out, the computation will be correct. we follow a setup very close to [48] and assess inference and training performance of several networks on the datasets mnist [33] , cifar-10 [30] , 64×64 tiny imagenet and 224×224 tiny imagenet [49, 42] , presented in appendix d.1. more precisely, we assess 5 networks as in [48] : a fully-connected network (network-1), a small convolutional network with maxpool (network-2), lenet [32] , alexnet [31] and vgg16 [46] . furthermore, we also include resnet18 [22] which to the best of our knowledge has never been studied before in private deep learning. the description of these networks is taken verbatim from [48] and is available in appendix d.2. our implementation is written in python. to use our protocols that only work in finite groups like z 2 32 , we convert our input values and model parameters to fixed precision. to do so, we rely on the pysyft library [43] protocol. however, our inference runtimes reported in table 2 compare favourably with existing work including [34-36, 47, 48] , in the lan setting and particularly in the wan setting thanks to our reduced number of communication rounds. for example, our implementation of network-1 is 2× faster than the best previous result by [35] in the lan setting and 18× faster in the wan setting compared to [48] . for bigger networks such as alexnet on cifar-10, we are still 13× faster in the wan setting than [48] . results are given for a batched evaluation, which allows parallelism and hence faster execution as in [48] . for larger networks, we reduce the batch size to have the preprocessing material (including the function keys) fitting into ram. test accuracy. thanks to the flexibility of our framework, we can train each of these networks in plain text and need only one line of code to turn them into private networks, where all parameters are secret shared. we compare these private networks to their plaintext counterparts and observe that the accuracy is well preserved as shown in table 3 . if we degrade the encoding precision, which by default considers values in z 2 32 , and the fixed precision which is by default of 3 decimals, performance degrades as shown in appendix b. training. we can either train from scratch those networks or fine tune pre-trained models. training is an end-to-end private procedure, which means the loss and the gradients are never accessible in plain text. we use stochastic gradient descent (sgd) which is a simple but popular optimizer, and support both hinge loss and mean square error (mse) loss, as other losses like cross entropy which is used in clear text by [48] cannot be computed over secret shared data without approximations. we report runtime and accuracy obtained by training from scratch the smaller networks in table 4 . note that because of the number of epochs, the optimizer and the loss chosen, accuracy does not match best known results. however, the training procedure is not altered and the trained model will be strictly equivalent to its plaintext counterpart. training cannot complete in reasonable time for larger networks, which are anyway available pre-trained. note that training time includes the time spent building the preprocessing material, as it cannot be fully processed in advance and stored in ram. discussion. for larger networks, we could not use batches of size 128. 
this is mainly due to the size of the comparison function keys which is currently proportional to the size of the input tensor, with a multiplication factor of nλ where n = 32 and λ = 128. optimizing the function secret sharing protocol to reduce those keys would lead to massive improvements in the protocol's efficiency. our implementation actually has more communication than is theoretically necessary according to table 1 , suggesting that the experimental results could be further improved. as we build on top of pytorch, using machines with gpus could also potentially result in a massive speed-up, as an important fraction of the execution time is dedicated to computation. last, accuracies presented in table 3 and table 4 do not match state-of-the-art performance for the models and datasets considered. this is not due to internal defaults of our protocol but to the simplified training procedure we had to use. supporting losses such as the logistic loss, more complex optimizers like adam and dropout layers would be an interesting follow-up. one can observe the great similarity of structure of the comparison protocol given in algorithm 3 and 4 with the equality protocol from algorithm 1 and 2: the equality test is performed in parallel with an additional information out i at each node, which holds a share of either 0 when the evaluator stays on the special path or if it has already left it at a previous node, or a share of α[i] when it leaves the special path. this means that if α[i] = 1, leaving the special path implies that x[i] = 0 and hence x ≤ α, while if α[i] = 0, leaving implies x[i] = 1 so x > α and the output should be 0. the final share out n+1 corresponds the previous equality test. note that in all these computations modulo 2 n , while the bitstrings s j · cw (i) ) = ((state j,0 , state j,1 ), (state j,0 , state j,1 )) 9 parse s we have studied the impact of lowering the encoding space of the input to our function secret sharing protocol from z 2 32 to z 2 k with k < 32. finding the lowest k guaranteeing good performance is an interesting challenge as the function keys size is directly proportional to it. this has to be done together with reducing fixed precision from 3 decimals down to 1 decimal to ensure private values aren't too big, which would result in higher failure rate in our private comparison protocol. we have reported in table 5 our findings on network-1, which is pre-trained and then evaluated in a private fashion. table 5 : accuracy (in %) of network-1 given different precision and encoding spaces what we observe is that 3 decimals of precision is the most appropriate setting to have an optimal precision while allowing to slightly reduce the encoding space down to z 2 24 or z 2 28 . because this is not a massive gain and in order to keep the failure rate in comparison very low, we have kept z 2 32 for all our experiments. c implementation details our comparison protocol can fail if y + α wraps around and becomes negative. we can't act on α because it must be completely random to act as a perfect mask and to make sure the revealed x = y + α mod 2 n does not leak any information about y, but the smaller y is, the lower the error probability will be. [11] suggests a method which uses 2 invocations of the protocol to guarantee perfect correctness but because it incurs an important runtime overhead, we rather show that the failure rate of our comparison protocol is very small and is reasonable in contexts that tolerate a few mistakes, as in machine learning. 
more precisely, we quantify it on real world examples, namely on network-2 and on the 64×64 tiny imagenet version of vgg16, with a fixed precision of 3 decimals, and find respective failure rates of 1 in 4 millions comparisons and 1 in 100 millions comparisons. such error rates do not affect the model accuracy, as table 3 shows. figure 4 illustrates how maxpool uses ideas from matrix unrolling and argmax computation. notations present in the figure are consistent with the explanation of argmax using pairwise comparison in section 4.3. the m × m matrix is first unrolled to a m 2 × k 2 matrix. it is then expanded on k 2 layers, each of which each shifted by a step of 1. next, m 2 k 2 (k 2 − 1) pairwise comparisons are then applied simultaneously between the first layer and the other ones, and for each x i we sum the result of its k − 1 comparison and check if it equals k − 1. we multiply this boolean by x i and sum up along a line (like x 1 to x 4 in the figure) . last, we restructure the matrix back to its initial structure. in addition, when the kernel size k is 2, rows are only of length 4 and it can be more efficient to use a binary tree approach instead, i.e. compute the maximum of columns 0 and 1, 2 and 3 and the max of the result: it requires log 2 (k 2 ) = 2 rounds of communication and only approximately (k 2 − 1)(m/s) 2 comparisons, compared to a fixed 3 rounds and approximately k 4 (m/s) 2 . interestingly, average pooling can be computed locally on the shares without interaction because it only includes mean operations, but we didn't replace maxpool operations with average pooling to avoid distorting existing neural networks architecture. the batchnorm layer is the only one in our implementation which is a polynomial approximation. moreover, compared to [48] , the approximation is significantly coarser as we don't make any costly initial approximation and we reduce the number of iterations of the newton method from 4 to only 3. typical relative error can be up to 20% but as the primary purpose of batchnorm is to normalise data, having rough approximations here is not an issue and doesn't affect learning capabilities, as our experiments show. however, it is a limitation for using pre-trained networks: we observed on alexnet adapted to cifar-10 that training the model with a standard batchnorm and evaluating it with our approximation resulted in poor results, so we had to train it with the approximated layer. this section is taken almost verbatim from [48] . we select 4 datasets popularly used for training image classification models: mnist [33] , cifar-10 [30] , 64×64 tiny imagenet and 224×224 tiny imagenet [49] . mnist mnist [33] is a collection of handwritten digits dataset. it consists of 60,000 images in the training set and 10,000 in the test set. each image is a 28×28 pixel image of a handwritten digit along wit a label between 0 and 9. we evaluate network-1, network-2, and the lenet network on this dataset. cifar-10 cifar-10 [30] consists of 50,000 images in the training set and 10,000 in the test set. it is composed of 10 different classes (such as airplanes, dogs, horses etc.) and there are 6,000 images of each class with each image consisting of a colored 32×32 image. we perform private training of alexnet and inference of vgg16 on this dataset. tiny imagenet tiny imagenet [49] consists of two datasets of 100,000 training samples and 10,000 test samples with 200 different classes. the first dataset is composed of colored 64×64 images and we use it with alexnet and vgg16. 
the second is composed of colored 224×224 images and is used with resnet18. we have selected 6 models for our experimentations. network-1 a 3-layered fully-connected network with relu used in secureml [36] . network-2 a 4-layered network selected in minionn [34] with 2 convolutional and 2 fullyconnected layers, which uses maxpool in addition to relu activation. lenet this network, first proposed by lecun et al. [32] , was used in automated detection of zip codes and digit recognition. the network contains 2 convolutional layers and 2 fully connected layers. alexnet alexnet is the famous winner of the 2012 imagenet ilsvrc-2012 competition [31] . it has 5 convolutional layers and 3 fully connected layers and it can batch normalization layer for stability and efficient training. vgg16 vgg16 is the runner-up of the ilsvrc-2014 competition [46] . vgg16 has 16 layers and has about 138m parameters. resnet18 resnet18 [22] is the runner-up of the ilsvrc-2015 competition. it is a convolutional neural network that is 18 layers deep, and has 11.7m parameters. it uses batch normalisation and we're the first private deep learning framework to evaluate this network. model architectures of network-1 and network-2, together with lenet, and the adaptations for cifar-10 of alexnet and vgg16 are precisely depicted in appendix d of [48] . note that in the cifar-10 version alexnet, authors have used the version with batchnorm layers, and we have kept this choice. for the 64×64 tiny imagenet version of alexnet, we used the standard architecture from pytorch to have a pretrained network. it doesn't have batchnorm layers, and we have adapted the classifier part as illustrated in figure 5 . note also that we permute relu and maxpool where applicable like in [48] , as this is strictly equivalent in terms of output for the network and reduces the number of comparisons. more generally, we don't proceed to any alteration of the network behaviour except with the approximation on batchnorm. this improves usability of our framework as it allows to take a pre-trained neural network from a standard deep learning library like pytorch and to encrypt it generically with a single line of code. privacy-preserving machine learning: threats and solutions efficient multiparty protocols using circuit randomization optimizing semi-honest secure multiparty computation for the internet ngraph-he2: a high-throughput framework for neural network inference on encrypted data ngraph-he: a graph compiler for deep learning on homomorphically encrypted data sharemind: a framework for fast privacypreserving computations towards federated learning at scale: system design practical secure aggregation for privacy-preserving machine learning function secret sharing function secret sharing: improvements and extensions secure computation with preprocessing via function secret sharing high performance convolutional neural networks for document processing faster fully homomorphic encryption: bootstrapping in less than 0.1 seconds private image analysis with mpc. 
accessed 2019-11-01 multiparty computation from somewhat homomorphic encryption aby-a framework for efficient mixed-protocol secure two-party computation a survey of secure multiparty computation protocols for privacy preserving genetic tests model inversion attacks that exploit confidence information and basic countermeasures cryptonets: applying neural networks to encrypted data with high throughput and accuracy foundations of cryptography deep residual learning for image recognition accuracy and stability of numerical algorithms deep models under the gan: information leakage from collaborative deep learning chiron: privacy-preserving machine learning as a service {gazelle}: a low latency framework for secure neural network inference an efficient multi-party scheme for privacy preserving collaborative filtering for healthcare recommender system overdrive: making spdz great again federated learning: strategies for improving communication efficiency the cifar-10 dataset imagenet classification with deep convolutional neural networks gradient-based learning applied to document recognition mnist handwritten digit database oblivious neural network predictions via minionn transformations aby3: a mixed protocol framework for machine learning secureml: a system for scalable privacy-preserving machine learning an improved newton iteration for the generalized inverse of a matrix, with applications information technology-based tracing strategy in response to covid-19 in south korea-privacy controversies privacy-preserving contact tracing of covid-19 patients chameleon: a hybrid secure computation framework for machine learning applications deepsecure: scalable provably-secure deep learning imagenet large scale visual recognition challenge a generic framework for privacy preserving deep learning privacy-preserving deep learning very deep convolutional networks for large-scale image recognition securenn: efficient and private neural network training falcon: honest-majority maliciously secure framework for private deep learning tiny imagenet challenge how to generate and exchange secrets the cut-and-choose game and its application to cryptographic protocols we would like to thank geoffroy couteau, chloé hébant and loïc estève for helpful discussions throughout this project. we are also grateful for the long-standing support of the openmined community and in particular its dedicated cryptography team, including yugandhar tripathi, s p sharan, george-cristian muraru, muhammed abogazia, alan aboudib, ayoub benaissa, sukhad joshi and many others.this work was supported in part by the european community's seventh framework programme (fp7/2007-2013 grant agreement no. 339563 -cryptocloud) and by the french project fui anblic. the computing power was graciously provided by the french company arkhn. key: cord-186031-b1f9wtfn authors: caldarelli, guido; nicola, rocco de; petrocchi, marinella; pratelli, manuel; saracco, fabio title: analysis of online misinformation during the peak of the covid-19 pandemics in italy date: 2020-10-05 journal: nan doi: nan sha: doc_id: 186031 cord_uid: b1f9wtfn during the covid-19 pandemics, we also experience another dangerous pandemics based on misinformation. narratives disconnected from fact-checking on the origin and cure of the disease intertwined with pre-existing political fights. 
we collect a database on twitter posts and analyse the topology of the networks of retweeters (users broadcasting again the same elementary piece of information, or tweet) and validate its structure with methods of statistical physics of networks. furthermore, by using commonly available fact checking software, we assess the reputation of the pieces of news exchanged. by using a combination of theoretical and practical weapons, we are able to track down the flow of misinformation in a snapshot of the twitter ecosystem. thanks to the presence of verified users, we can also assign a polarization to the network nodes (users) and see the impact of low-quality information producers and spreaders in the twitter ecosystem. propaganda and disinformation have a history as long as mankind, and the phenomenon becomes particularly strong in difficult times, such as wars and natural disasters. the advent of the internet and social media has amplified and made faster the spread of biased and false news, and made targeting specific segments of the population possible [7] . for this reason the vice-president of the european commission with responsibility for policies on values and transparency, vȇra yourová, announced, beginning of june 2020, a european democracy action plan, expected by the end of 2020, in which web platforms admins will be called for greater accountability and transparency, since 'everything cannot be allowed online' [16] . manufacturers and spreaders of online disinformation have been particularly active also during the covid-19 pandemic period (e.g., writing about bill gates role in the pandemics or about masks killing children [2, 3] ). this, alongside the real pandemics [17] , has led to the emergence of a new virtual disease: covid-19 infodemics. in this paper, we shall consider the situation in italy, one of the most affected countries in europe, where the virus struck in a devastating way between the end of february and the end of april [1] . in such a sad and uncertain time, propaganda [1] in italy, since the beginning of the pandemics and at time of writing, almost 310k persons have contracted the covid-19 virus: of these, more than 35k have died. source: http://www.protezionecivile.gov.it/. accessed september 28, 2020. has worked hard: one of the most followed fake news was published by sputnik italia receiving 112,800 likes, shares and comments on the most popular social media. 'the article falsely claimed that poland had not allowed a russian plane with humanitarian aid and a team of doctors headed to italy to fly over its airspace', the ec vice-president yourová said. actually, the studies regarding dis/mis/information diffusion on social media seldom analyse its effective impact. in the exchange of messages on online platforms, a great amount of interactions do not carry any relevant information for the understanding of the phenomenon: as an example, randomly retweeting viral posts does not contribute to insights on the sharing activity of the account. for determining dis/misinformation propagation two main weapons can be used, the analysis of the content (semantic approach) and the analysis of the communities sharing the same piece of information (topological approach). while the content of a message can be analysed on its own, the presence of some troublesome structure in the pattern of news producer and spreaders (i.e., in the topology of contacts) can be detected only trough dedicated instruments. 
indeed, for real in-depth analyses, the properties of the real system should be compared with a proper null model. recently, entropy-based null models have been successfully employed to filter out random noise from complex networks and focus the attention on non trivial contributions [10, 26] . essentially, the method consists in defining a 'network benchmark' that has some of the (topological) properties of the real system, but is completely random for all the rest. then, every observation that does not agree with the model, i.e., cannot be explained by the topological properties of the benchmark, carries non trivial information. notably, being based on the shannon entropy, the benchmark is unbiased by definition. in the present paper, using entropy-based null-models, we analyse a tweet corpus related to the italian debate on covid-19 during the two months of maximum crisis in italy. after cleaning the system from the random noise, by using the entropy-based null-model as a filter, we have been able to highlight different communities. interestingly enough, these groups, beside including several official accounts of ministries, health institutions, and -online and offline -newspapers and newscasts, encompass four main political groups. while at first sight this may sound surprising -the pandemic debate was more on a scientific than on a political ground, at least in the very first phase of its abrupt diffusion -, it might be due to pre-existing echo chambers [18] . the four political groups are found to perform completely different activities on the platform, to interact differently from each other, and to post and share reputable and non reputable sources of information with great differences in the number of their occurrences. in particular, the accounts from the right wing community interact, mainly in terms of retweets, with the same accounts who interact with the mainstream media. this is probably due to the strong visibility given by the mainstream media to the leaders of that community. moreover, the right wing community is more numerous and more active, even relatively to the number of accounts involved, than the other communities. interestingly enough, newly formed political parties, as the one of the former italian prime minister matteo renzi, quickly imposed their presence on twitter and on the online political debate, with a strong activity. furthermore, the different political parties use different sources for getting information on the spreading on the pandemics. to detect the impact of dis/misinformation in the debate, we consider the news sources shared among the accounts of the various groups. with a hybrid annotation approach, based on independent fact checking organisations and human annotation, we categorised such sources as reputable and non reputable (in terms of credibility of the published news and the transparency of the sources). notably, we experienced that a group of accounts spread information from non reputable sources with a frequency almost 10 times higher than that of the other political groups. and we are afraid that, due to the extent of the online activity of the members of this community, the spreading of such a volume of non reputable news could deceit public opinion. we collected circa 4.5m tweets in italian language, from february 21 st to april 20 th 2020 [2] . details about the political situation in italy during the period of data collection can be found in the supplementary material, section 1.1: 'evolution of the covid-19 pandemics in italy'. 
the data collection was keyword-based, with keywords related the covid-19 pandemics. twitter's streaming api returns any tweet containing the keyword(s) in the text of the tweet, as well as in its metadata. it is worth noting that it is not always necessary to have each permutation of a specific keyword in the tracking list. for example, the keyword 'covid' will return tweets that contain both 'covid19' and 'covid-19'. table 1 lists a subset of the considered keywords and hashtags. there are some hashtags that overlap due to the fact that an included keyword is a sub-string of another one, but we included both for completeness. the left panel of fig. 1 shows the network obtained by following the projection procedure described in section 5.1. the network resulting from the projection procedure will be called, in the rest of the paper, validated network. the term validated should not be confused with the term verified, which instead denotes a twitter user who has passed the formal authentication procedure by the social platform. in order to get the community of verified twitter users, we applied the louvain algorithm [5] to the data in the validated network. such an algorithm, despite being one of the most popular, is also known to be order dependent [19] . to get rid of this bias, we apply it iteratively n times (n being the number of the nodes) after reshuffling the order of the nodes. finally, we select the partition with the highest modularity. the network presents a strong community structure, composed by four main subgraphs. when analysing the emerging 4 communities, we find that they correspond to 1 right wing parties and media (in steel blue) 2 center left wing (dark red) 3 5 stars movement (m5s ), in dark orange 4 institutional accounts (in sky blue) details about the political situation in italy during the period of data collection can be found in the supplementary material, section 1.2: 'italian political situation during the covid-19 pandemics'. this partition in four subgroups, once examined in more details, presents a richer substructure, described in the right panel of fig. 1 . starting from the center-left wing, we can find a darker red community, including various ngos and various left oriented journalists, vips and pundits. a slightly lighter red sub-community turns out to be composed by the main politicians of the italian democratic party (pd), as well as by representatives from the european parliament (italian and others) and some eu commissioners. the violet red group is mostly composed by the representatives of italia viva, a new party founded by the former italian prime minister matteo renzi (december 2014 -february 2016). in golden red we can find the subcommunity of catholic and vatican groups. finally the dark violet red and light tomato subcommunities consist mainly of journalists. in turn, also the orange (m5s) community shows a clear partition in substructures. in particular, the dark orange subcommunity contains the accounts of politicians, parliament representatives and ministers of the m5s and journalists. in aquamarine, we can find the official accounts of some private and public, national and international, health institutes. finally, in the light slate blue subcommunity we can find various italian ministers as well as the italian police and army forces. similar considerations apply to the steel blue community. in steel blue, the subcommunity of center right and right wing parties (as forza italia, lega and fratelli d'italia). 
in the following, this subcommunity is going to be called as fi-l-fdi, recalling the initials of the political parties contributing to this group. the sky blue subcommunity includes the national federations of various sports, the official accounts of athletes and sport players (mostly soccer) and their teams. the teal subcommunity contains the main italian news agencies. in this subcommunity there are also the accounts of many universities. the firebrick subcommunity contains accounts related to the as roma football club; analogously in dark red official accounts of ac milan and its players. the slate blue subcommunity is mainly composed by the official accounts of radio and tv programs of mediaset, the main private italian broadcasting company. finally, the sky blue community is mainly composed by italian embassies around the world. for the sake of completeness, a more detailed description of the composition of the subcommunities in the right panel of figure 1 is reported in the supplementary material, section 1.3: 'composition of the subcommunities in the validated network of verified twitter users'. here, we report a series of analyses related to the domain names, hereafter simply called domains, that mostly appear in all the tweets of the validated network of verified users. the domains have been tagged according to their degree of credibility and transparency, as indicated by the independent software toolkit newsguard https://www.newsguardtech.com/. the details of this procedure are reported below. as a first step, we considered the network of verified accounts, whose communities and sub-communities are shown in fig. 1 . on this topology, we labelled all domains that had been shared at least 20 times (between tweets and retweets). table 2 shows the tags associated to the domains. in the rest of the paper, we shall be interested in quantifying reliability of news sources publishing during the period of interest. thus, for our analysis, we will not consider those sources corresponding to social networks, marketplaces, search engines, institutional sites, etc. tags r, ∼ r and nr in table 2 are used only for news sites, be them newspapers, magazines, tv or radio social channels, and they stand for reputable, quasi reputable, not reputable, respectively. label unc is assigned to those domains with less than 20 occurrences in ours tweets and rewteets dataset. in fact, the labeling procedure is a hybrid one. as mentioned above, we relied on newsguard, a plugin resulting from the joint effort of journalists and software table 2 tags used for labeling the domains developers aiming at evaluating news sites according to nine criteria concerning credibility and transparency. for evaluating the credibility level, the metrics consider whether the news source regularly publishes false news, does not distinguish between facts and opinions, does not correct a wrongly reported news. for transparency, instead, the tool takes into account whether owners, founders or authors of the news source are publicly known; and whether advertisements are easily recognizable [3] . after combining the individual scores obtained out of the nine criteria, the plugin associates to a news source a score from 1 to 100, where 60 is the minimum score for the source to be considered reliable. when reporting the results, the plugin provides details about the criteria which passed the test and those that did not. 
in order to have a sort of no-man's land and not to be too abrupt in the transition between reputability and non-reputability, when the score was between 55 and 65, we considered the source to be quasi reputable, ∼r. it is worth noting that not all the domains in the dataset under investigation were evaluated by newsguard at the time of our analysis. for those not evaluated automatically, the annotation was made by three tech-savvy researchers, who assessed the domains by using the same criteria as newsguard. table 3 gives statistics about number and kind of tweets (tw = pure tweet; rt = retweet), the number of url and distinct url (dist url), the number of domains and users in the validated network of verified users. we clarify what we mean by these terms with an example: a domain for us corresponds to the so-called 'second-level domain' name [4] , i.e., the name directly to the left of .com, .net, and any other top-level domains. for instance, repubblica.it, corriere.it, nytimes.com are considered domains by us. instead, the url maintains here its standard definition [5] and an example is http://www.example.com/index.html. table 4 shows the outcome of the domains annotation, according to the scores of newsguard or to those assigned by the three annotators, when scores were no available from newsguard. at a first glance, the majority of the news domains belong to the reputable category. the second highest percentage is the one of the untagged domains -unc. in fact, in our dataset there are many domains that occur only few times once. for example, there are 300 domains that appear in the datasets only once. fig. 2 shows the trend of the number of tweets and retweets, containing urls, posted by the verified users of the validated projection during the period of data [3] newsguard rating process: https://www.newsguardtech.com/ratings/rating-process-criteria/ [4] https://en.wikipedia.org/wiki/domain_name [5] table 4 annotation results over all the domains in the whole dataset -validated network of verified users. in [9] . going on with the analysis, table 5 shows the percentage of the different types of domains for the 4 communities identified in the left plot of fig. 1 . it is worth observing that the steel blue community (both politicians and media) is the most active one, even if it is not the most represented: the number of users is lower than the one of the center left community (the biggest one, in terms of numbers), but the number of their posts containing a valid url is almost the double of that of the second more active community. interestingly, the activity of the verified users of the steel blue community is more focused on content production of (see the only tweets sub-table) than in sharing (see the only retweets sub-table). in fact, retweets represent almost 14.6% of all posts from the media and the right wing community, while in the case of the center-left community it is 34.5%. this effect is observable even in the average only tweets post per verified user: a right-wing user and a media user have an average of 88.75 original posts, against 34.27 for center-left-wing users. these numbers are probably due to the presence in the former community of the italian most accessed media. they tend to spread their (original) pieces of news on the twitter platform. interestingly, the presence of urls from a non reputable source in the steel blue community is more than 10 times higher than the second score in the same field in the case of original tweets (only tweets). 
it is worth noting that, for the case of the dark orange and sky blue communities, which are smaller both in terms of users and number of posts, the presence of non classified sources is quite strong (it represents nearly 46% of retweeted posts for both the communities), as it is the frequency of posts linking to social network contents. interestingly enough, the verified users of both groups seem to focus slightly more on the same domains: there are, on average, 1.59 and 1.80 posts for each url domain respectively for the dark orange and sky blue communities, and, on average, 1.26 and 1.34 posts for the steel blue and the dark red communities. the right plot in fig. 1 report a fine grained division of communities: the four largest communities have been further divided into sub-communities, as mentioned in subsection 3.1. here, we focus on the urls shared in the purely political sub-communities in table 7 . broadly speaking, we examine the contribution of the different political parties, as represented on twitter, to the spread of mis/disinformation and propaganda. table 7 clearly shows how the vast majority of the news coming from sources considered scarce or non reputable are tweeted and retweeted by the steel blue political sub-community (fi-l-fdi). notably, the percentage of non reputable sources shared by the fi-l-fdi accounts is more than 4 times the percentage of their community (the steel blue one) and it is more than 30 times the second community in the nr ratio ranking. for all the political sub-communities the incidence of social network links is much higher than in their original communities. looking at table 8 , even if the number of users in each political sub-community is much smaller, some peculiar behaviours can be still be observed. again, the center-right and right wing parties, while representing the least represented ones in terms of users, are much more active than the other groups: each (verified) user is responsible, on average of almost 81.14 messages, while the average is 23.96, 22.12 and 15.29 for m5s, iv and pd, respectively. it is worth noticing that italia viva, while being a recently founded party, is very active; moreover, for them the frequency of quasi reputable sources is quite high, especially in the case of only tweets posts. the impact of uncategorized sources is almost constant for all communities in the retweeting activity, while it is particularly strong for the m5s. finally, the posts by the center left communities (i.e., italia viva and the democratic party) tend to have more than one url. specifically, every post containing at least a url, has, on average, 2.05 and 2.73 urls respectively, against the 1.31 of movimento 5 stelle and 1.20 for the center-right and right wing parties. to conclude the analysis on the validated network of verified users, we report statistics about the most diffused hashtags in the 4 political sub-communities. fig. 3 focuses on wordclouds, while fig. 4 reports the data under an histograms form. actually, from the various hashtags we can derive important information regarding the communications of the various political discursive communities and their position towards the management of the pandemics. first, it has to be noticed that the m5s is the greatest user of hashtags: their two most used hashtags have been used almost twice the most used hashtags used by the pd, for instance. 
this heavy usage is probably due to the presence in this community of journalists and of the official account of il fatto quotidiano, a newspaper explicitly supporting the m5s: indeed, the first two hashtags are "#ilfattoquotidiano" and "#edicola" (kiosk, in italian). it is interesting to see the relative importance of hashtags intended to encourage the population during the lockdown: it is the case of "#celafaremo" (we will make it), "#iorestoacasa" (i am staying home), "#fermiamoloinsieme" (let's stop it together ): "#iorestoacasa" is present in every community, but it ranks 13th in the m5s verified user community, 29th in the fi-l-fdi community, 2nd in the italia viva community and 10th in the pd one. remarkably, "#celafaremo" is present only in the m5s group, as "#fermiamoloinsieme" can be found in the top 30 hashtags only in the center-right and right wing cluster. the pd, being present in various european institutions, mentions more european related hashtags ("#europeicontrocovid19", europeans against covid-19 ), in order to ask for a common reaction of the eu. the center-right and right wing community has other hashtags as "#forzalombardia" (go, lombardy! ), ranking the 2nd, and "#fermiamoloinsieme", ranking 10th. what is, nevertheless, astonishing, is the presence among the most used hashtags of all communities of the name of politicians from the same group ('interestingly '#salvini" is the first used hashtag in the center right and right wing community, even if he did not perform any duty in the government), tv programs ("#mattino5", "#lavitaindiretta", "#ctcf", "#dimartedì"), as if the main usage of hashtags is to promote the appearance of politicians in tv programs. finally, the hashtags used by fi-l-fdi are mainly used to criticise the actions of the government, e.g., "#contedimettiti" (conte, resign! ). fig. 5 shows the structure of the directed validated projection of the retweet activity network, as outcome of the procedure recalled in section 3 of the supplementary material. as mentioned in section 4 of the supplementary material, the affiliation of unverified users has been determined using the tags obtained by the validated projected network of the verified users, as immutable label for the label propagation of [23] . after label propagation, the representation of the political communities in the validated retweet network changes dramatically with respect to the case of the network of verified users: the center-right and right wing community is the most represented community in the whole network, with 11063 users (representing 21.1% of all the users in the validated network), followed by italia viva users with 8035 accounts (15.4% of all the accounts in the validated network). the impact of m5s and pd is much more limited, with, respectively, 3286 and 564 accounts. it is worth noting that this result is unexpected, due to the recent formation of italia viva. as in our previous study targeting the online propaganda [8] , we observe that the most effective users in term of hub score [21] are almost exclusively from the center-right and right wing party: considering the first 100 hubs, only 4 are not from this group. interestingly, 3 out of these 4 are verified users: roberto burioni, one of the most famous italian virologists, ranking 32nd, agenzia ansa, a popular italian news agency, ranking 61st, and tgcom24, the popular newscast of a private tv channel, ranking 73rd. 
the fourth account is an online news website, ranking 88th: this is a not verified account which belongs to a not political community. remarkably, in the top 5 hubs we find 3 of the top 5 hubs already found when considered the online debate on migrations from northern africa to italy [8] : in particular, a journalist of a neo-fascist online newspaper (non verified user), an extreme right activist (non verified user) and the leader of fratelli d'italia giorgia meloni (verified user), who ranks 3rd in the hub score. matteo salvini (verified user), who was the first hub in [8] , ranks 9th, surpassed by his party partner claudio borghi, ranking 6th. the first hub in the present network is an extreme right activist, posting videos against african migrants to italy and accusing them to be responsible of the contagion and of violating lockdown measures. table 9 shows the annotation results of all the domains tweeted and retweeted by users in the directed validated network. the numbers are much higher than those shown in table 2 , but the trend confirms the previous results. the majority of urls traceable to news sources are considered reputable. the number of unclassified domains is higher too. in fact, in this case, the annotation was made considering the domains occurring at least 100 times. table 9 annotation results over all the domains -directed validated network table 10 reports statistics about posts, urls, distinct urls, users and verified users in the directed validated network. noticeably, by comparing these numbers with those of table 3 , reporting statistics about the validated network of verified users, we can see that here the number of retweets is much more higher, and the trend is the opposite: verified users tend to tweet more than retweet (46277 vs 17190), while users in the directed validated network, which comprehends also non verified users, have a number of retweets 3.5 times higher than the number of their tweets. fig. 6 shows the trend of the number of tweets containing urls over the period of data collection. since we are analysing a bigger network than the one considered in section 3.2, we have numbers that are one order of magnitude greater than those shown in fig. 2 ; the highest peak, after the discovery of the first cases in lombardy, corresponds to more than 68,000 posts containing urls, whereas the analogous peak in fig. 2 corresponds to 2,500 posts. apart from the order of magnitudes, the two plots feature similar trends: higher traffic before the beginning of the italian lockdown, and a settling down as the quarantine went on [6] . table 11 shows the core of our analysis, that is, the distribution of reputable and non reputable news sources in the direct validated network, consisting of both verified and non-verified users. again, we focus directly on the 4 political sub-communities identified in the previous subsection. two of the sub-communities are part of the center-left wing community, one is associated to the 5 stars movement, the remaining one represents center-right and right wing communities. in line with previous results on the validated network of verified users, the table clearly shows how the vast majority of the news coming from sources considered scarce or non reputable are tweeted and retweeted by the center-right and right wing communities; 98% of the domains tagged as nr are shared by them. 
as shown in table 12 , the activity of fi-l-fdi users is again extremely high: on average there are 89.3 retweets per account in this community, against the 66.4 of m5s, the 48.4 of iv and the 21.8 of pd. the right wing contribution to the debate is extremely high, even in absolute numbers, due to the the large number of users in this community. it is worth mentioning that the frequency of non reputable sources in this community is really high (at about 30% of the urls in the only tweets) and comparable with that of the reputable ones (see table 11 , only [6] the low peaks for february 27 and march 10 are due to an interruption in the data collection, caused by a connection breakdown. table 11 domains annotation per political sub-communities -directed validated network tweets). in the other sub-communities, pd users are more focused on un-categorised sources, while users from both italia viva and movimento 5 stelle are mostly tweeting and retweeting reputable news sources. and users, but also in absolute numbers: out of the over 1m tweets, more than 320k tweets refer to a nr url. actually, the political competition still shines through the hashtag usage even for the other communities: it is the case, for instance, of italia viva. in the top 30 hashtags we can find '#salvini', '#lega', but also '#papeete' [7] , '#salvinisciacallo' (salvini jackal ) and '#salvinimmmerda' (salvini asshole). on the other hand, in italia viva hashtags supporting the population during the lockdown are used: '#iorestoacasa', '#restoacasa' (i am staying home), '#restiamoacasa' (let's stay home). criticisms towards the management of lombardy health system during the pandemics can be deduced from the hashtag '#commissariamtelalombardia' (put lombardy under receivership) and '#fontana' (the lega administrator of the lombardy region). movimento 5 stelle has the name of the main leader of the opposition '#salvini', as first hashtag and supports criticisms to the lombardy administration with the hashtags '#fontanadimettiti' (fontana, resign! ) and '#gallera', the health and welfare minister of the lombardy region, considered the main responsible for the bad management of the pandemics. nevertheless, it is possible to highlight even some hashtags encouraging the population during the lock down, as the above mentioned '#iorestoacasa', '#restoacasa' and '#restiamoacasa'. it is worth mentioning that the government measures, and the corresponding m5s campaigns, are accompanied specific hashtags: '#curaitalia' is the name of one of the decree of the prime minister to inject liquidity in the italian economy, '#acquistaitaliano' (buy italian products! ), instead, advertise italian products to support the national economy. as a final task, over the whole set of tweets produced or shared by the users in the directed validated network, we counted the number of times a message containing a url was shared by users belonging to different political communities, although without considering the semantics of the tweets. namely, we ignored whether the urls were shared to support or to oppose the presented arguments. table 14 shows the most tweeted (and retweeted) nr domains shared by the political communities presented in table 7 , the number of occurrences is reported next to each domain. 
the first nr domains for fi-l-fdi in table 14 are related to the right, extreme right and neo-fascist propaganda, as it is the case of imolaoggi.it, ilprimatonazionale.it and voxnews.info, recognised as disinformation websites by newsguard and by the two main italian debunker websites, bufale.net and butac.it. as shown in the table, some domains, although in different number of occurrences, are present under more than one column, thus shared by users close to different political communities. this could mean, for some subgroups of the community, a retweet with the aim of supporting the opinions expressed in the original tweets. however, since the semantics of the posts in which these domains are present were not investigated, the retweets of the links by more than one political community could be due to contrast, and not to support, the opinions present in the original posts. despite the fact that the results were achieved for a specific country, we believe that the applied methodology is of general interest, being able to show trends and peculiarities whenever information is exchanged on social networks. in particular, when analysing the outcome of our investigation, some features attracted our attention: 1 persistence of clusters wrt different discussion topics: in caldarelli et al. [8] , we focused on tweets concerned with immigration, an issue that has been central in the italian political debate for years. here, we discovered that the clusters and the echo chambers that have been detected when analysing tweets about immigration are almost the same as those singled out when considering discussions concerned with covid-19. this may seem surprising, because a discussion about covid-19 may not be exclusively political, but also medical, social, economic, etc.. from this we can argue that the clusters are political in nature and, even when the topic of discussion changes, users remain in their cluster on twitter. (indeed, journalists and politicians use twitter for information and political propaganda, respectively). the reasons political polarisation and political vision of the world affect so strongly also the analysis of what should be an objective phenomenon is still an intriguing question. 2 persistence of online behavioral characteristics of clusters: we found that the most active, lively and penetrating online communities in the online debate on covid-19 are the same found in [8] , formed in a almost purely political debate such as the one represented by the right of migrants to land on the italian territory. 3 (dis)similarities amongst offline and online behaviours of members and voters of parties: maybe less surprisingly, the political habits is also reflected in the degree of participation to the online discussions. in particular, among the parties in the centre-left-wing side, a small party (italia viva) shows a much more effective social presence than the larger party of the italian centre-left-wing (partito democratico), which has many more active members and more parliamentary representation. more generally, there is a significant difference in social presence among the different political parties, and the amount of activity is not at all proportional to the size of the parties in terms of members and voters. 
4 spread of non reputable news sources: in the online debate about covid-19, many links to non reputable (defined such by newsguard, a toolkit ranking news website based on criteria of transparency and credibility, led by veteran journalists and news entrepreneurs) news sources are posted and shared. kind and occurrences of the urls vary with respect to the corresponding political community. furthermore, some of the communities are characterised by a small number of verified users that corresponds to a very large number of acolytes which are (on their turn) very active, three times as much as the acolytes of the opposite communities in the partition. in particular, when considering the amount of retweets from poorly reputable news sites, one of the communities is by far (one order of magnitude) much more active than the others. as noted already in our previous publication [8] , this extra activity could be explained by a more skilled use of the systems of propaganda -in that case a massive use of bot accounts and a targeted activity against migrants (as resulted from the analysis of the hub list). our work could help in steering the online political discussion around covid-19 towards an investigation on reputable information, while providing a clear indication of the political inclination of those participating in the debates. more generally, we hope that our work will contribute to finding appropriate strategies to fight online misinformation. while not completely unexpected, it is striking to see how political polarisation affects also the covid-19 debate, giving rise to on-line communities of users that, for number and structure, almost closely correspond to their political affiliations. this section recaps the methodology through which we have obtained the communities of verified users (see section 3.1). this methodology has been designed in saracco et al. [25] and applied in the field of social networks for the first time in [4, 8] . for the sake of completeness, the supplementary material, section 3, recaps the methodology through which we have obtained the validated retweet activity network shown in section 3.3. in section 4 of the supplementary material, the detection of the affiliation of unverified users is described. in the supplementary material, the interested reader will also find additional details about 1) the definition of the null models (section 5); 2) a comparison among various label propagation for the political affiliation of unverified users (section 6); and 3) a brief state of the art on fact checking organizations and literature on false news detection (section 7). many results in the analysis of online social networks (osn) shows that users are highly clustered in group of opinions [1, 11-15, 22, 28, 29] ; indeed those groups have some peculiar behaviours, as the echo chamber effects [14, 15] . following the example of references [4, 8] , we are making use of this users' clustering in order to detect discursive community, i.e. groups of users interacting among themselves by retweeting on the same (covid-related) subjects. remarkably, our procedure does not follow the analysis of the text shared by the various users, but is simply related on the retweeting activity among users. in the present subsection we will examine how the discursive community of verified twitter users can be extracted. on twitter there are two distinct categories of accounts: verified and unverified users. 
verified users have a thick close to the screen name: the platform itself, upon request from the user, has a procedure to check the authenticity of the account. verified accounts are owned by politicians, journalists or vips in general, as well as the official accounts of ministers, newspapers, newscasts, companies and so on; for those kind of users, the verification procedure guarantees the identity of their account and reduce the risk of malicious accounts tweeting in their name. non verified accounts are for standard users: in this second case, we cannot trust any information provided by the users. the information carried by verified users has been studied extensively in order to have a sort of anchor for the related discussion [4, 6, 8, 20, 27] to detect the political orientation we consider the bipartite network represented by verified (on one layer) and unverified (on the other layer) accounts: a link is connecting the verified user v with the unverified one u if at least one time v was retweeted by u, or viceversa. to extract the similarity of users, we compare the commonalities with a bipartite entropy-based null-model, the bipartite configuration model (bicm [24] ). the rationale is that two verified users that share many links to same unverified accounts probably have similar visions, as perceived by the audience of unverified accounts. we then apply the method of [25] , graphically depicted in fig. 8 , in order to get a statistically validated projection of the bipartite network of verified and unverified users. in a nutshell, the idea is to compare the amount of common linkage measured on the real network with the expectations of an entropy-based null model fixing (on average) the degree sequence: if the associated p-value is so low that the overlaps cannot be explained by the model, i.e. such that it is not compatible with the degree sequence expectations, they carry non trivial information and we project the related information on the (monopartite) projection of verified users. the interested reader can find the technical details about this validated projection in [25] and in the supplementary information. the data that support the findings of this study are available from twitter, but restrictions apply to the availability of these data, which were used under license 1 italian socio-political situation during the period of data collection in the present subsection we present some crucial facts for the understanding of the social context in which our analysis is set. this subsection is divided into two parts: the contagion evolution and the political situation. these two aspects are closely related. a first covid-19 outbreak was detected in codogno, lodi, lombardy region, on february, 19th [1] . in the very next day, two cases were detected in vò, padua, veneto region. on february, 22th, in order to contain the contagions, the national government decided to put in quarantine 11 municipalities, 10 in the area around lodi and vò, near padua [2] . nevertheless, the number of contagions raised to 79, hitting 5 different regions; one of the infected person in vò died, representing the first registered italian covid-19 victim [3] . on february, 23th there were already 229 confirmed cases in italy. 
the first lockdown should have lasted until the 6th of march, but due to the still increasing number of contagions in northern italy, the italian prime minister giuseppe conte intended to extend the quarantine zone to almost all the northern italy on sunday, march 8th [4] : travel to and from the quarantine zone were limited to case of extreme urgency. a draft of the decree announcing the expansion of the quarantine area appeared on the website of the italian newspaper corriere della sera on the late evening of saturday, 7th, causing some panic in the interested areas [5] : around 1000 people, living in milan, but coming from southern regions, took trains and planes to reach their place of [1] prima lodi, ""paziente 1", il merito della diagnosi va diviso... per due", 8th june 2020 [2] italian gazzetta ufficiale, "decreto-legge 23 febbraio 2020, n. 6". the date is intended to be the very first day of validity of the decree. [3] il fatto quotidiano, "coronavirus,è morto il 78enne ricoverato nel padovano. 15 contagiati in lombardia, un altro in veneto", 22nd february 2020. [4] bbc news, "coronavirus: northern italy quarantines 16 million people", 8th march 2020" [5] the guardian, "leaked coronavirus plan to quarantine 16m sparks chaos in italy", 8th march 2020 origins [6] [7] . in any case, the new quarantine zone covered the entire lombardy and partially other 4 regions. remarkably, close to bergamo, lombardy region, a new outbreak was discovered and the possibility of defining a new quarantine area on march 3th was considered: this opportunity was later abandoned, due to the new northern italy quarantine zone of the following days. this delay seems to have caused a strong increase in the number of contagions, making the bergamo area the most affected one, in percentage, of the entire country [8] ; at time of writing, there are investigations regarding the responsibility of this choice. on march, 9th, the lockdown was extended to the whole country, resulting in the first country in the world to decide for national quarantine [9] . travels were restricted to emergency reason or to work; all business activities that were not considered as essentials, as pharmacies and supermarkets, had to be closed. until the 21st of march lockdown measures became progressively stricter all over the country. starting from the 14th of april, some retails activities as children clothing shops, reopened. a first fall in the number of deaths was observed on the 20th of april [10] . a limited reopening started with the so-called "fase 2" (phase 2 ) on the 4th of may [11] . from the very first days of march, the limited capacity of the intensive care departments to take care of covid-infected patients, took to the necessity of a re-organization of italian hospitals, leading, e.g., to the opening of new intensive care departments [12] . moreover, new communication forms with the relatives of the patients were proposed, new criteria for the intubating patients were developed, and, in the extreme crisis, in the most infected cases, the emergency management took to give priority to the hospitalisation to patients with a higher probability to recover [13] . outbreaks were mainly present in hospitals [19] . unfortunately, healthcare workers were contaminated by the covid [14] . this contagion resulted in a relative high number of fatalities: by the 22nd of april, 145 covid deaths were registered among doctors. 
due to the pressure on the intensive care capacity, even the healthcare personnel was subject to extreme stress, especially in the most affected zones [15] . on august 8th, 2019, the leader of lega, the main italian right wing party, announced to negate the support to the government of giuseppe conte, which was formed after a post-election coalition between the renzi formed a new center-left party, italia viva (italy alive, iv), due to some discord with pd; despite the scission, italia viva continued to support the actual government, having some of its representatives among the ministers and undersecretaries, but often marking its distance respect to both pd and m5s. due to the great impact that matteo salvini and giorgia meloni -leader of fratelli d'italia, a right wing party-have on social media, they started a massive campaign against the government the day after its inauguration. the regions of lombardy, veneto, piedmont and emilia-romagna experienced the highest number of contagions during the pandemics; among those, the former 3 are administrated by the right and center-right wing parties, the fourth one by the pd. the disagreement in the management of the pandemics between regions and the central government was the occasion to exacerbate the political debate (in italy, regions have a quite wide autonomy for healthcare). the regions administrated by the right wing parties criticised the centrality of the decisions regarding the lock down, while the national government criticises the health management (in lombardy the healthcare system has a peculiar organisation, in which the private sector is supported by public funding) and its non effective measure to reduce the number of contagions. the debate was ridden even at a national level: the opposition criticized the financial origin of the support to the various economic sectors. moreover, the role of the european union in providing funding to recover italian economics after the pandemics was debated. here, we detail the composition of the communities shown in figure 1 of the main text. we remind the reader that, after applying the leuven algorithm to the validated network of verified twitter users, we could observe 4 main communities, that correspond to 1 right wing parties and media (in steel blue) 2 center left wing (dark red) 3 5 stars movement (m5s ), in dark orange 4 institutional accounts (in sky blue) starting from the center-left wing, we can find a darker red community, including various ngos (the italian chapters of unicef, medecins sans frontieres, action aid, emergency, save the children, etc.), various left oriented journalists, vips and pundits [16] . finally, we can find in this group political movements ('6000sardine') and politicians on the left of pd (as beppe civati, pietro grasso, ignazio marino) or on the left current of the pd (laura boldrini, michele emiliano, stefano bonaccini). a slightly lighter red sub-community turns out to be composed by the main politicians of the italian democratic party (pd), as well as by representatives from the european parliament (italian and others) and some eu commissioners. the violet red group is mostly composed by the representatives of the newly founded italia viva, by the former italian prime minister matteo renzi (december 2014 -february 2016) and former secretary of pd. in golden red we can find the subcommunity of catholic and vatican groups. finally the dark violet red and light tomato subcommunities are composed mainly by journalists. 
interestingly enough, the dark violet red contains also accounts related to the city of milan (the major, the municipality, the public services account) and to the spoke person of the chinese minister of foreign affair. in turn, also the orange (m5s) community shows a clear partition in substructures. in particular, the dark orange subcommunity contains the accounts of politicians, parliament representatives and ministers of the m5s and journalists and the official account of il fatto quotidiano, a newspaper supporting the movement 5 stars. interestingly, since one of the main leaders of the movement, luigi di maio, is also the italian minister of foreign affairs, we can find in this subcommunity also the accounts of several italian embassies around the world, as well as the account of the italian representatives at nato, ocse and oas. in aquamarine, we can find the official accounts of some private and public, national and international, health institutes (as the italian istituto superiore di sanità, literally the italian national institute of health, the world health organization, the fondazione veronesi) the minister of health roberto speranza, and some foreign embassies in italy. finally, in the light slate blue subcommunity we can find various italian ministers as well as the italian police and army forces. similar considerations apply to the steel blue community. in steel blue, the subcommunity of center right and right wing parties (as forza italia, lega and fratelli d'italia). the presidents of the regions of lombardy, veneto and liguria, administrated by center right and right wing parties, can be found here. (in the following this subcommunity is going to be called as fi-l-fdi, recalling the initials of the political parties contributing to this group.) the sky blue subcommunity includes the national federations of various sports, the official accounts of athletes and sport players (mostly soccer) and their teams, as well as sport journals, newscasts and journalists. the teal subcommunity contains the main italian news agencies, some of the main national and local newspapers, [16] as the cartoonists makkox and vauro, the singers marracash, frankiehinrg, ligabue and emphil volo vocal band, and journalists from repubblica (ezio mauro, carlo verdelli, massimo giannini), from la7 tv channel (ricardo formigli, diego bianchi). newscasts and their journalists. in this subcommunity there are also the accounts of many universities; interestingly enough, it includes also the all the local public service local newscasts. the firebrick subcommunity contains accounts related to the as roma football club; analogously in dark red official accounts of ac milan and its players. the slate blue subcommunity is mainly composed by the official accounts of radio and tv programs of mediaset, the main private italian broadcasting company, together with singers and musicians. other smaller subcommunities includes other sport federations, and sports pundits. finally, the sky blue community is mainly composed by italian embassies around the world. the navy subpartition contains also the official accounts of the president of the republic, the italian minister of defense and the one of the commissioner for economy at eu and former prime minister, paolo gentiloni. in the study of every phenomenon, it is of utmost importance to distinguish the relevant information from the noise. 
here, we remind a framework to obtain a validated monopartite retweet network of users: the validation accounts the information carried by not only the activity of the users, but also by the virality of their messages. we represented pictorially the method in fig. 1 . we define a directed bipartite network in which one layer is composed by accounts and the other one by the tweets. an arrow connecting a user u to a tweet t represents the u writing the message t. the arrow in the opposite direction means that the user u is retweeting the message t. to filter out the random noise from this network, we make use of the directed version of the bicm, i.e. the bipartite directed configuration model (bidcm [15] ). the projection procedure is then, analogous to the one presented in the previous subsection: it is pictorially displayed in the fig. 1 . briefly, consider the couple of users u 0 and u 1 and consider the number of message written by u 0 and shared u 1 . then, calculate which is the distribution of the same measure according with the bidcm: if the related p-value is statistically significant, i.e. if the number of u 0 's tweets shared by u 1 is much more than expected by the bidcm, we project a (directed) link from u 0 to u 1 . summarising, the comparison of the observation on the real network with the bidcm permits to uncover all contributions that cannot originate from the constraints of the null-model. using the technique described in subsection 5.1 of the main text, we are able to assign to almost all verified users a community, based on the perception of the unverified users. due to the fact that the identity of verified users are checked by twitter, we have the possibility of controlling our groups. indeed, as we will show in the following, the network obtained via the bipartite projection provides a reliable description regarding the closeness of opinions and role in the social debate. how can we use this information in order to infer the orientation of non verified users? in the reference [6] we used the tags obtained for both verified and unverified users in the bipartite network described in subsection 5.1 of the main real network c) e) figure 1 schematic representation of the projection procedure for bipartite directed network. a) an example of a real directed bipartite network. for the actual application, the two layers represent twitter accounts (turquoise) and posts (gray). a link from a turquoise node to a gray one represents that the post has been written by the user; a link in the opposite direction represents a retweet by the considered account. b) the bipartite directed configuration model (bidcm) ensemble is defined. the ensemble includes all the link realisations, once the number of nodes per layer has been fixed. c) we focus our attention on nodes i and j and count the number of directed common neighbours (in magenta both the nodes and the links to their common neighbours), i.e., the number of posts written by i and retweeted by j. subsequently, d) we compare this measure on the real network with the one on the ensemble: if this overlap is statistically significant with respect to the bidcm, e) we have a link from i to j in the projected network. text and propagated those labels accross the network. in a recent analysis, we observed that other approaches are more stable [16] : in the present manuscript we make use of the most stable algorithm. we use the label propagation as proposed in [22] on the directed validated network. 
indeed, the validated directed network in the present appendix we remind the main steps for the definition of an entropy based null model; the interested reader can refer to the review [8] . we start by revising the bipartite configuration model [23] , that has been used for detecting the network of similarities of verified users. we are then going to examine the extension of this model to bipartite directed networks [15] . finally, we present the general methodology to project the information contained in a -directed or undirected-bipartite network, as developed in [24] . let us consider a bipartite network g * bi , in which the two layers are l and γ. define g bi the ensemble of all possible graphs with the same number of nodes per layer as in g * bi . it is possible to define the entropy related to the ensemble as [20] : where p (g bi ) is the probability associated to the instance g bi . now we want to obtain the maximum entropy configuration, constraining some relevant topological information regarding the system. for the bipartite representation of verified and unverified user, a crucial ingredient is the degree sequence, since it is a proxy of the number of interactions (i.e. tweets and retweets) with the other class of accounts. thus in the present manuscript we focus on the degree sequence. let us then maximise the entropy (1), constraining the average over the ensemble of the degree sequence. it can be shown, [24] , that the probability distribution over the ensemble is where m iα represent the entries of the biadjacency matrix describing the bipartite network under consideration and p iα is the probability of observing a link between the nodes i ∈ l and α ∈ γ. the probability p iα can be expressed in terms of the lagrangian multipliers x and y for nodes on l and γ layers, respectively, as in order to obtain the values of x and y that maximize the likelihood to observe the real network, we need to impose the following conditions [13, 26] where the * indicates quantities measured on the real network. actually, the real network is sparse: the bipartite network of verified and unverified users has a connectance ρ 3.58 × 10 −3 . in this case the formula (3) can be safely approximated with the chung-lu configuration model, i.e. where m is the total number of links in the bipartite network. in the present subsection we will consider the case of the extension of the bicm to direct bipartite networks and highlight the peculiarities of the network under analysis in this representation. the adjancency matrix describing a direct bipartite network of layers l and γ has a peculiar block structure, once nodes are order by layer membership (here the nodes on l layer first): where the o blocks represent null matrices (indeed they describe links connecting nodes inside the same layer: by construction they are exactly zero) and m and n are non zero blocks, describing links connecting nodes on layer l with those on layer γ and viceversa. in general m = n, otherwise the network is not distinguishable from an undirected one. we can perform the same machinery of the section above, but for the extension of the degree sequence to a directed degree sequence, i.e. considering the in-and out-degrees for nodes on the layer l, (here m iα and n iα represent respectively the entry of matrices m and n) and for nodes on the layer γ, the definition of the bipartite directed configuration model (bidcm, [15] ), i.e. 
the extension of the bicm above, follows closely the same steps described in the previous subsection. interestingly enough, the probabilities relative to the presence of links from l to γ are independent on the probabilities relative to the presence of links from γ to l. if q iα is the probability of observing a link from node i to node α and q iα the probability of observing a link in the opposite direction, we have where x out i and x in i are the lagrangian multipliers relative to the node i ∈ l, respectively for the out-and the in-degrees, and y out α and y in α are the analogous for α ∈ γ. in the present application we have some simplifications: the bipartite directed network representation describes users (on one layer) writing and retweeting posts (on the other layer). if users are on the layer l and posts on the opposite layer and m iα represents the user i writing the post α, then k in α = 1 ∀α ∈ γ, since each message cannot have more than an author. notice that, since our constraints are conserved on average, we are considering, in the ensemble of all possible realisations even instances in which k in α > 1 or k in α = 0, or, otherwise stated, non physical; nevertheless the average is constrained to the right value, i.e. 1. the fact that k in α is the same for every α allows for a great simplification of the probability per link on m: where n γ is the total number of nodes on the γ layer. the simplification in (9) is extremely helpful in the projected validation of the bipartite directed network [2] . the information contained in a bipartite -directed or undirected-network, can be projected onto one of the two layers. the rationale is to obtain a monopartite network encoding the non trivial interactions among the two layers of the original bipartite network. the method is pretty general, once we have a null model in which probabilities per link are independent, as it is the case of both bicm and bidcm [24] . the first step is represented by the definition of a bipartite motif that may capture the non trivial similarity (in the case of an undirected bipartite network) or flux of information (in the case of a directed bipartite network). this quantity can be captured by the number of v −motifs between users i and j [11, 23] , or by its direct extension (note that v ij = v ji ). we compare the abundance of these motifs with the null models defined above: all motifs that cannot be explained by the null model, i.e. whose p-value are statistically significance, are validated into the projection on one of the layers [24] . in order to assess the statistically significance of the observed motifs, we calculate the distribution associated to the various motifs. for instance, the expected value for the number of v-motifs connecting i and j in an undirected bipartite network is where p iα s are the probability of the bicm. analogously, where in the last step we use the simplification of (9) [2] . in both the direct and the undirect case, the distribution of the v-motifs or of the directed extensions is poisson binomial one, i.e. a binomial distribution in which each event shows a different probability. in the present case, due to the sparsity of the analysed networks, we can safely approximate the poisson-binomial distribution with a poisson one [14] . in order to state the statistical significance of the observed value, we calculate the related p-values according to the relative null-models. 
once we have a p-value for every detected v-motif, the related statistical significance can be established through the false discovery rate (fdr) procedure [3] . respect to other multiple test hypothesis, fdr controls the number of false positives. in our case, all rejected hypotheses identify the amount of v-motifs that cannot be explained only by the ingredients of the null model and thus carry non trivial information regarding the systems. in this sense, the validated projected network includes a link for every rejected hypothesis, connecting the nodes involved in the related motifs. in the main text, we solved the problem of assigning the orientation to all relevant users in the validated retweet network via a label propagation. the approach is similar, but different to the one proposed in [6] , the differences being in the starting labels, in the label propagation algorithm and in the network used. in this section we will revise the method employed in the present article, as compared it to the one in [6] and evaluate the deviations from other approaches. first step of our methodology is to extract the polarisation of verified users from the bipartite network, as described in section 5.1 of the main text, in order to use it as seed labels in the label propagation. in reference [6] , a measure of the "adherence" of the unverified users towards the various communities of verified users was used in order to infer their orientation, following the approach in [2] , in turn based on the polarisation index defined in [4] . this approach was extremely performing when practically all unverified users interact at least once with verified one, as in [2] . while still having good performances in a different dataset as the one studied in [6] , we observed isolated deviations: it was the case of users with frequent interactions with other unverified accounts of the same (political) orientation, randomly retweeting a different discursive community verified user. in this case, focusing just on the interaction with verified accounts, those nodes were assigned a wrong orientation. the labels for the polarisation of the unverified users defined [6] were subsequently used as seed labels in the label propagation. due to the possibility described above of assigning wrongly labels to unverified accounts, in the present paper, we consider only the tags of verified users, since they pass a strict validation procedure and are more stable. in order to compare the results obtained with the various approaches, we calculated the variation of information (vi, [17] ). v i considers exactly the different in information contents captured by two different partition, as consider by the shannon entropy. results are reported in the matrix in figure 2 for the 23th of february (results are similar for other days). even when using the weighted retweet network as "exact" result, the partition found by the label propagation of our approach has a little loss of information, comparable with the one of using an unweighted approach. indeed, the results found by the various community detection algorithms show little agreement with the label propagation ones. nevertheless, we still prefer the label propagation procedure, since the validated projection on the layer of verified users is theoretically sound and has a non trivial interpretation. the main result of this work quantifies the level of diffusion on twitter of news published by sources considered scarcely reputable. 
academia, governments, and news agencies are working hard to classify information sources according to criteria of credibility and transparency of published news. this is the case, for example, of newsguard, which we used for the tagging of the most frequent domains in the directed validated network obtained according to the methodology presented in the previous sections. as introduced in subsection 3.2 of the main text, the newsguard browser extension and mobile app [19] (https://www.newsguardtech.com/) offers a reliability rating for the most popular newspapers in the world, summarizing with a numerical score the level of credibility and journalistic transparency of the newspaper. with the same philosophy, but oriented towards us politics, the fact-checking site politifact.com reports with a 'truth meter' the degree of truthfulness of original claims made by politicians, candidates, their staffs, and, more in general, protagonists of us politics. one of the oldest fact-checking websites dates back to 1994: snopes.com, in addition to political figures, is a fact-checker for hoaxes and urban legends. generally speaking, a fact-checking site has behind it a multitude of editors and journalists who, with a great deal of energy, manually check the reliability of a news item, or of its publisher, by evaluating criteria such as, e.g., the tendency to correct errors, the nature of the newspaper's finances, and whether there is a clear differentiation between opinions and facts. it is worth noting that recent attempts have tried to automatically find articles worthy of being fact-checked. for example, work in [1] uses a supervised classifier, based on an ensemble of neural networks and support vector machines, to figure out which politicians' claims need to be debunked, and which have already been debunked. despite the tremendous effort of stakeholders to keep the fact-checking sites up to date and functioning, disinformation resists debunking due to a combination of factors. there are psychological aspects, like the quest for belonging to a community and getting reassuring answers, the adherence to one's viewpoint, a native reluctance to change opinion [28, 29], and the formation of echo chambers [10], where people polarize their opinions as they are insulated from contrary perspectives: these are key factors for people to contribute to the success of disinformation spreading [7, 9]. moreover, researchers demonstrate how the spreading of false news is strategically supported by the massive and organized use of trolls and bots [25]. beyond the need to educate users towards a conscious consumption of online information through means other than technological solutions, there is a series of promising works that exploit classifiers based on machine learning or on deep learning to tag a news item as credible or not. one interesting approach is based on the analysis of spreading patterns on social platforms. monti et al. recently provided a deep learning framework for detection of fake news cascades [18]. following the example of vosoughi et al. [27], a ground truth is acquired by collecting twitter cascades of verified false and true rumors. employing a novel deep learning paradigm for graph-based structures, cascades are classified based on user profile, user activity, network and spreading, and content. the main result of the work is that 'a few hours of propagation are sufficient to distinguish false news from true news with high accuracy'.
this result has been confirmed by other studies too. work in [30], by zhao et al., examines diffusion cascades on weibo and twitter: focusing on topological properties, such as the number of hops from the source and the heterogeneity of the network, the authors demonstrate that the networks in which fake news is diffused have markedly different characteristics from those diffusing genuine information. the investigation of diffusion networks thus appears to be a promising path to follow for fake news detection. this is also confirmed by pierri et al. [21]: here too, the goal is to classify news articles pertaining to bad and genuine information 'by solely inspecting their diffusion mechanisms on twitter'. even in this case, results are impressive: a simple logistic regression model is able to correctly classify news articles with high accuracy (auroc up to 94%).
references:
- the political blogosphere and the 2004 u.s. election: divided they blog
- coronavirus: 'deadly masks' claims debunked (2020)
- coronavirus: bill gates 'microchip' conspiracy theory and other vaccine claims fact-checked
- extracting significant signal of news consumption from social networks: the case of twitter in italian political elections
- fast unfolding of communities in large networks
- influence of fake news in twitter during the 2016 us presidential election
- how does junk news spread so quickly across social media? algorithms, advertising and exposure in public life
- the role of bot squads in the political propaganda on twitter
- tracking social media discourse about the covid-19 pandemic: development of a public coronavirus twitter data set
- the statistical physics of real-world networks
- political polarization on twitter
- predicting the political alignment of twitter users
- partisan asymmetries in online political activity
- echo chambers: emotional contagion and group polarization on facebook
- mapping social dynamics on facebook: the brexit debate
- tackling covid-19 disinformation - getting the facts right (2020)
- speech of vice president věra jourová on countering disinformation amid covid-19 - from pandemic to infodemic
- filter bubbles, echo chambers, and online news consumption
- community detection in graphs
- finding users we trust: scaling up verified twitter users using their communication patterns
- opinion dynamics on interacting networks: media competition and social influence
- near linear time algorithm to detect community structures in large-scale networks
- randomizing bipartite networks: the case of the world trade web
- inferring monopartite projections of bipartite networks: an entropy-based approach
- maximum-entropy networks. pattern detection, network reconstruction and graph combinatorics
- journalists on twitter: self-branding, audiences, and involvement of bots
- emotional dynamics in the age of misinformation
- debunking in a world of tribes
- coronavirus, a milano la fuga dalla "zona rossa": folla alla stazione di porta garibaldi
- coronavirus, l'illusione della grande fuga da milano. ecco i veri numeri degli spostamenti verso sud
- coronavirus: italian army called in as crematorium struggles to cope with deaths
- coronavirus: italy extends emergency measures nationwide
- italy sees first fall of active coronavirus cases: live updates
- coronavirus in italia, verso primo ok spostamenti dal 4/5, non tra regioni
- italy's health care system groans under coronavirus - a warning to the world
- negli ospedali siamo come in guerra. a tutti dico: state a casa
- coronavirus: ordini degli infermieri, 4 mila i contagiati
references (supplementary material):
- automatic fact-checking using context and discourse information
- extracting significant signal of news consumption from social networks: the case of twitter in italian political elections
- controlling the false discovery rate: a practical and powerful approach to multiple testing
- users polarization on facebook and youtube
- fast unfolding of communities in large networks
- the role of bot squads in the political propaganda on twitter
- the psychology behind fake news
- the statistical physics of real-world networks
- fake news: incorrect, but hard to correct. the role of cognitive ability on the impact of false information on social impressions
- echo chambers: emotional contagion and group polarization on facebook
- graph theory (graduate texts in mathematics)
- resolution limit in community detection
- maximum likelihood: extracting unbiased information from complex networks. phys rev e - stat nonlinear
- on computing the distribution function for the poisson binomial distribution
- reconstructing mesoscale network structures
- the contagion of ideas: inferring the political orientations of twitter accounts from their connections
- comparing clusterings by the variation of information
- fake news detection on social media using geometric deep learning
- at the epicenter of the covid-19 pandemic and humanitarian crises in italy: changing perspectives on preparation and mitigation
- near linear time algorithm to detect community structures in large-scale networks
- randomizing bipartite networks: the case of the world trade web
- inferring monopartite projections of bipartite networks: an entropy-based approach
- the spread of low-credibility content by social bots
- analytical maximum-likelihood method to detect patterns in real networks
- a question of belonging: race, social fit, and achievement
- cognitive and social consequences of the need for cognitive closure
- fake news propagate differently from real news even at early stages of spreading

analysis of online misinformation during the peak of the covid-19 pandemics in italy: supplementary material. guido caldarelli, rocco de nicola, marinella petrocchi, manuel pratelli and fabio saracco.

there is another difference in the label propagation used here against the one in [6]: in the present paper we used the label propagation of [22], while the one in [6] was quite home-made. as in reference [22], the seed labels of [6] are fixed, i.e. they are not allowed to change [17]. the main difference is that, in the case of a draw among the labels of the first neighbours, in [22] a tie is broken randomly, while in the algorithm of [6] the label is not assigned and goes into a new run, with the newly assigned labels. moreover, the update of labels in [22] is asynchronous, while it is synchronous in [6]. we opted for the one in [22] because it is a standard among label propagation algorithms, being stable, more studied, and faster [18]. finally, differently from the procedure in [6], we applied the label propagation not to the entire (undirected version of the) retweet network, but to the (undirected version of the) validated one. (the intent of choosing the undirected version is that, in both cases in which a generic account is significantly retweeting or being retweeted by another one, the two accounts probably share some vision of the phenomena under analysis; thus we are not interested in the direction of the links in this situation.)
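the seed-based label propagation described above can be sketched as follows: verified-user labels are fixed, unlabelled nodes repeatedly take the most frequent label among their neighbours, ties are broken at random, and updates are asynchronous, in the spirit of [22]. the data, names and structure below are ours and the snippet is only an illustration, not the code used in the paper.

```python
import random

def propagate_labels(adjacency, seed_labels, max_iters=100, seed=0):
    """asynchronous label propagation with fixed seed labels and random tie-breaking."""
    rng = random.Random(seed)
    labels = dict(seed_labels)                 # seed labels are never overwritten
    free_nodes = [n for n in adjacency if n not in seed_labels]
    for _ in range(max_iters):
        changed = False
        rng.shuffle(free_nodes)                # random asynchronous visiting order
        for node in free_nodes:
            counts = {}
            for neigh in adjacency[node]:
                lab = labels.get(neigh)
                if lab is not None:
                    counts[lab] = counts.get(lab, 0) + 1
            if not counts:
                continue                       # no labelled neighbour yet
            best = max(counts.values())
            new_label = rng.choice([lab for lab, c in counts.items() if c == best])
            if labels.get(node) != new_label:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels

# toy undirected validated retweet network; v1 and v2 play the role of verified users
adjacency = {"v1": {"a", "b"}, "v2": {"c"},
             "a": {"v1", "b"}, "b": {"v1", "a", "c"}, "c": {"v2", "b"}}
print(propagate_labels(adjacency, {"v1": "community_1", "v2": "community_2"}))
```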
the rationale for using the validated network is to reduce the calculation time (due to the dimensions of the dataset) while still obtaining an accurate result. while the previous differences from the procedure of [6] are dictated by conservativeness (the choice of the seed labels) or by adherence to a standard (the choice of [22]), this last one may be debatable: why should choosing the validated network return "better" results than those calculated on the entire retweet network? we considered the case of a single day (in order to reduce the calculation time) and studied six different approaches:
1. a louvain community detection [5] on the undirected version of the validated network of retweets;
2. a louvain community detection on the undirected version of the unweighted retweet network;
3. a louvain community detection on the undirected version of the weighted retweet network, in which the weights are the number of retweets from user to user;
4. a label propagation a la raghavan et al. [22] on the directed validated network of retweets;
5. a label propagation a la raghavan et al. on the (unweighted) retweet network;
6. a label propagation a la raghavan et al. on the weighted retweet network, the weights being the number of retweets from user to user.
due to the order dependence of louvain [12], we ran the louvain algorithm several times after reshuffling the order of the nodes, taking the partition into communities that maximises the modularity. similarly, the label propagation of [22] has a certain level of randomness: we ran it several times and chose the most frequent label assignment for every node.

key: cord-280648-1dpsggwx authors: gillen, david; morrison, william g. title: regulation, competition and network evolution in aviation date: 2005-05-31 journal: journal of air transport management doi: 10.1016/j.jairtraman.2005.03.002 sha: doc_id: 280648 cord_uid: 1dpsggwx

abstract our focus is the evolution of business strategies and network structure decisions in the commercial passenger aviation industry. the paper reviews the growth of hub-and-spoke networks as the dominant business model following deregulation in the latter part of the 20th century, followed by the emergence of value-based airlines as a global phenomenon at the end of the century. the paper highlights the link between airline business strategies and network structures, and examines the resulting competition between divergent network structure business models. in this context we discuss issues of market structure stability and the role played by competition policy. taking a snapshot of the north american commercial passenger aviation industry in the spring of 2003, the signals on firm survivability and industry equilibrium are mixed; some firms are under severe stress while others are succeeding in spite of the current environment. in the us, we find united airlines in chapter 11 and us airways emerging from chapter 11 bankruptcy protection. we find american airlines having just reported the largest financial loss in us airline history, while delta and northwest airlines, along with smaller carriers like alaska, america west and several regional carriers, are restructuring and employing cost reduction strategies. we also find continental airlines surviving after having been in and out of chapter 11 in recent years, while southwest airlines continues to be profitable.
in canada, we find air canada in companies' creditors arrangement act (ccaa) bankruptcy protection (the canadian version of chapter 11), after reporting losses of over $500 million for the year 2002 and in march 2003. meanwhile westjet, like southwest, continues to show profitability, while two new carriers, jetsgo and canjet (reborn), have entered the market. looking at europe, the picture is much the same, with large full-service airlines (fsas hereafter) such as british airways and lufthansa sustaining losses and suffering financial difficulties, while value-based airlines (vbas) like ryanair and easyjet continue to grow and prosper. until recently, asian air travel markets were performing somewhat better than in north america; however, the severe acute respiratory syndrome (sars) epidemic had a severe negative effect on many asian airlines. clearly, the current environment is linked to several independent negative demand shocks that have hit the industry hard. the slowdown was already underway in 2001, prior to the 9-11 tragedy, which gave rise to the 'war on terrorism' followed by the recent military action in iraq. finally, the sars virus has not only severely diminished the demand for travel to areas where sars has broken out and led to fatalities, but it has also helped to create yet another reason for travellers to avoid visiting airports or travelling on aircraft, based on a perceived risk of infection. all of these factors have created an environment where limited demand and price competition has favoured the survival of airlines with a low-cost, low-price focus. in this paper we examine the evolution of air transport networks after economic deregulation, and the connection between networks and business strategies, in an environment where regulatory changes continue to change the rules of the game. the deregulation of the us domestic airline industry in 1978 was the precursor of similar moves by most other developed economies in europe (beginning 1992-1997), canada (beginning in 1984), australia (1990) and new zealand (1986). the argument was that the industry was mature and capable of surviving under open market conditions subject to the forces of competition rather than under economic regulation. prior to deregulation in the us, some airlines had already organized themselves into hub-and-spoke networks. delta airlines, for example, had organized its network into a hub at atlanta with multiple spokes. other carriers had evolved more linear networks with generally full connectivity and were reluctant to shift to hub-and-spoke for two reasons. first, regulations required permission to exit markets, and such exit requests would likely lead to another carrier entering to serve 'public need'. secondly, under regulation it was not easy to achieve the demand-side benefits associated with networks because of regulatory barriers to entry. in the era of economic regulation, the choice of frequency and ancillary service competition was a direct result of carriers being constrained in fare and market entry competition. with deregulation, airlines gained the freedom to adapt their strategies to meet market demand and to reorganize themselves spatially. consequently, hub-and-spoke became the dominant choice of network structure. the hub-and-spoke network structure was perceived to add value on both the demand and cost side. on the demand side, passengers gained access to broad geographic and service coverage, with the potential for frequent flights to a large number of destinations.
large carriers provided lower search and transaction costs for passengers and reduced the time costs of connections. they also created travel products with high convenience and service levels: a reduced likelihood of lost luggage, in-flight meals and bar service, for example. the fsa business model thus favoured high service levels, which helped to build the market at a time when air travel was an unusual or infrequent activity for many individuals. building the market not only meant encouraging more air travel but also expanding the size of the network, which increased connectivity and improved aircraft utilization. on the cost side, the industry was shown to have few if any economies of scale, but there were significant economies of density. feeding spokes from smaller centres into a hub airport enabled full service carriers to operate large aircraft between major centres with passenger volumes that lowered costs per available seat. an early exception to the hub-and-spoke network model was southwest airlines. in the us, southwest airlines was the original 'vba', representing a strategy designed to build the market for consumers whose main loyalty is to low-price travel. this proved to be a sustainable business model and southwest's success was to create a blueprint for the creation of other vbas around the world. the evolution has also been assisted by the disappearance of charter airlines with deregulation, as fsas served a larger scope of the demand function through their yield management systems. (footnote, continued: ... of economies from manufacturing to service economies, and service industries are more aviation intensive than manufacturing. developed economies, as in europe and north america as well as australia and new zealand, have an increasing proportion of gdp provided by service industries, particularly tourism. one sector that is highly aviation intensive is the high technology sector. it is footloose and therefore can locate just about anywhere; the primary input is human capital. it can locate assembly in low-cost countries, and this was enhanced under new trade liberalization with the wto.) (footnote 4: canada's deregulation was not formalised under the national transportation act until 1987. australia and new zealand signed an open skies agreement in 2000, which created a single australia-new zealand air market, including the right of cabotage. canada and the us signed an open skies agreement as well in 1996, but not nearly so liberal as the australian-new zealand one.) (footnote 5: in contrast to deregulation within domestic borders, international aviation has been slower to introduce unilateral liberalization. consequently the degree of regulation varies across routes, fares, capacity, entry points (airports) and other aspects of airline operations depending upon the countries involved. the us-uk, german, netherlands and korea bilaterals are quite liberal, for example. in some cases, however, most notably in australasia and europe, there have been regional air trade pacts, which have deregulated markets between and within countries. the open skies agreement between canada and the us is similar to these regional agreements.) meanwhile, the benefits of operating a large hub-and-spoke network in a growing market led to merger waves in the us (mid-1980s) and in canada (late-1980s) and consolidation in other countries of the world. large firms had advantages from the demand side, since they were favoured by many passengers and, most importantly, by high yield business passengers.
they also had advantages from the supply side due to economies of density and economies of stage length (footnote: unit costs decrease as stage length increases, but at a diminishing rate). in most countries other than the us there tended to be high industry concentration, with one or at most two major carriers. it was also true that in almost every country except the us there was a national (or most favoured) carrier that was privatized at the time of deregulation or soon thereafter. in canada in 1995 the open skies agreement with the us was brought in (footnote: there was a phase-in period for select airports in canada, as well as different initial rules for us and canadian carriers). around this time a new generation of vbas emerged. in europe, ryanair and easyjet experienced rapid and dramatic growth following deregulation within the eu. some fsas responded by creating their own vbas: british airways created go, klm created buzz and british midland created bmibaby, for example. westjet airlines started service in western canada in 1996 serving three destinations and has grown continuously since that time. canadian airlines, faced with increased competition in the west from westjet as well as aggressive competition from air canada on longer haul routes, was in severe financial difficulty by the late 1990s. a bidding war for a merged air canada and canadian was initiated and in 2000 air canada emerged the winner with a 'winner's curse', having assumed substantial debt and constraining service and labour agreements. canada now had one fsa and three or four smaller airlines, two of which were vbas. in the new millennium, some consolidation has begun to occur amongst vbas in europe, with the merger of easyjet and go in 2002 and the acquisition of buzz by ryanair in 2003. more importantly perhaps, the vba model has emerged as a global phenomenon, with vba carriers such as virgin blue in australia, gol in brazil, germania and hapag-lloyd in germany and air asia in malaysia. looking at aviation markets since the turn of the century, casual observation would suggest that a combination of market circumstances created an opportunity for the propagation of the vba business model, with a proven blueprint provided by southwest airlines. however, a question remains as to whether something else more fundamental has been going on in the industry to cause the large airlines and potentially larger alliances to falter and fade. if the causal impetus of the current crisis was limited to cyclical macro factors combined with independent demand shocks, then one would expect the institutions that were previously dominant to re-emerge once demand rebounds. if this seems unlikely it is because the underlying market environment has evolved into a new market structure, one in which old business models and practices are no longer viable or desirable. the evolution of business strategies and markets, like biological evolution, is subject to the forces of selection. airlines that cannot or do not adapt their business model to long-lasting changes in the environment will disappear, to be replaced by those companies whose strategies better fit the evolved market structure. but to understand the emerging strategic interactions and outcomes of airlines one must appreciate that in this industry, business strategies are necessarily tied to network choices. the organization of production spatially in air transportation networks confers both demand and supply side network economies, and the choice of network structure by a carrier necessarily reflects aspects of its business model and will exhibit different revenue and cost drivers.
in this section we outline important characteristics of the business strategy and network structures of two competing business models: the full service strategy (utilizing a hub-and-spoke network) and the low cost strategy model, which operates under a partial point-to-point network structure. the full service business model is predicated on broad service in product and in geography, bringing customers to an array of destinations with flexibility and available capacity to accommodate different routings, no-shows and flight changes. the broad array of destinations and multiple spokes requires a variety of aircraft with differing capacities and performance characteristics. the variety increases capital, labour and operating costs. this business model labours under cost penalties and lower productivity of hub-and-spoke operations, including long aircraft turns, connection slack, congestion, and online connections for personnel and baggage. these features take time, resources and labour, all of which are expensive and are not easily avoided. the hub-and-spoke system is also conditional on airport and airway infrastructure, information provision through computer reservation systems and highly sophisticated yield management systems. the network effects that favoured hub and spoke over linear connected networks lie in the compatibility of flights and the internalization of pricing externalities between links in the network. a carrier offering flights from city a to city b through city h (a hub) is able to collect traffic from many origins and place them on a large aircraft flying from h to b, thereby achieving density economies. in contrast, a carrier flying directly from a to b can achieve some direct density economies but, more importantly, gains aircraft utilization economies. in the period following deregulation, density economies were larger than aircraft utilization economies on many routes, owing to the limited size of many origin and destination markets. on the demand side, fsas could maximize the revenue of the entire network by internalizing the externalities created by complementarities between links in the network. in our simple example of a flight from a to b via hub h, the carrier has to consider how pricing of the ah link might affect the demand for service on the hb link. if the service were offered by separate companies, the company serving ah would take no account of how the fare it charged would influence the demand on the hb link, since it has no right to the revenue on that link. the fsa business model thus creates complexity as the network grows; making the system work effectively requires additional features, most notably yield management and product distribution. in the period following deregulation, technological progress provided the means to manage this complexity, with large information systems and in particular computer reservation systems. computer reservation systems make possible sophisticated flight revenue management, the development of loyalty programs, effective product distribution, revenue accounting and load dispatch. they also drive aircraft capacity, frequency and scheduling decisions.
as a consequence, the fsa business model places relative importance on managing complex schedules and pricing systems, with a focus on profitability of the network as a whole rather than of individual links. the fsa business model favours a high level of service and the creation of a large service bundle (in-flight entertainment, meals, drinks, large numbers of ticketing counters at the hub, etc.) which serves to maximize the revenue yields from business and long-haul travel. an important part of the business service bundle is the convenience that is created through fully flexible tickets and high flight frequencies. high frequencies can be developed on spoke routes using smaller feed aircraft, and the use of a hub with feed traffic from spokes allows more flights for a given traffic density and cost level. more flights reduce total trip time, with increased flexibility. thus, the hub-and-spoke system leads to the development of feed arrangements along spokes. indeed, these domestic feeds contributed to the development of international alliances in which one airline would feed another, utilizing the capacity of both to increase service and pricing. like the fsa model, the vba business plan creates a network structure that can promote connectivity but, in contrast, trades off lower levels of service, measured both in capacity and frequency, against lower fares. in all cases the structure of the network is a key factor in the success of vbas, even in the current economic and demand downturn. vbas tend to exhibit common product and process design characteristics that enable them to operate at a much lower cost per unit of output. on the demand side, vbas have created a unique value proposition through product and process design that enables them to eliminate, or 'unbundle', certain service features in exchange for a lower fare. these service feature trade-offs are typically: less frequency, no meals, no free (or any) alcoholic beverages, more passengers per flight attendant, no lounge, no interlining or code-sharing, electronic tickets, no pre-assigned seating, and less leg room. most importantly, the vba does not attempt to connect its network, although there may be connecting nodes. it also has passengers use their own time to access, or feed, the airport. there are several key areas in process design (the way in which the product is delivered to the consumer) for a vba that result in significant savings over a full service carrier. one of the primary forms of process design savings is in the planning of point-to-point city pair flights, focusing on the local origin and destination market rather than developing hub systems. in practice, this means that flights are scheduled without connections and stops in other cities. this could also be considered product design, as the passenger notices the benefit of travelling directly to their desired destination rather than through a hub. rather than having a bank of flights arrive at airports at the same time, low-cost carriers spread out the staffing, ground handling, maintenance, food services, bridge and gate requirements at each airport to achieve savings. another less obvious but important cost saving can be found in the organization design and culture of the company. it is worth noting at this point that the innovator of product, process, and organizational redesign is generally accepted to be southwest airlines.
many low-cost start-ups have attempted to replicate that model as closely as possible; however, the hardest area to replicate has proved to be the organization design and culture. extending the 'look and feel' to the aircraft, there is a noticeable strategy for low-cost airlines. successful vbas focus on a homogeneous fleet type (mostly the boeing 737, but this is changing; e.g. jetblue with an a320 fleet). the advantages of a 'common fleet' are numerous. purchasing power is one: with the obvious exception of the aircraft itself, heavy maintenance, parts, supplies and even safety cards are purchased in one model for the entire fleet. training costs are reduced: with only one fleet type, not only do employees focus on one aircraft and become specialists, but economies of density can be achieved in training. the choice of airports is typically another source of savings. low-cost carriers tend to focus on secondary airports that have excess capacity and are willing to forego some airside revenues in exchange for non-airside revenues that are developed as a result of the traffic stimulated by low-cost airlines. in simpler terms, secondary airports charge less for landing and terminal fees and make up the difference with commercial activity created by the additional passengers. further, secondary airports are less congested, allowing for faster turn times and more efficient use of staff and aircraft. the average taxi times shown in table 1 (below) are evidence of this with respect to southwest in the us, and one only has to consider the significant taxi times at pearson airport in toronto to see why hamilton is such an advantage for westjet. essentially, vbas have attempted to reduce the complexity and resulting cost of the product by unbundling those services that are not absolutely necessary. this unbundling extends to airport facilities as well, as vbas struggle to avoid the costs of expensive primary airport facilities that were designed with full service carriers in mind. while the savings in product design are the most obvious to the passenger, it is the process changes that have produced greater savings for the airline. the design of low-cost carriers facilitates some revenue advantages in addition to the many cost advantages, but it is the cost advantages that far outweigh any revenue benefits achieved. these revenue advantages include simplified fare structures with 3-4 fare levels, a simple 'yield' management system, and the ability to have one-way tickets. the simple fare structure also facilitates internet booking. however, what is clearly evident is that the choice of network is not independent of the firm's strategy. the linear point-to-point network of vbas allows them to achieve both cost and revenue advantages. table 1 below compares key elements of operations for us airlines' 737 fleets. one can readily see a dramatic cost advantage for southwest airlines compared to fsas. in particular, southwest is a market leader in aircraft utilization and average taxi times. if one looks at the differences in the us between vbas like southwest and fsas, there is a 2:1 cost difference. this difference is similar to what is found in canada between westjet and air canada, as well as in europe. these carriers buy fuel and capital in the same market, and although there may be some difference between carriers due to hedging, for example, these are not structural or permanent differences. the vast majority of the cost difference relates to product and process complexity.
this complexity is directly tied to the design of their network structure. table 2 compares cost drivers for fsas and vbas in europe. the table shows the key underlying cost drivers and where a vba like ryanair has an advantage over fsas in crew and cabin personnel costs, airport charges and distribution costs. the first two are directly linked to network design. a hub-and-spoke network is service intensive and high cost. (footnote: it should also be noted that the vba model is not generic. different low-cost carriers do different things and, like all businesses, we see continual redefinition of the model.) even distribution cost-savings are related indirectly to network design, because vbas have simple products and use passengers' time as an input to reduce airline connection costs. in europe, ryanair has been a leader in the use of the internet for direct sales and 'e-tickets'. in the us, southwest airlines was an innovator in 'e-ticketing' and was also one of the first to initiate bookings on the internet. vbas avoid travel agency commissions and ticket production costs: in canada, westjet has stated that internet bookings account for approximately 40% of its sales, while in europe, ryanair claimed an internet sales percentage of 91% in march 2002. (footnote: westjet estimated that a typical ticket booked through their call centre costs roughly $12, while the same booking through the internet costs around 50 cents.) while most vbas have adopted direct selling via the internet, the strategy has been hard for fsas to respond to with any speed, given their complex pricing systems. recent moves by full service carriers in the us and canada to eliminate base commissions should prove to be interesting developments in the distribution chains of all airlines. to some degree, vbas have positioned themselves as market builders by creating point-to-point service in markets where it could not be warranted previously due to lower traffic volumes at higher fsa fares. vbas not only stimulate traffic in the direct market of an airport, but studies have shown that vbas have a much larger potential passenger catchment area than fsas. the catchment area is defined as the geographic region surrounding an airport from which passengers are derived. while an fsa relies on a hub-and-spoke network to create catchment, low-cost carriers create the incentive for each customer to create their own spoke to the point of departure. table 3 provides a summary of the alternative airline strategies pursued in canada and elsewhere in the world. the trend worldwide thus far indicates two quite divergent business strategies. the entrenched fsa carriers focus on developing hub-and-spoke networks, while new entrants seem intent on creating low-cost, point-to-point structures. the hub-and-spoke system places a very high value on the feed traffic brought to the hub by the spokes, especially the business traffic therein, thereby creating a complex, marketing-intensive business where revenue is the key and where production costs are high. inventory (of seats) is also kept high in order to meet the service demands of business travellers. the fsa strategy is a high cost strategy because the hub-and-spoke network structure means both reduced productivity for capital (aircraft) and labour (pilots, cabin crew, airport personnel) and increased costs due to self-induced congestion from closely spaced banks of aircraft. (footnote: airlines were able to reduce their costs to some degree by purchasing ground services from third parties. unfortunately they could not do this with other processes of the business.)
the fsa business strategy is sustainable as long as no subgroup of passengers can defect from the coalition of all passenger groups, and, recognizing this, competition between fsas included loyalty programs designed to protect each airline's coalition of passenger groups, frequent travellers in particular. the resulting market structure of competition between fsas was thus a cozy oligopoly in which airlines competed on prices for some economy fares, but practiced complex price discrimination that allowed high yields on business travel. however, the vulnerability of the fsa business model was eventually revealed through the vba strategy, which (a) picked and chose only those origin-destination links that were profitable and (b) targeted price-sensitive consumers. (footnote: vbas will also not hesitate to exit a market if it is not profitable (e.g. westjet's recent decision to leave sault ste. marie and sudbury), while fsas are reluctant to exit for fear of missing feed traffic and beyond revenue.) the potential therefore was not for business travellers to defect from fsas (loyalty programs helped to maintain this segment of demand) but for leisure travellers and other infrequent flyers to be lured away by lower fares (fig. 1). figs. 2 and 3 present schemata that help to summarize the contributory factors that propagated the fsa hub-and-spoke system and made it dominant, followed by the growth of the vba strategy along with the events and factors that now threaten the fsa model. in this section we set out a simple framework to explain the evolution of network equilibrium and show how it is tied to the business model. the linkage will depend on how the business models differ with respect to the integration of demand conditions, fixed and variable costs, and network organization. let three nodes y_1 = (0,0), y_2 = (0,1) and y_3 = (1,0) form the corner coordinates of an isosceles right triangle. the nodes and the sides of the triangle may thus represent a simple linear travel network, abstracting from congestion or other factors affecting passenger throughput at airports. this simple network structure allows us to compare three possible structures for the supply of travel services: a complete (fully connected) point-to-point network (all travel constitutes a direct link between two nodes); a hub-and-spoke network (travel between y_1 and y_3 requires a connection through y_2); and a limited (or partial) point-to-point network (selective direct links between nodes). these are illustrated in fig. 3 below (panel labels: fully connected network; hub-and-spoke network; partial point-to-point network). in the network structures featuring point-to-point travel, the utility of consumers who travel depends only on a single measure of the time duration of travel and a single measure of convenience. however, in the hub-and-spoke network, travel between y_1 and y_3 requires a connection at y_2; consequently the time duration of travel depends upon the summed distance d_{1c3} = d_{12} + d_{23} = 1 + √2. furthermore, in a hub-and-spoke network, there is interdependence between the levels of convenience experienced by travellers.
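for concreteness, the distances implied by the coordinates given above can be written out as follows (a restatement of the geometry in the text, nothing more).

```latex
% distances implied by the node coordinates y_1=(0,0), y_2=(0,1), y_3=(1,0)
\begin{align*}
d_{12} &= \lVert y_1 - y_2 \rVert = 1, \qquad
d_{13} = \lVert y_1 - y_3 \rVert = 1, \qquad
d_{23} = \lVert y_2 - y_3 \rVert = \sqrt{2}, \\
d_{1c3} &= d_{12} + d_{23} = 1 + \sqrt{2}
\quad \text{(trip from $y_1$ to $y_3$ via the hub $y_2$)}, \\
d_{13} &= 1 \quad \text{(direct point-to-point trip from $y_1$ to $y_3$)}.
\end{align*}
```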
for example, if there are frequent flights between y_1 and y_2 but infrequent flights between y_2 and y_3, then travellers will experience delays at the hub y_2. there has been an evolving literature on the economics of networks, or more properly the economics of network configuration. hendricks et al. (1995) show that economies of density can explain the hub-and-spoke system as the optimal system in airline networks. the key to the explanation lies in the level of density economies. however, when comparing with a point-to-point network, they find the hub-and-spoke network is preferred when marginal costs are high and demand is low; but given some fixed costs and intermediate values of variable costs, a point-to-point network may be preferred. shy (2001) shows that profit levels on a fully connected (fc) network are higher than on a hub-and-spoke network when variable flight costs are relatively low and passenger disutility with connections at hubs is high. what had not been explained well until pels et al. (2000) is the relative value of market size in achieving lower costs per available seat mile (asm) versus economies of density. pels et al. (2000) explore the optimality of airline networks using linear marginal cost functions and linear, symmetric demand functions, mc = 1 − bq and p = a − q/2, where b is a returns-to-density parameter and a is a measure of market size. the pels model demonstrates the importance of fixed costs in determining the dominance of one network structure over another in terms of optimal profitability. in particular, the robustness of the hub-and-spoke network configuration claimed by earlier authors (hendricks et al., 1995) comes into question. in our three-node network, the pels model generates two direct markets and one transfer market in the hub-and-spoke network, compared with three direct markets in the fully connected network. defining aggregate demand as q = q_d + q_t, the model yields expressions for optimal profits of a hub-and-spoke network and of an fc network; more generally, for a network of size n, it yields hub-and-spoke and fc optimal profits as functions of the market size a, the density parameter b, fixed costs per link f and the network size n. under what conditions would an airline be indifferent between network structures? the market size a* at which profit-maximizing prices and quantities equate the profits in the two network structures is defined by the indifference condition equating hub-and-spoke and fc optimal profits; the two possible values of a* implied by this condition represent upper and lower boundaries on the market size for which the hub-and-spoke network and the fully connected network generate the same level of optimal profits. these boundary values are of course conditional on given values of the density economies parameter (b), fixed costs (f), and the size of the network (n). these parameters can provide a partial explanation for the transition from fc to hub-and-spoke network structures after deregulation. with relatively low returns to density and low fixed costs per link, even in a growing market, the hub-and-spoke structure generates inferior profits compared with the fc network, except when the market size (a) is extremely high. however, with high fixed costs per network link, the hub-and-spoke structure begins to dominate at a relatively small market size, and this advantage is amplified as the size of the network grows. importantly, in this model dominance does not mean that the inferior network structure is unprofitable. in (a, b) space, the feasible area (defining profitability) of the fc structure encompasses that of the hub-and-spoke structure.
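the ingredients of the pels et al. specification as stated above can be collected in one display; the closed-form profit expressions of pels et al. are not reproduced in this extract, so the block below only restates the primitives and the indifference condition, and the profit symbols Π^HS and Π^FC are our shorthand.

```latex
% primitives of the pels et al. (2000) specification as stated in the text
\begin{align*}
p  &= a - \tfrac{q}{2}  && \text{linear, symmetric inverse demand ($a$ = market size)} \\
mc &= 1 - b\,q          && \text{linear marginal cost ($b$ = returns-to-density parameter)} \\
q  &= q_d + q_t         && \text{aggregate demand: direct plus transfer traffic} \\
\Pi^{HS}(a^{*}; b, f, n) &= \Pi^{FC}(a^{*}; b, f, n)
                        && \text{indifference condition defining the boundary market sizes } a^{*}
\end{align*}
```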
the fact that the fc feasible area encompasses that of the hub-and-spoke structure accommodates the observation that not all airlines adopted the hub-and-spoke network model following deregulation. where the model runs into difficulties is in explaining the emergence of limited point-to-point networks and the vba model. it is the symmetric structure of the model that renders it unable to capture some important elements of the environment in which vbas have been able to thrive. in particular, three elements of asymmetry are missing. first, the model does not allow for asymmetric demand growth between nodes in the network. with market growth, returns to density can increase on a subset of links that would have been feeder spokes in the hub-and-spoke system when the market was less developed. these links may still be infeasible for fsas but become feasible and profitable as independent point-to-point operations, provided an airline has low enough costs. second, the model does not distinguish between market demand segments and therefore cannot capture the gradual commoditization of air travel, as more consumers become frequent flyers. to many consumers today, air travel is no longer an exotic product with an air of mystery and an association with wealth and luxury. there has been an evolution of preferences that reflects the perception that air travel is just another means of getting from a to b. as the perceived nature of the product becomes more commodity-like, consumers become more price sensitive and are willing to trade off elements of service for lower prices. (footnote: to model such a demand system we need a consumer utility function of the form u = U(y, t, v) = gv(y − 2p), where y represents dollar income per period and t ∈ [0, 1] represents travel trips per period. v is an index of travel convenience, related to flight frequency, and p is the delivered price of travel. this reduces each consumer's choice problem to consumption of a composite commodity priced at $1, and the possibility of taking at most one trip per period. utility is increasing in v and decreasing in p; thus travellers are willing to trade off convenience for a lower delivered price. diversity in the willingness to trade off convenience for price would be represented by distributions for y, g and v over some range of parameter values. thus the growth of value-based demand for air travel would be represented by an increase in the density of consumers with relatively low values of these parameters.) vbas use their low fares to grow the market by competing with other activities. their low cost structure permits such a strategy. fsas cannot do this to any degree because of their choice of bundled product and higher costs. third, the model does not capture important asymmetries in the costs of fsas and vbas, such that vbas have significantly lower marginal and fixed costs. notice that the dominance of the hub-and-spoke structure over the fc network relies in part on the cost disadvantage of a fixed cost per link, which becomes prohibitive in the fc network as the number of nodes (n) gets large. vbas do not suffer from this disadvantage because they can pick and choose only those nodes that are profitable. furthermore, fsas' variable costs are higher because of the higher fixed costs associated with their choice of hub-and-spoke network. it would seem that with each new economic cycle, the evolution of the airline industry brings about an industry reconfiguration. several researchers have suggested that this is consistent with an industry structure with an 'empty core', meaning non-existence of a natural market equilibrium. button (2003) makes the argument as follows. we know that a structural shift in the composition of the industry (i.e., more low-cost airlines) is occurring, and travel substitutes are pushing down fares and traffic. we also observe that heightened security has increased the time and transacting costs of trips, and these are driving away business, particularly short-haul business trips. as legacy airlines shrink and die away, new airlines emerge and take up the employment and market slack. the notion of the 'empty core' problem in economics is essentially a characterization of markets where too few competitors generate supra-normal profits for incumbents, which then attracts entry.
however, entry creates frenzied competition in a war-of-attrition game environment: the additional competition induced by entry results in market and revenue shares that produce losses for all the market participants. consequently, entry and competition lead to exit and a solidification of market shares by the remaining competitors, who then earn supra-normal profits that once again will attract entry. while there is some intuitive appeal to explaining the dynamic nature of the industry as resulting from an innate absence of stability in the market structure, there are theoretical problems with this perspective. (footnote: the empty core theory is often applied to industries that exhibit significant economies of scale; airlines are generally thought to have limited if any scale economies, but they do exhibit significant density economies. these density economies are viewed as providing conditions for an empty core. the proponents, however, only argue on the basis of the fsa business model.) the fundamental problem with the empty core concept is that its roots lie in models of exogenous market structure that impose (via assumptions) the conditions of the empty core rather than deriving it as the result of decisions made by potential or incumbent market participants. in particular, for the empty core to perpetuate itself, entrants must be either ill-advised or have some unspecified reason for optimism. in contrast, modern industrial organization theory in economics is concerned with understanding endogenously determined market structures. in such models, the number of firms and their market conduct emerge as the result of decisions to enter or exit the market and decisions concerning capacity, quantity and price. part of the general problem of modeling an evolving market structure is to understand that incumbents and potential entrants to the market construct expectations with respect to their respective market shares in any post-entry market. a potential entrant might be attracted by the known or perceived level of profits being earned by the incumbents, but must consider how many new consumers they can attract to their product in addition to the market share that can be appropriated from the incumbent firms. this will depend in part upon natural (technological) and strategic barriers to entry, and on the response that can be expected if entry occurs. thus entry only occurs if the expected profits exceed the sunk costs of entry. while natural variation in demand conditions may induce firms to make errors in their predictions, resulting in entry and exit decisions, this is not the same thing as an 'empty core'. (footnote: this has led some to lobby for renewed government intervention in markets or anti-trust immunity for small numbers of firms. however, if natural variability is a key factor in explaining industry dynamics, there is nothing to suggest that governments have superior information or the ability to manipulate the market structure to the public benefit.)
in the air travel industry, incumbent firms (especially fsas) spend considerable resources to protect their market shares from internal and external competition. the use of frequent flier points along with marketing and branding serves this purpose. these actions raise the barriers to entry for airlines operating similar business models. what about the threat of entry or the expansion of operations by vbas? could this lead to exit by fsas? there may be legitimate concern from fsas concerning the sustainability of the full-service business model when faced with low-cost competition. in particular, the use of frequency as an attribute of service quality by fsas generates revenues from high-value business travellers, but these revenues only translate into profits when there are enough economy travellers to satisfy load factors. so, to the extent that vbas steal away market share from fsas, they put pressure on the viability of this aspect of the fsa business model. the greatest threat to the fsa from a vba is that a lower fare structure offered to a subset of passengers may induce the fsa to expand the proportion of seats offered at lower fares within the yield management system. this will occur with those vbas, like southwest, virgin blue in australia and easyjet, that do attempt to attract the business traveller from small and medium-sized firms. however, carriers like ryanair and westjet have a lower impact on the overall fare structure, since their frequencies are lower and the fsa can target the vba's flights. (footnote: there are some routes on which westjet does have high frequencies and has significantly impacted mainline carriers, e.g. calgary-abbotsford.) while fsas may find themselves engaged in price and/or quality competition, the economics of price competition with differentiated products suggests that such markets can sustain oligopoly structures in which firms earn positive profits. this occurs because the prices of competing firms become strategic complements. that is, when one firm increases its price, the profit-maximizing response of competitors is to raise price also, and there are many dimensions on which airlines can differentiate their products within the fsa business model. (footnote: a standard result in the industrial organization literature is that competing firms engaged in price competition will earn positive economic profits when their products are differentiated.) there is no question fsas have higher seat mile costs than vbas. the problem comes about when fsas view their costs as being predominantly fixed and hence marginal costs as being very low. this 'myopic' view ignores the need to cover the long-run cost of capital. this, in conjunction with the argument that network revenue contribution justifies almost all routes, leads to excessive network size and severe price discounting. (footnote: the beyond or network revenue argument is used by many fsas to justify not abandoning markets or charging very low prices on some routes. the argument is that if we did not have all the service from a to b we would never receive the revenue from passengers who are travelling from b to c. in reality this is rarely true. when fsas add up the value of each route including its beyond revenue, the aggregate far exceeds the total revenue of the company. the result is a failure to abandon uneconomic routes. the three currently most profitable airlines among the fsas, qantas, lufthansa and ba, do not use beyond revenue in assessing route profitability.) however, when economies are buoyant, high-yield traffic provides sufficient revenues to cover costs and provide substantial profit. in their assessment of the us airline industry, morrison and winston (1995) argue that the vast majority of losses incurred by fsas up to that point were due to their own fare, and fare war, strategies. it must be remembered that fsas co-exist with southwest in large numbers of markets in the us. what response would we expect from an fsa to limited competition from a vba on selected links of its hub-and-spoke network? given the fsa focus on maximization of aggregate network revenues and a cognisance that successful vba entry could steal away their base of economy-fare consumers (used to generate the frequencies that provide high yield revenues), one might expect aggressive price competition to either prevent entry or hasten the exit of a vba rival. this creates a problem for competition bureaus around the world, as vbas file an increasing number of predatory pricing charges against fsas.
similarly, the ability of fsas to compete as hub-and-spoke carriers against a competitive threat from vbas is constrained by the rules of the game as defined by competition policy. in canada, air canada faces a charge of predatory pricing for its competition against canjet and westjet in eastern canada. in the us, american airlines won its case in a predatory pricing charge brought by three vbas: vanguard airlines, sun jet and western pacific airlines. in germany, both lufthansa and deutsche ba have been charged with predatory pricing. in australia, qantas also faces predatory pricing charges. gillen and morrison (2003) points out three important dimensions of predatory pricing in air travel markets. first, demand complementarities in hub-and-spoke networks lead fsas to focus on 'beyond revenues': the revenue generated by a series of flights in an itinerary rather than the revenues generated by any one leg of the trip. fsas therefore justify aggressive price competition with a vba as a means of using the fare on that link (from an origin node to the hub node, for example) as a way of maximizing the beyond revenues created when passengers purchase travel on additional links (from the hub to other nodes in the network). the problem with this argument is that promotional pricing is implicitly a bundling argument, where the airline bundles links in the network to maximize revenue. however, when fsas compete fiercely on price against vbas, the price on that link is not limited to those customers who demand beyond travel. therefore, whether or not there is an intent to engage in predatory pricing, the effect is predatory, as it deprives the vba of customers who do not demand beyond travel. a second dimension of predatory pricing is vertical product differentiation. fsas have urged competition authorities to support the view that they have the right to match the prices of a rival vba. however, the bundle of services offered by fsas constitutes a more valuable package. in particular, the provision of frequent flyer programs creates a situation where matching the price of a vba is 'de facto' price undercutting, adjusting for product differentiation.
a recent case between the vba germania and lufthansa resulted in the bundeskartellamt (the german competition authority) imposing a price premium restriction on lufthansa that prevented the fsa from matching the vba's prices. a third important dimension of predatory pricing in air travel markets is the ability that fsas have to shift capacity around a hub-and-spoke network, which necessarily requires a mixed fleet with variable seating capacities. in standard limit output models of entry deterrence, an investment in capacity is not a credible threat of price competition if the entrant conjectures that the incumbent will not use that capacity once entry occurs. such models utilize the notion that a capacity investment is an irreversible commitment and that valuable reputation effects cannot be generated by the incumbent engaging in 'irrational' price competition. however, in a hub-and-spoke network, an fsa can make a credible threat to transfer capacity to a particular link in the network in support of aggressive price competition, with the knowledge that the capacity can be redeployed elsewhere in the network when the competitive threat is over. this creates a positive barrier to entry, with reputation effects occurring in those instances where entry occurs. such was the case when canjet and westjet met with aggressive price competition from air canada on flights from moncton, nb to toronto (air canada and canjet) and hamilton (westjet). the fsa defense against such charges is that aircraft do not constitute an avoidable cost and should not be included in any price-cost test of predation. yet while aircraft are not avoidable with respect to the network, they are avoidable to the extent they can be redeployed around the network. if aircraft costs become included in measures of predation under competition laws, this will limit the success of price competition as a competitive response by an fsa responding to vba entry. in the current environment, competition policy rules are not well specified, and the uncertainty does nothing to protect competition or to enhance the viability of air travel markets. however, there has been increased academic interest in the issue and it seems likely that, given the number of cases, some policy changes will be made (e.g., ross and stanbury, 2001). once again, the way in which fsas have responded to competition from vbas reflects their network model, and competition policy decisions that prevent capacity shifting, price matching and inclusion of 'beyond revenues' will severely constrain the set of strategies an fsa can employ without causing some fundamental changes in the business model and corresponding network structure.
6. so where are we headed?
in evolution, the notion of selection dynamics leads us to expect that unsuccessful strategies will be abandoned and successful strategies will be copied or imitated. we have already observed fsas' attempts to replicate the vba business model through the creation of fighting brands. air canada created tango, zip, jazz, and jetz. few other carriers worldwide have followed such an extensive re-branding. in europe, british airways created go and klm created buzz, both of which have since been sold and swallowed up by other vbas. qantas has created a low-cost long-haul carrier, australian airlines. meanwhile, air new zealand, lufthansa, delta and united are moving in the direction of a low-price-low-cost brand.
we are also seeing attempts by fsas to simplify their fare structures and exploit the cost savings from direct sales over the internet. thus there do seem to be evolutionary forces that are moving airlines away from the hub-and-spoke network in the direction of providing connections as distinct from true hubbing. american airlines is using a 'rolling hub' concept, which does exactly as its name implies. the purpose is to reduce costs by using fewer inputs, such as aircraft and labour, and to increase productivity. the first step is to 'de-peak' the hub, which means not having banks as tightly integrated. this reduces the amount of own congestion created at hubs by the hubbing carrier and reduces the number of aircraft needed. it also reduces service quality, but it has become clear that the traditionally high yield business passenger who valued such time-savings is no longer willing to pay the very high costs that are incurred in producing them. as an example, american airlines has reduced daily flights at chicago, and with the new schedules the total elapsed time of flights has increased by an average of 10 min. elapsed time is a competitive issue for airlines as they vie for high-yield passengers who, as a group, have abandoned the airlines and caused revenues to slump. but that 10-min average lengthening of elapsed time appears to be a negative that american is willing to accept in exchange for the benefits. at chicago, where the new spread-out schedule was introduced in april, american has been able to operate 330 daily flights with five fewer aircraft and four fewer gates and a manpower reduction of 4-5%. 21 the change has cleared the way for a smoother flow of aircraft departures and has saved taxi time. 22 it is likely that american will try to keep to the schedule and be disinclined to hold aircraft to accommodate late-arriving connection passengers. while this may appear to be a service reduction, it in fact may not be, since on-time performance has improved. 23 the evolution of networks in today's environment will be based on the choice of business model that airlines make. this is tied to evolving demand conditions, the developing technologies of aircraft and infrastructure, and the strategic choices of airlines. as we have seen, the hub-and-spoke system is an endogenous choice for fsas while the linear fc network provides the same scope for vbas. the threat to the hub-and-spoke network is a threat to the bundled product of fsas. the hub-and-spoke network will only disappear if the fsa cannot implement a lower cost structure business model and at the same time provide the service and coverage that higher yield passengers demand. the higher yield passengers have not disappeared; the market has only become somewhat smaller and certainly more fare sensitive, on average. fsas have responded to vbas by trying to copy elements of their business strategy, including reduced inflight service, low cost [fighting] brands, and more point-to-point service. however, for fsas to co-exist with vbas, and hence for hub-and-spoke networks to co-exist with linear networks, fsas must redesign their products and provide incentives for passengers that allow a reduction in product, process and organizational complexity. this is a difficult challenge since they face complex demands, resulting in the design of a complex product delivered in a complex network, which is a characteristic of the product. for example, no-shows are a large cost for fsas and they have to design their systems in such a way as to accommodate the no-shows.
this includes over-booking and the introduction of demand variability. this uncertain demand arises because airlines have induced it with the service they provide to their high-yield passengers. putting in place a set of incentives to reduce no-shows would lower costs because the complexity would be reduced or eliminated. one should have complexity only when it adds value. another costly feature of serving business travel is the need to maintain sufficient inventory of seats in markets to meet the time-sensitive demands of business travellers. the hub-and-spoke structure is complex, the business processes are complex, and these create costs. a hub-and-spoke network lowers productivity and increases variable and fixed costs, but these are not characteristics inherent in the hub-and-spoke design. they are inherent in the way fsas use the hub-and-spoke network to deliver and add value to their product. this is because the processes are complex, even though the complexity is needed for a smaller, more demanding, higher yield set of customers. the redesigning of business processes moves the fsa between cost functions, not simply down its existing cost function, but it will not duplicate the cost advantage of vbas. the network structure drives pricing, fleet and service strategies, and the network structure is ultimately conditional on the size and preferences in the market. what of the future and what factors will affect the evolution of network design and scope? airline markets, with their networks, are continuously evolving. what took place in the us 10 years ago is now occurring in europe. a 'modern' feature of networks is the strategic alliance. alliances between airlines allow them to extend their network and improve their product and service choice, but at a cost. alliances are a feature associated with fsas, not vbas. it may be that as fsas reposition themselves they will make greater use of alliances. vbas, on the other hand, will rely more on interlining to extend their market reach. interlining is made more cost effective with modern technologies, but also with airports having an incentive to offer such services rather than have the airlines provide them. airports as modern businesses will have a more active role in shaping airline networks in the future.
21 american has also reduced its turnaround at spoke cities from 2.5 h previously to approximately 42 min.
22 as a result of smoother traffic flows, american has been operating at dallas/fort worth international airport with nine fewer mainline aircraft and two fewer regional aircraft. at chicago, the improved efficiency has allowed american to take five aircraft off the schedule, three large jets and two american eagle aircraft. american estimates savings of $100 million a year from reduced costs for fuel, facilities and personnel, part of the $2 billion in permanent costs it has trimmed from its expense sheet. the new flight schedule has brought unexpected cost relief at the hubs but also at the many 'spoke' cities served from these major airports. aviation week and space technology, september 2, 2002 and february 18, 2003.
23 interestingly, from an airport perspective the passenger may not spend more total elapsed time but simply more time in the terminal and less time in the airplane. this may provide opportunities for non-aviation revenue strategies.
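as an illustration of the no-show and over-booking point above, the following back-of-the-envelope sketch trades the cost of empty seats (spoilage) against the cost of denied boardings under binomially distributed show-ups; all figures (capacity, show-up probability, fare and bumping cost) are invented for the example and are not taken from this chapter.

# illustrative only: expected spoilage vs. bumping trade-off for a single flight
# all parameter values below are invented for the example
from scipy.stats import binom

CAPACITY = 150       # seats on the aircraft
SHOW_PROB = 0.90     # probability that a booked passenger shows up
FARE = 200.0         # revenue lost per empty seat (spoilage)
BUMP_COST = 600.0    # cost per involuntarily bumped passenger

def expected_cost(bookings: int) -> float:
    """expected cost of empty seats plus denied boardings for a booking level."""
    cost = 0.0
    for shows in range(bookings + 1):
        p = binom.pmf(shows, bookings, SHOW_PROB)
        empty = max(CAPACITY - shows, 0)
        bumped = max(shows - CAPACITY, 0)
        cost += p * (empty * FARE + bumped * BUMP_COST)
    return cost

best = min(range(CAPACITY, CAPACITY + 31), key=expected_cost)
print(best, round(expected_cost(best), 1))   # optimal overbooking level and its cost

reducing no-show variability, for example through the incentives discussed above, shifts the whole cost curve down, which is the sense in which removing this source of complexity removes cost.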
references:
empty cores in airline markets
bundling, integration and the delivered price of air travel: are low-cost carriers full-service competitors
the economics of hubs: the case of monopoly
the evolution of the airline industry
a note on the optimality of airline networks
dealing with predatory conduct in the canadian airline industry: a proposal
the economics of network industries
the authors gratefully acknowledge financial support for travel to this conference, provided by funds from wilfrid laurier university and the sshrc institutional grant awarded to the university.
key: cord-016448-7imgztwe authors: frishman, d.; albrecht, m.; blankenburg, h.; bork, p.; harrington, e. d.; hermjakob, h.; juhl jensen, l.; juan, d. a.; lengauer, t.; pagel, p.; schachter, v.; valencia, a. title: protein-protein interactions: analysis and prediction date: 2009-10-01 journal: modern genome annotation doi: 10.1007/978-3-211-75123-7_17 sha: doc_id: 16448 cord_uid: 7imgztwe
proteins represent the tools and appliances of the cell: they assemble into larger structural elements, catalyze the biochemical reactions of metabolism, transmit signals, move cargo across membrane boundaries and carry out many other tasks. for most of these functions proteins cannot act in isolation but require close cooperation with other proteins to accomplish their task. often, this collaborative action implies physical interaction of the proteins involved. accordingly, experimental detection, in silico prediction and computational analysis of protein-protein interactions (ppi) have attracted great attention in the quest for discovering functional links among proteins and deciphering the complex networks of the cell. proteins do not simply clump together; binding between proteins is a highly specific event involving well defined binding sites. several criteria can be used to further classify interactions (nooren and thornton 2003). protein interactions are not mediated by covalent bonds and, from a chemical perspective, they are always reversible. nevertheless, some ppi are so persistent as to be considered irreversible (obligatory) for all practical purposes. other interactions are subject to tight regulation and only occur under characteristic conditions. depending on their functional role, some protein interactions remain stable for a long time (e.g. between proteins of the cytoskeleton) while others last only fractions of a second (e.g. binding of kinases to their targets). protein complexes formed by physical binding are not restricted to so-called binary interactions which involve exactly two proteins (dimer) but are often found to contain three (trimer), four (tetramer), or more peptide chains.
another distinction can be made based on the number of distinct proteins in a complex: homo-oligomers contain multiple copies of the same protein, while hetero-oligomers consist of different protein species. sophisticated "molecular machines" like the bacterial flagellum consist of a large number of different proteins linked by protein interactions. the focus of this chapter is on the computational methods for analyzing and predicting protein-protein interactions. nevertheless, some basic knowledge about experimental techniques for detecting these interactions is highly useful for interpreting results, estimating potential biases, and judging the quality of the data we use in our work. many different types of methods have been developed, but the vast majority of interactions in the literature and public databases come from only two classes of approaches: co-purification and two-hybrid methods. co-purification methods (rigaut et al. 1999) are carried out in vitro and involve three basic steps. first, the protein of interest is "captured" from a cell lysate, e.g. by attaching it to an immobile matrix. this may be done with specific antibodies, affinity tags, epitope tags along with a matching antibody, or by other means. second, all other proteins in the solution are removed in a washing step in order to purify the captured protein. under suitable conditions, protein-protein interactions are preserved. in the third step, any proteins still attached to the purified protein are detected by suitable methods (e.g. western blot or mass spectrometry). hence, the interaction partners are co-purified, as the name of the method implies. the two-hybrid technique (fields and song 1989) uses a very different approach: it exploits the fact that transcription factors such as gal4 consist of two distinct functional domains. the dna-binding domain (bd) recognizes the transcription factor (tf) binding site in the dna and attaches the protein to it, while the activation domain (ad) triggers transcription of the gene under the control of the factor. when expressed as separate protein chains, both domains remain fully functional: the bd still binds the dna but lacks a way of triggering transcription; the ad could trigger transcription but has no means of binding to the dna. for a two-hybrid test, two proteins x and y are fused to these domains, resulting in two hybrids: x-bd and y-ad. if x binds to y, the resulting protein complex turns out to be a fully functional transcription factor. accordingly, an interaction is revealed by detecting transcription of the reporter gene under the control of the tf. in contrast to co-purifications, the interaction is tested in vivo in the two-hybrid system (usually in yeast, but other systems exist). the above description refers to small-scale experiments testing one pair of proteins at a time, but both approaches have successfully been extended to large-scale experiments testing thousands of pairs in a very short time. while such high-throughput data are very valuable, especially for computational biology, which often requires comprehensive input data, a word of caution is necessary. even with the greatest care and a maximum of thoughtful controls, high-throughput data usually suffer from a certain degree of false-positive results as well as false negatives compared to carefully performed and highly optimized individual experiments.
the ultimate source of information about protein interactions is provided by high-resolution three-dimensional structures of interaction complexes, such as the one shown in fig. 1. spatial architectures obtained by x-ray crystallography or nmr spectroscopy provide atomic-level detail of interaction interfaces and allow for a mechanistic understanding of interaction processes and their functional implications. additional kinetic, dynamic and structural aspects of protein interactions can be elucidated by electron and atomic force microscopy as well as by fluorescence resonance energy transfer.
fig. 1 structural complex between rhoa, a small gtp-binding protein belonging to the ras superfamily, and the catalytic gtpase activating domain of rhogap (graham et al. 2002)
3 protein interaction databases
a huge number of protein-protein interactions have been experimentally determined and described in numerous scientific publications. public protein interaction databases that provide interaction data in the form of structured, machine-readable datasets organized according to well documented standards have become invaluable resources for bioinformatics, systems biology and researchers in experimental laboratories. the data in these databases generally originate from two major sources: large-scale datasets and manually curated information extracted from the scientific literature. as pointed out above, the latter is considered substantially more reliable, and large bodies of manually curated ppi data are often used as the gold standard against which predictions and large-scale experiments are benchmarked. of course, these reference data are far from complete and strongly biased. many factors, including experimental bias, preferences of the scientific community, and perceived biomedical relevance, influence the chance of an interaction to be studied, discovered and published. in the manual annotation process it is not enough to simply record the interaction as such. additional information such as the type of experimental evidence, citations of the source, experimental conditions, and more needs to be stored in order to convey a faithful picture of the data. annotation is a highly labor intensive task carried out by specially trained database curators. ppi databases can be roughly divided in two classes: specialized databases focusing on a single organism or a small set of species, and general repositories which aim for a comprehensive representation of current knowledge. while the former are often well integrated with other information resources for the same organism, the latter strive for collecting all available interaction data including datasets from specialized resources. the size of these databases is growing constantly as more and more protein interactions are identified. as of writing (november 2007), global repositories are approaching 200,000 pieces of evidence for protein interactions in various species. all of these databases offer convenient web interfaces that allow for interactively searching the database. in addition, the full datasets are usually provided for download in order to enable researchers to use the data in their own computational analyses. table 1 gives an overview of some important ppi databases. until relatively recently, molecular interaction databases like the ones listed in table 1 acted largely independently from each other.
while they provided an extremely valuable service to the community in collecting and curating available molecular interaction data from the literature, they did so largely in an uncoordinated manner. each database had its own curation policy, feature set, and data formats. in 2002, the proteomics standards initiative (psi), a work group of the human proteome organization (hupo), set out to improve this situation, with contributions from a broad range of academic and commercial organizations, among them bind, cellzome, dip, glaxosmithkline, hybrigenics sa, intact, mint, mips, serono, and the universities of bielefeld, bordeaux, and cambridge. in a first step, a community standard for the representation of protein-protein interactions was developed, the psi mi format 1.0 (hermjakob et al. 2004). recently, version 2.5 of the psi mi format has been published, extending the scope of the format from protein-protein interactions to molecular interactions in general, allowing, for example, protein-rna complexes to be modelled. the psi mi format is a flexible xml format representing the interaction data to a high level of detail. n-ary interactions (complexes) can be represented, as well as experimental conditions and technologies, quantitative parameters and interacting domains. the xml format is accompanied by detailed controlled vocabularies in obo format (harris et al. 2004). these vocabularies are essential for standardizing not only the syntax, but also the semantics of the molecular interaction representation. as an example, the "yeast two-hybrid technology" described above is referred to in the literature using many different synonyms, for example y2h, 2h, "yeast-two-hybrid", etc. while all of these terms refer to the same technology, filtering interaction data from multiple different databases based on this set of terms is not trivial. thus, the psi mi standard provides a set of now more than 1000 well-defined terms relevant to molecular interactions. figure 2 shows the intact advanced search tool with a branch of the hierarchical psi mi controlled vocabulary. figure 3 provides a partial graphical representation of the annotated xml schema, combined with an example dataset in psi mi xml format, reprinted from kerrien et al. (2007b). for user-friendly distribution of simplified psi data to end users, the psi mi 2.5 standard also defines a simple tabular representation (mitab), derived from the biogrid format (breitkreutz et al. 2003). while this format necessarily excludes details of interaction data like interacting domains, it provides a means to efficiently access large numbers of basic binary interaction records. the psi mi format is now widely implemented, with data available from biogrid, dip, hprd, intact, mint, and mips, among others. visualization tools like cytoscape (shannon et al. 2003) can directly read and visualize psi mi formatted data. comparative and integrative analysis of interaction data from multiple sources has become easier, as has the development of analysis tools, which do not need to provide a plethora of input parsers any more. the annotated psi mi xml schema, a list of tools and databases implementing it, as well as further information, are available from http://www.psidev.info/. however, the development and implementation of a common data format is only one step towards the provision of consistent molecular interaction data to the scientific community.
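as a sketch of how the simple tabular (mitab) representation can be consumed programmatically, the snippet below reads a tab-separated file into an undirected edge list keyed by interactor pair; the three-column layout assumed here (interactor a, interactor b, detection method) is a simplified stand-in for the full mitab 2.5 column set.

# minimal sketch: reading a simplified mitab-like tab-separated file into edges
# the three columns assumed here are a simplified subset of the real mitab 2.5 layout
import csv
from collections import defaultdict

def read_mitab_like(path):
    """return {(protein_a, protein_b): set of detection methods}."""
    edges = defaultdict(set)
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue                          # skip header or comment lines
            a, b, method = row[0], row[1], row[2]
            key = tuple(sorted((a, b)))           # undirected interaction
            edges[key].add(method)
    return edges

# edges = read_mitab_like("interactions.tsv")
# print(len(edges), "distinct interacting pairs")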
another key step is the coordination of the data curation process itself between different molecular interaction databases. without such synchronization, independent databases will often work on the same publications and insert the data into their systems according to different curation rules, thus doing redundant work on some publications while neglecting others. recognizing this issue, the dip, intact, and mint molecular interaction databases are currently synchronizing their curation efforts in the context of the imex consortium (http://imex.sf.net). these databases are now applying the same curation rules to provide a consistent high level of curation quality, and are synchronizing their fields of activity, each focusing on literature curation from a non-overlapping set of scientific journals. for these journals, the databases aim to insert all published interactions into the database shortly after publication. regular exchange of all newly curated data between imex databases is currently in the implementation phase. to support the systematic representation and capture of relevant molecular interaction data supporting scientific publications, the hupo proteomics standards initiative has recently published "the minimum information required for reporting a molecular interaction experiment (mimix)", detailing data items considered essential for the authors to provide, as well as a practical guide to efficient deposition of molecular interaction data in imex databases. the imex databases are also collaborating with scientific journals and funding agencies to increasingly recommend that data producers deposit their data in an imex partner database prior to publication. database deposition prior to publication not only ensures public availability of the data at the time of publication, but also provides important quality control, as database curators often assess the data in much more detail than reviewers. the psi journal collaboration efforts are starting to show first results. nature biotechnology, nature genetics, and proteomics are now recommending that authors deposit molecular interaction data in a relevant public domain database prior to publication, a key step towards a better capture of published molecular interaction data in public databases and towards overcoming the current fragmentation of molecular interaction data. as an example of a molecular interaction database implementing the psi mi 2.5 standard, we will provide a more detailed description of the intact molecular interaction database, accessible at http://www.ebi.ac.uk/intact. intact is a curated molecular interaction database active since 2002. intact follows a full text curation policy: publications are read in full by the curation team, and all molecular interactions contained in the publication are inserted into the database, together with basic facts like the database accession numbers of the proteins participating in an interaction, but also details like experimental protein modifications, which can have an impact on assessments of confidence in the presence or absence of interactions. each database record is cross-checked by a senior curator for quality control. on release of the record, the corresponding author of the publication is automatically notified (where an email address is available) and requested to check the data provided. any corrections are usually inserted into the next weekly release.
while such a detailed, high quality approach is slow and limits coverage, the provision of high quality reference datasets is an essential service both for biological analysis and for the training and validation of automatic methods for computational prediction of molecular interactions. as it is impossible for any single database, or even the collaborating imex databases, to fully cover all published interactions, curation priorities have to be set. any direct data depositions supporting manuscripts approaching peer review have highest priority. next, for some journals (currently cell, cancer cell, and proteomics) intact curates all molecular interactions published in the journal. finally, several special curation topics are determined in collaboration with external communities or collaborators, where intact provides specialized literature curation and collaborates in the analysis of experimental datasets, for example around a specific protein of interest (camargo et al. 2006). as of november 2007, intact contains 158,000 binary interactions supported by ca. 3,000 publications. the intact interface implements a standard "simple search" box, ideal for searches by uniprot protein accession numbers, gene names, species, or pubmed identifiers. the advanced search tool (fig. 2) provides field-specific searches as well as a specialized search taking into account the hierarchical structure of controlled vocabularies. a default search for the interaction detection method "2 hybrid" returns 30,251 interactions, while a search for "2 hybrid" with the tickbox "include children" activated returns more than twice that number, 64,589 interactions. the hierarchical search automatically includes similarly named methods like "two hybrid pooling approach", but also "gal4 vp16 complement". search results are initially shown in a tabular form based on the mitab format, which can also be directly downloaded. each pairwise interaction is only listed once, with all experimental evidence listed in the appropriate columns. the final column provides access to a detailed description of each interaction as well as a graphical representation of the interaction in its interaction neighborhood graph. for interactive, detailed analysis, interaction data can be loaded into tools like cytoscape (see below) via the psi mi 2.5 xml format. all intact data are freely available via the web interface, for download in psi mi tabular or xml format, and computationally accessible via web services. the intact software is open source, implemented in java, with hibernate (www.hibernate.org/) for the object-relational mapping to oracle(tm) or postgres, and freely available under the apache license, version 2, from http://www.ebi.ac.uk/intact. on a global scale, protein-protein interactions participate in the formation of complex biological networks which, to a large extent, represent the paths of communication and metabolism of an organism. these networks can be modeled as graphs, making them amenable to a large number of well established techniques of graph theory and social network analysis. even though interaction networks do not directly encode cellular processes nor provide information on dynamics, they do represent a first step towards a description of cellular processes, which is ultimately dynamic in nature. for instance, protein-interaction networks may provide useful information on the dynamics of complex assembly or signaling.
in general, investigating the topology of protein interaction, metabolic, signaling, and transcriptional networks allows researchers to reveal the fundamental principles of molecular organization of the cell and to interpret genome data in the context of large-scale experiments. such analyses have become an integral part of the genome annotation process: annotating genomes today increasingly means annotating networks. a protein-protein interaction network summarizes the existence of both stable and transient associations between proteins as an (undirected) graph: each protein is represented as a node (or vertex), and an edge between two proteins denotes the existence of an interaction. interactions known to occur in the actual cell (fig. 4a) can thus be represented as an abstract graph of interaction capabilities (fig. 4b). as such a graph is limited by definition to binary interactions, its construction from a database of molecular interactions may involve arbitrary choices. for instance, an n-ary interaction measured by co-purification can be represented using either the clique model (all binary interactions between the n proteins are retained) or the spoke model (only edges connecting the "captured" protein to co-purified proteins are retained). once a network has been reconstructed from protein interaction data, a variety of statistics on network topology can be computed, such as the distribution of vertex degrees, the distribution of the clustering coefficient and other notions of density, the distribution of shortest path lengths between vertex pairs, or the distribution of network motif occurrences (see for a review). these measures can be used to describe networks in a concise manner, to compare, group or contrast different networks, and to identify properties characteristic of a network or a class of networks under study. some topological properties may be interpreted as traces of underlying biological mechanisms, shedding light on their dynamics, their evolution, or both, and helping connect structure to function (see the "network modules" section below). for instance, most interaction networks seem to exhibit scale-free topology (jeong et al. 2001; yook et al. 2004), i.e. their degree distribution (the probability that a node has exactly k links) approximates a power law p(k) ~ k^(-γ), meaning that most proteins have few interaction partners but some, the so-called "hubs", have many. as an example of derived evolutionary insight, it is easy to show that networks evolving by growth (addition of new nodes) and preferential attachment (new nodes are more likely to be connected to nodes with more connections) will exhibit scale-free topology (their degree distribution approximates a power law) and hubs (highly connected nodes). a simple model of interaction network evolution by gene duplication, where a duplicate initially keeps the same interaction partners as the original, generates preferential attachment, thus providing a candidate explanation for the scale-free nature and the existence of hubs in these networks.
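the duplication argument can be illustrated with a toy growth model in which each new node copies a randomly chosen parent's interaction partners; this is a minimal sketch with arbitrary parameters, not the specific model analyzed in the works cited above.

# toy model of network growth by gene duplication: a duplicate inherits each of
# its parent's interaction partners with some probability (parameters are arbitrary)
import random
import networkx as nx

def duplication_growth(n_final: int, seed_size: int = 4, keep_prob: float = 0.7) -> nx.Graph:
    g = nx.complete_graph(seed_size)
    while g.number_of_nodes() < n_final:
        parent = random.choice(list(g.nodes))
        new = g.number_of_nodes()                  # next unused node id
        g.add_node(new)
        for neighbor in list(g.neighbors(parent)):
            if random.random() < keep_prob:        # duplicate inherits this partner
                g.add_edge(new, neighbor)
        g.add_edge(new, parent)                    # keep the duplicate attached to its parent
    return g

g = duplication_growth(2000)
degrees = sorted((d for _, d in g.degree()), reverse=True)
print("largest degrees (hubs):", degrees[:5], "mean degree:", round(sum(degrees) / len(degrees), 2))

because well connected proteins are more likely to be neighbors of a randomly chosen parent, they keep acquiring new partners, which is exactly the preferential attachment effect described above.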
fig. 4 (a) protein interactions occurring in the cell; interacting proteins are denoted as p1, p2, etc. (b) a graph representation of the protein interactions shown in a: each node represents a protein, and each edge connects proteins that interact. (c) information on protein interactions obtained by different methods. (d) protein interaction network derived from experimental evidence shown in c; as in a, each node is a protein, and edges connect interactors. edges are colored according to the source of evidence: red - 3d, green - apms, brown - y2h, magenta - prof, yellow - lit, blue - loc
a corresponding functional interpretation of hubs and scale-free topology has been proposed in terms of robustness. scale-free networks are robust to component failure, as random failures are likely to affect low degree nodes and only failures affecting hub nodes will significantly change the number of connected components and the length of shortest paths between node pairs. deletion analyses have, perhaps unsurprisingly, confirmed that highly connected proteins are more likely to be essential (winzeler et al. 1999; giaever et al. 2002; gerdes et al. 2003). most biological interpretations that have been proposed for purely topological properties of interaction networks have been the subject of heated controversies, some of which remain unsolved to this day (e.g. he and zhang 2006; yu et al. 2007 on hubs). one often cited objection to any strong interpretation is the fact that networks reconstructed from high-throughput interaction data constitute very rough approximations of the "real" network of interactions taking place within the cell. as illustrated in fig. 4c, interaction data used in a reconstruction typically result from several experimental methods, often complemented with prediction schemes. each specific method can miss real interactions (false negatives) and incorrectly identify other interactions (false positives), resulting in biases that are clearly technology-dependent (gavin et al. 2006; legrain and selig 2000). assessing false-negative and false-positive rates is difficult since there is no gold standard for positive interactions (protein pairs that are known to interact) or, more importantly, for negative interactions (protein pairs that are known not to interact). using less-than-ideal benchmark interaction sets, estimates of 30-60% false positives and 40-80% false negatives have been proposed for yeast two-hybrid and co-purification based techniques (aloy and russell 2004). in particular, a comparison of several high-throughput interaction datasets on yeast, showing low overlap, has confirmed that each study covers only a small percentage of the underlying interaction network (von mering et al. 2002) (see also "estimates of the number of protein interactions" below). integration of interaction data from heterogeneous sources towards interaction network reconstruction can help compensate for these limitations. the basic principle is fairly simple and rests implicitly on a multigraph representation: several interaction networks to be integrated, each resulting from a specific experimental or predictive method, are defined over the same set of proteins. integration is achieved by merging them into a single network with several types of links, or edge colors, each drawn from one of the component networks. some edges in the multigraph may be incorrect, while some existing interactions may be missing from the multigraph, but interactions confirmed independently by several methods can be considered reliable. figure 4d shows the multigraph that corresponds to the evidence from fig. 4c and can be used to reconstruct the actual graph in fig. 4b. in practice, integration is not always straightforward: networks are usually defined over subsets of the entire gene or protein complement of a species, and meaningful integration requires that the overlap of these subsets be sufficiently large.
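a minimal sketch of this multigraph-style integration, assuming three toy evidence sources and keeping only interactions supported by at least two of them:

# merging evidence from several methods into a labelled edge set ("colors")
# and retaining edges supported by at least two independent sources (toy data)
from collections import defaultdict

y2h  = {("P1", "P2"), ("P2", "P3")}            # hypothetical two-hybrid edges
apms = {("P1", "P2"), ("P3", "P4")}            # hypothetical co-purification edges
lit  = {("P2", "P3"), ("P1", "P2")}            # hypothetical literature-curated edges
sources = {"y2h": y2h, "apms": apms, "lit": lit}

evidence = defaultdict(set)
for name, edges in sources.items():
    for a, b in edges:
        evidence[tuple(sorted((a, b)))].add(name)   # undirected edge with evidence labels

reliable = {pair for pair, methods in evidence.items() if len(methods) >= 2}
print(reliable)   # {('P1', 'P2'), ('P2', 'P3')}

weighting the sources differently, as in the integrated reliability scoring discussed next, amounts to replacing the simple count with a per-method score.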
in addition, if differences of reliability between network types are to be taken into account, an integrated reliability scoring scheme needs to be designed (jansen et al. 2003; von mering et al. 2007), with the corresponding pitfalls and level of arbitrariness involved in comparing apples and oranges. existing methods can significantly reduce false positive rates on a subset of the network, yielding a subnetwork of high-reliability interactions. the tremendous amounts of available molecular interaction data raise the important issue of how to visualize them in a biologically meaningful way. a variety of tools have been developed to address this problem; two prominent examples are visant (hu et al. 2005) and cytoscape (shannon et al. 2003). a recent review of further network visualization tools is provided by suderman and hallett (2007). in this section, we focus on cytoscape (http://www.cytoscape.org) and demonstrate its use for the investigation of protein-protein interaction networks. for a more extensive protocol on the usage of cytoscape, see cline et al. (2007). cytoscape is a stand-alone java application that is available for all major computer platforms. this software provides functionalities for (i) generating biological networks, either manually or by importing interaction data from various sources, (ii) filtering interactions, (iii) displaying networks using graph layout algorithms, (iv) integrating and displaying additional information like gene expression data, and (v) performing analyses on networks, for instance, by calculating topological network properties or by identifying functional modules. one advantage of cytoscape over alternative visualization software applications is that cytoscape is released under the open-source lesser general public license (lgpl). this license basically permits all forms of software usage and thus helps to build a large user and developer community. third-party java developers can easily enhance the functionality of cytoscape by implementing their own plug-ins, which are additional software modules that can be readily integrated into the cytoscape platform. currently, there are more than forty plug-ins publicly available, with functionalities ranging from interaction retrieval and integration, through topological network analysis, detection of network motifs, protein complexes, and domain interactions, to visualization of subcellular protein localization and bipartite networks. a selection of popular cytoscape plug-ins is listed in table 2. in the following, we will describe the functionalities of cytoscape in greater detail. the initial step of generating a network can be accomplished in different ways. first, the user can import interaction data that are stored in various flat file or xml formats such as biopax, sbml, or psi-mi, as described above. second, the user can directly retrieve interactions from several public repositories from within cytoscape; a number of plug-ins providing this functionality are listed in table 2. third, the user can utilize a text-mining plug-in that builds networks based on associations found in publication abstracts (agilent literature search; table 2). while these associations are not as reliable as experimentally derived interactions, they can be helpful when the user is investigating species that are not yet well covered in the current data repositories. fourth, the user can directly create or manipulate a network by manually adding or removing nodes (genes, proteins, domains, etc.) and edges (interactions or relationships).
in this way, expert knowledge that is not captured in the available data sets can be incorporated into the loaded network. generated networks can be further refined by applying selections and filters in cytoscape. the user can select nodes or edges by simply clicking on them or framing a selection area. in addition, starting with at least one selected node, the user can incrementally enlarge the selection to include all direct neighbor nodes. cytoscape also provides even sophisticated search and filter functionality for selecting particular nodes and edges in a network based on different properties; in particular, the enhanced search plug-in (table 2) improves the built-in search functionality of cytoscape. filters select all network parts that match certain criteria, for instance, all human proteins or all interactions that have been detected using the yeast two-hybrid system. once a selection has been made, all selected parts can be removed from the network or added to another network. the main purpose of visualization tools like cytoscape is the presentation of biological networks in an appropriate manner. this can usually be accomplished by applying graph layout algorithms. sophisticated layouts can assist the user in revealing specific network characteristics such as hub proteins or functionally related protein clusters. cytoscape offers various layout algorithms, which can be categorized as circular, hierarchical, spring-embedded (or force-directed), and attribute-based layouts (fig. 5 ). further layouts can be included using the cytoscape plug-in architecture, for example, to arrange protein nodes according to their subcellular localization or to their pathways assignments (bubblerouter, cerebral; table 2 ). some layouts may be more effective than others for representing molecular networks of a certain type. the spring-embedded layout, for instance, has the effect of exposing the inherent network structure, thus identifying hub proteins and clusters of tightly connected nodes. it is noteworthy that current network visualization techniques have limitations, for example, when displaying extremely large or dense networks. in such cases, a simple graphical network representation with one node for each interaction partner, as it is initially created by cytoscape, can obfuscate the actual network organization due to the sheer number of nodes and edges. one potential solution to this problem is the introduction of meta-nodes (metanode plug-in; table 2 ). a meta-node combines and replaces a group of other nodes. meta-nodes can be collapsed to increase clarity of the visualization and expanded to increase the level of detail (fig. 6 ). an overview of established and novel visualization techniques for biological networks on different scales is presented in (hu et al. 2007 ). all layouts generated by cytoscape are zoomable, enabling the user to increase or decrease the magnification, and they can be further customized by aligning, scaling, or rotating selected network parts. additionally, the user can define the graphical network representation through visual styles. these styles define the colors, sizes, and shapes of all network parts. a powerful feature of cytoscape is its ability of visually mapping additional attribute values onto network representations. 
both nodes and edges can have arbitrary attributes, for example, protein function names, the number of interactions (node degree), expression values, the strength and type of an interaction, or confidence values for interaction reliability. these attributes can be used to adapt the network illustration by dynamically changing the visual styles of individual network parts (fig. 7). for example, this feature enables the user to highlight trustworthy interactions by assigning different line styles or sizes to different experiment types (discrete mapping of an edge attribute), to spot network hubs by changing the size of a node according to its degree (discrete or continuous mapping of a node attribute), or to identify functional network patterns by coloring protein nodes with a color gradient according to their expression level (continuous mapping of a node attribute). hence, it is possible to simultaneously visualize different data types by overlaying them with a network model.
fig. 6 meta-nodes created with the metanode plug-in (table 2): all protein nodes with subcellular localizations different from plasma membrane are combined into meta-nodes; these meta-nodes can be collapsed or expanded to increase clarity or the level of detail, respectively
in order to generate new biological hypotheses and to gain insights into molecular mechanisms, it is important to identify relevant network characteristics and patterns. for this purpose, the straightforward approach is the visual exploration of the network. table 2 lists a selection of cytoscape plug-ins that assist the user in this analysis task, for instance, by identifying putative complexes (mcode), by grouping proteins that show a similar expression profile (jactivemodules), or by identifying overrepresented go terms (bingo, golorize). however, the inclusion of complex data such as time-series results or diverse gene ontology (go) terms into the network visualization might not be feasible without further software support. particularly in the case of huge, highly connected, or dynamic networks, more advanced visualization techniques will be required in the future.
fig. 7 visual representation of a subset of the gal4 network in yeast. the protein nodes are colored with a red-to-green gradient according to their expression value; green represents the lowest, red the highest value, and blue a missing value. the node size indicates the number of interactions (node degree); the larger a node, the higher is its degree. the colors and styles of the edges represent different interaction types; solid black lines represent protein-protein, dashed red lines protein-dna interactions
in addition to the visual presentation of interaction networks, cytoscape can also be used to perform statistical analyses. for instance, the networkanalyzer plug-in (assenov et al. 2008) computes a large variety of topology parameters for all types of networks. the computed simple and complex topology parameters are represented as single values and distributions, respectively. examples of simple parameters are the number of nodes and edges, the average number of neighbors, the network diameter and radius, the clustering coefficient, and the characteristic path length. complex parameters are distributions of node degrees, neighborhood connectivities, average clustering coefficients, and shortest path lengths. these computed statistical results can be exported in textual or graphical form and are additionally stored as node attributes.
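for readers who want to reproduce such topology parameters outside of cytoscape, the sketch below computes a few of the quantities listed above with networkx on a placeholder random graph; it illustrates the parameters themselves and is not the networkanalyzer implementation.

# computing a few standard topology parameters with networkx (placeholder graph;
# a real ppi network would be loaded from an interaction file instead)
import networkx as nx

g = nx.gnm_random_graph(200, 600, seed=1)
print("nodes/edges:", g.number_of_nodes(), g.number_of_edges())
print("average clustering coefficient:", round(nx.average_clustering(g), 3))

giant = g.subgraph(max(nx.connected_components(g), key=len))   # largest connected component
print("characteristic path length:", round(nx.average_shortest_path_length(giant), 3))
print("diameter:", nx.diameter(giant))

degrees = sorted((d for _, d in g.degree()), reverse=True)
print("top-5 node degrees (hubs):", degrees[:5])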
the user can then apply the calculated attributes to select certain network parts or to map them onto the visual representation of the analyzed network as described above (fig. 7). it is also possible to fit a power law to the node degree distribution, which can frequently indicate a so-called scale-free network with few highly connected nodes (hubs) and many other nodes with a small number of interactions. scale-free networks are especially robust against failures of randomly selected nodes, but quite vulnerable to defects of hubs (albert 2005). how many ppis exist in a living cell? the yeast genome encodes approximately 6300 gene products, which means that the maximal possible number of interacting protein pairs in this organism is close to 40 million, but what part of these potential interactions is actually realized in nature? for a given experimental method, such as the two-hybrid assay, the estimate of the total number of interactions in the cell is given by

n_total = n_measured × (1 − r_fp) / (1 − r_fn),

where n_measured is the number of interactions identified in the experiment, and r_fp and r_fn are the false positive and false negative rates of the method. r_fn can be roughly estimated based on the number of interactions known with confidence (e.g., those confirmed by three-dimensional structures) that are recovered by the method. assessing r_fp is much more difficult because no experimental information on proteins that do not interact is currently available. since it is known that proteins belonging to the same functional class often interact, one very indirect way of calculating r_fn is as the fraction of functionally related proteins not found to be interacting. an even more monumental problem is the estimation of the total number of unique, structurally equivalent interaction types existing in nature. an interaction type is defined as a particular mutual orientation of two specific interacting domains. in some cases homologous proteins interact in a significantly different fashion, while in other cases proteins lacking sequence similarity engage in interactions of the same type. in general, however, interacting protein pairs sharing a high degree of sequence similarity (30-40% or higher) between their respective components almost always form structurally similar complexes (aloy et al. 2003). this observation allows utilization of available atomic resolution structures of complexes for building useful models of closely related binary complexes. the total number of interaction types can then be estimated as

n_types = n_total × c × e_all-species,

where the interaction similarity multiplier c reflects the clustering of all interactions of the same type, and e_all-species extrapolates from one biological species to all organisms. aloy and russell (2004) derived an estimate for c by grouping interactions between proteins that share high sequence similarity, as discussed above. c depends on the number of paralogous sequences encoded in a given genome. for small prokaryotic organisms it is close to 1, while for larger and more redundant genomes it adopts smaller values, typically in the range of 0.75-0.85. the multiplier for all species, e_all-species, can be derived by assessing what fraction of known protein families is encoded in a given genome. based on the currently available data this factor is close to 10 for bacteria, which means that a medium size prokaryotic organism contains around one tenth of all protein families. for eukaryotic organisms e_all-species lies between 2 and 4.
for the comprehensive two-hybrid screen of yeast (uetz 2000), in which 936 interactions between 987 proteins were identified, aloy and russell (2004) estimated c, r_fp, r_fn, and e_all-species to be 0.85, 3.92, 0.55, and 3.35, respectively, leading to an estimated 1715 different interaction types in yeast alone, and 5741 over all species. based on the two-hybrid interaction map of the fly (giot 2003), the number of all interaction types in nature is estimated to be 9962. it is thus reasonable to expect the total number of interaction types to be around 10,000, of which only about 2000 are currently known. beyond binary interactions, proteins often form large molecular complexes involving multiple subunits (fig. 8). these complexes are much more than a random snapshot of a group of interacting proteins; they represent large functional entities which remain stable for long periods of time. many such protein complexes have been elucidated step by step over time, and recent advances in high-throughput technology have led to large-scale studies revealing numerous new protein complexes. the preferred technology for this kind of experiment is initial co-purification of the complexes followed by the identification of the member proteins by mass spectrometry. as the baker's yeast s. cerevisiae is one of the most versatile model organisms used in molecular biology, it is not surprising that the first large-scale complex datasets were obtained in this species (gavin et al. 2002; ho et al. 2002; gavin et al. 2006; krogan et al. 2006). the yeast protein interaction database mpact (guldener et al. 2006) provides access to 268 protein complexes, based on careful literature annotation, composed of 1237 different proteins, plus over 1000 complexes from large-scale experiments which contain more than 2000 distinct proteins. these numbers contain some redundancy with respect to complexes, due to slightly different complex compositions found by different groups or experiments. nevertheless, the dataset covers about 40% of the s. cerevisiae proteome. while many complexes comprise only a small number of different proteins, the largest of them features an impressive 88 different protein species. a novel manually annotated database, corum (ruepp et al. 2008), contains literature-derived information about 1750 mammalian multi-protein complexes. over 75% of all complexes contain between three and six subunits, while the largest molecular structure, the spliceosome, consists of 145 components (fig. 9). modularity has emerged as one of the major organizational principles of cellular processes. functional modules are defined as molecular ensembles with an autonomous function (hartwell et al. 1999). proteins or genes can be partitioned into modules based on shared patterns of regulation or expression, involvement in a common metabolic or regulatory pathway, or membership in the same protein complex or subcellular structure. modular representation and analysis of cellular processes allows for interpretation of genome data beyond single gene behavior. in particular, analysis of modules provides a convenient framework for studying the evolution of living systems (snel and huynen 2004). multiprotein complexes represent one particular type of functional module in which individual components engage in physical interactions to execute a specific cellular function.
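reading the values quoted above for the uetz screen as multiplicative corrections (roughly 1/(1 − r_fn) ≈ 3.92 and 1 − r_fp ≈ 0.55) rather than as raw rates, the quoted totals can be reproduced with a few lines of arithmetic; this is an interpretation of the published numbers, not a calculation reproduced from aloy and russell (2004).

# reproducing the quoted yeast / all-species interaction-type totals, treating the
# quoted values as multiplicative correction factors (an interpretation, see above)
n_measured    = 936    # interactions reported by the yeast two-hybrid screen
fn_correction = 3.92   # ~1 / (1 - r_fn): scale up for interactions the method missed
fp_correction = 0.55   # ~(1 - r_fp): scale down for spurious interactions
c             = 0.85   # clustering of interactions into structural interaction types
e_all_species = 3.35   # extrapolation from yeast to all organisms

types_yeast = n_measured * fn_correction * fp_correction * c
print(round(types_yeast))                    # ~1715 interaction types in yeast
print(round(types_yeast * e_all_species))    # ~5746, close to the quoted 5741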
algorithmically, modular architectures can be defined as densely interconnected groups of nodes on biological networks (for an excellent review of available methods see sharan et al. 2007). statistically significant functional subnetworks are characterized by a high degree of local clustering. the density of a cluster can be represented as a function q(m,n) = 2m/(n(n − 1)), where m is the number of interactions between the n nodes of the cluster (spirin and mirny 2003). q thus takes values between 0 for a set of unconnected nodes and 1 for a fully connected cluster (clique). the statistical significance of q strongly depends on the size of the graph. it is obvious that random clusters with q = 1 involving just three proteins are very likely, while large clusters with q = 1, or even with values below 0.5, are extremely unlikely. in order to compute the statistical significance of a cluster with n nodes and m connections, spirin and mirny calculate the expected number of such clusters in a comparable random graph and then estimate the likelihood of having m or more interactions within a given set of n proteins, given the number of interactions that each of these proteins has. significant dense clusters identified by this procedure on a graph of protein interactions were found to correspond to functional modules, most of which are involved in transcription regulation, cell-cycle/cell-fate control, rna processing, and protein transport. however, not all of them constitute physical protein complexes and, in general, it is not possible to predict whether a given module corresponds to a multiprotein complex or just to a group of functionally coupled proteins involved in the same cellular process. the search for significant subgraphs can be further enhanced by considering evolutionary conservation of protein interactions. with this approach, protein complexes are predicted from binary interaction data by network alignment, which involves comparing interaction graphs between several species (sharan et al. 2005). first, proteins are grouped by sequence similarity such that each group contains one protein from each species, and each protein is similar to at least one other protein in the group. then a composite interaction network is created by joining with edges those pairs of groups that are linked by at least one conserved interaction. again, dense clusters on such a network alignment graph are often indicative of multiprotein complexes. an alternative computational method for deriving complexes from noisy large-scale interaction data relies on a "socio-affinity" index which essentially reflects the frequency with which proteins form partnerships detected by co-purification (gavin et al. 2006). this index was shown to correlate well with available three-dimensional structure data, dissociation constants of protein-protein interactions, and binary interactions identified by the two-hybrid technique. by applying a clustering procedure to a matrix containing the values of the socio-affinity index for all yeast protein pairs found to associate by affinity purification, 491 complexes were predicted, with over a half of them being novel and previously unknown. however, depending on the analysis parameters, distinct complex variants (isoforms) are found that differ from one another in terms of their subunit composition. those proteins present in most of the isoforms of a given complex constitute its core, while variable components present only in a small number of isoforms can be considered "attachments" (fig. 10).
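the density measure q(m, n) introduced at the beginning of this section can be computed directly from an interaction graph; the sketch below scores small candidate node sets on a toy graph and is not the significance-testing procedure of spirin and mirny (2003).

# scoring candidate clusters by the density q(m, n) = 2m / (n(n - 1)) on a toy graph
import itertools
import networkx as nx

def cluster_density(graph: nx.Graph, nodes) -> float:
    n = len(nodes)
    if n < 2:
        return 0.0
    m = graph.subgraph(nodes).number_of_edges()   # interactions within the candidate set
    return 2.0 * m / (n * (n - 1))

g = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E")])
for size in (3, 4):
    for nodes in itertools.combinations(g.nodes, size):
        q = cluster_density(g, nodes)
        if q >= 0.8:
            print(nodes, round(q, 2))   # only the fully connected triangle a-b-c qualifies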
furthermore, some stable, typically smaller protein groups can be found in multiple attachments, in which case they are called "modules" (fig. 10: definitions of complex cores, attachments, and modules; redrawn and modified with permission from gavin et al. 2006). stable functional modules can thus be flexibly used in the cell in a variety of functional contexts. proteins frequently associated with each other in complex cores and modules are likely to be co-expressed and co-localized. in this section, we offer a computational perspective on utilizing protein network data for molecular medical research. the identification of novel therapeutic targets for diseases and the development of drugs has always been a difficult, time-consuming and expensive venture (ruffner et al. 2007). recent work has charted the current pharmacological space using different networks of drugs and their protein targets (paolini et al. 2006; keiser et al. 2007; kuhn et al. 2008; yildirim et al. 2007) based on biochemical relationships like ligand binding energy and molecular similarity or on shared disease association. above all, since many diseases are due to the malfunctioning of proteins, the systematic determination and exploration of the human interactome and homologous protein networks of model organisms can provide considerable new insight into pathophysiological processes (giallourakis et al. 2005). knowledge of protein interactions can frequently improve the understanding of relevant molecular pathways and the interplay of various proteins in complex diseases (fishman and porter 2005). this approach may result in the discovery of a considerable number of novel drug targets for the biopharmaceutical industry, possibly affording the development of multi-target combination therapeutics. observed perturbations of protein networks may also offer a refined molecular description of the etiology and progression of disease, in contrast to the phenotypic categorization of patients (loscalzo et al. 2007). molecular network data may help to improve the ability to catalog disease unequivocally and to further individualize diagnosis, prognosis, prevention, and therapy. this will require a network-based approach that includes not only protein interactions to differentiate pathophenotypes, but also other types of molecular interactions as found in signaling cascades and metabolic pathways. furthermore, environmental factors like pathogens interacting with the human host or the effects of nutrition need to be taken into account. after large-scale screens identified enormous amounts of protein interactions in organisms like yeast, fly, and worm (goll and uetz 2007), which also serve as model systems for studying many human disease mechanisms (giallourakis et al. 2005), experimental techniques and computational prediction methods have recently been applied to generate sizable networks of human proteins (cusick et al. 2005; stelzl and wanker 2006; assenov et al. 2008; ramírez et al. 2007). in addition, comprehensive maps of protein interactions inside pathogens and between pathogens and the human host have been compiled for bacteria like e. coli, h. pylori, c. jejuni, and other species (noirot and noirot-gros 2004), for many viruses such as herpes viruses, the epstein-barr virus, the sars coronavirus, hiv-1, the hepatitis c virus, and others (uetz et al. 2004), and for the malaria parasite p. falciparum (table 3).
those extensive network maps can now be explored to identify potential drug targets and to block or manipulate important protein-protein interactions. furthermore, different experimental methods are also used to expand the known interaction networks around pathway-centric proteins like epidermal growth factor receptors (egfrs) (tewari et al. 2004; oda et al. 2005; jones et al. 2006) , smad and transforming growth factor-b (tgfb) (colland and daviet 2004; tewari et al. 2004; barrios-rodiles et al. 2005) , and tumor necrosis factor-a (tnfa) and the transcription factor nf-kb (bouwmeester et al. 2004 ). all of these proteins are involved in sophisticated signal transduction cascades implicated in various important disease indications ranging from cancer to inflammation. the immune system and toll-like receptor (tlr) pathways were the subject of other detailed studies (oda and kitano 2006) . apart from that, protein networks for longevity were assembled to research ageing-related effects (xue et al. 2007 ). high-throughput screens are also conducted for specific disease proteins causative of closely related clinical and pathological phenotypes to unveil molecular interconnections between the diseases. for example, similar neurodegenerative disease phenotypes are caused by polyglutamine proteins like huntingtin and over twenty ataxins. although they that are not evolutionarily related and their expression is not restricted to the brain, they are responsible for inherited neurotoxicity and age-dependent dementia only in specific neuron populations (ralser et al. 2005) . yeast two-hybrid screens revealed an unexpectedly dense interaction network of those disease proteins forming interconnected subnetworks (fig. 11) , which suggests common pathways affected in disease (goehler et al. 2004; lim et al. 2006) . some of the protein-protein interactions may be involved in mediating neurodegeneration and thus may be tractable for drug inhibition, and several interaction partners of ataxins could additionally be shown to be potential disease modifiers in a fly model (kaltenbach et al. 2007) . a number of methodological approaches concentrate on deriving correlations between common topological properties and biological function from subnetworks around proteins that are associated with a particular disease phenotype like cancer. recent studies report that human disease-associated proteins with similar clinical and pathological features tend to be more highly connected among each other than with other proteins and to have more similar transcription profiles xu and li 2006; goh et al. 2007 ). this observation points to the existence of disease-associated functional modules. interestingly, in contrast to disease genes, essential genes whose defect may be lethal early on in life are frequently found to be hubs central to the network. further work focused on specific disease-relevant networks. for instance, to analyze experimental asthma, differentially expressed genes were mapped onto a protein interaction network ). here, highly connected nodes tended to have smaller expression changes than peripheral nodes. this agrees with the general notion that disease-causing genes are typically not central in the network. similarly, a comprehensive protein network analysis of systemic inflammation in human subjects investigated blood leukocyte gene expression patterns when receiving an inflammatory stimulus, a bacterial endotoxin, to identify functional modules perturbed in response to this stimulus (calvano et al. 
2005) . topological criteria and gene expression data were also used to search protein networks for functional modules that are relevant to type 2 diabetes mellitus or to different types of cancer (jonsson and bates 2006; cui et al. 2007; lin et al. 2007; pujana et al. 2007 ). moreover, it was recently demonstrated that the integration of gene expression profiles with subnetworks of interacting proteins can lead to improved prognostic markers for breast cancer outcome that are more reproducible between patient cohorts than sets of individual genes selected without network information (chuang et al. 2007 ). in drug discovery, protein networks can help to design selective inhibitors of protein-protein interactions which target specific interactions of a protein, but do not affect others (wells and mcclendon 2007) . for example, a highly connected protein (hub) may be a suitable target for an antibiotic whereas a more peripheral protein with few interaction partners may be more appropriate for a highly specific drug that needs to avoid side effects. thus, topological network criteria are not only useful for characterizing disease proteins, but also for finding drug targets. the diversity of interactions of a targeted protein could also help in predicting potential side effects of a drug. apart from that, it is remarkable that some potential drugs have been found to be less effective than expected due to the intrinsic robustness of living systems against perturbations of molecular interactions (kitano 2007) . furthermore, mutations in proteins cause genetic diseases, but it is not always easy to distinguish protein interactions impaired by mutated binding sites from other disease causes like structural instability induced by amino acid mutations. nowadays many genome-wide association and linkage studies for human diseases suggest genomic loci and linkage intervals that contain candidate genes encoding snps and mutations of potential disease proteins (kann 2007) . since the resultant list of candidates frequently contain dozens or even hundreds of genes, computational approaches have been developed to prioritize them for further analyses and experiments. in the following, we will demonstrate the variety of available prioritization approaches by explicating three recent methods that utilize protein interaction data in addition to the inclusion of other sequence and function information. all methods capitalize on the above described observation that closely interacting gene products often underlie polygenic diseases and similar pathophenotypes (oti and brunner 2007) . using protein-protein interaction data annotated with reliability values, lage et al. (2007) first predict human protein complexes for each candidate protein. they then score the pairwise phenotypic similarity of the candidate disease with all proteins within each complex that are associated with any disease. the scoring function basically measures the overlap of the respective disease phenotypes as recorded in text entries of omim (online mendelian inheritance in man) (hamosh et al. 2005 ) based on the vocabulary of umls (unified medical language system) (bodenreider 2004) . lastly, all candidates are prioritized by the probability returned by a bayesian predictor trained on the interaction data and phenotypic similarity. therefore, this method depends on the premise that the phenotypic effects caused by any disease-affected member in a predicted protein complex are very similar to each other. 
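the scoring idea behind this kind of prioritization can be sketched in a few lines; lage et al. use a trained bayesian predictor over omim/umls phenotype records, whereas the simplified version below only takes the best jaccard overlap of hypothetical phenotype term sets across the diseases linked to a candidate's predicted complex partners:

def phenotype_overlap(terms_a, terms_b):
    # simple jaccard overlap of two sets of phenotype vocabulary terms
    terms_a, terms_b = set(terms_a), set(terms_b)
    if not terms_a or not terms_b:
        return 0.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)

def score_candidate(candidate, complex_members, protein_to_diseases,
                    disease_phenotypes, query_phenotypes):
    # score a candidate gene by the best phenotype overlap between the query
    # disease and any disease associated with a member of the candidate's
    # predicted complex (the candidate itself is excluded)
    best = 0.0
    for member in complex_members:
        if member == candidate:
            continue
        for disease in protein_to_diseases.get(member, ()):
            best = max(best, phenotype_overlap(query_phenotypes,
                                               disease_phenotypes[disease]))
    return best

# toy annotation tables (hypothetical identifiers, not real omim/umls data)
protein_to_diseases = {"p2": ["d1"], "p3": ["d2"]}
disease_phenotypes = {"d1": {"ataxia", "dementia"}, "d2": {"anemia"}}
print(score_candidate("p1", ["p1", "p2", "p3"], protein_to_diseases,
                      disease_phenotypes, {"ataxia", "tremor"}))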
another prioritization approach by franke et al. (2006) does not make use of overlapping disease phenotypes and primarily aims at connecting physically disjoint genomic loci associated with the same disease using molecular networks. at the beginning, their method prioritizer performs a bayesian integration of three different network types of gene/protein relationships. the latter are derived from functional similarity using gene ontology annotation, microarray coexpression, and protein-protein interaction. this results in a probabilistic human network of general functional links between genes. prioritizer then assesses which candidate genes contained in different disease loci are closely connected in this gene-gene network. to this end, the score of each candidate is initially set to zero, but it is increased iteratively during network exploration by a scoring function that depends on the network distance of the respective candidate gene to candidates inside other genomic loci. this procedure finally yields separate prioritization lists of ranked candidate genes for each genomic locus. in contrast to the integrated gene-gene network used by prioritizer, the endeavour system (aerts et al. 2006) directly compares candidate genes with known disease genes and creates different ranking lists of all candidates using various sources of evidence for annotated relationships between genes or proteins. the evidence can be derived from literature mining, functional associations based on gene ontology annotations, co-occurrence of transcriptional motifs, correlation of expression data, sequence similarity, common protein domains, shared metabolic pathway membership, and protein-protein interactions. at the end, endeavour merges the resultant ranking lists using order statistics and computes an overall prioritization list of all candidate genes. finally, it is important to keep in mind that current datasets of human protein interactions may still contain a significant number of false interactions, and thus biological and medical conclusions derived from them should always be taken with a note of caution, in particular if no good confidence measures are available. a comprehensive atlas of protein interactions is fundamental for a better understanding of the overall dynamic functioning of living organisms. these insights arise from the integration of functional information, dynamic data and protein interaction networks. in order to fulfill the goal of enlarging our view of the protein interaction network, several approaches must be combined and a crosstalk must be established between experimental and computational methods. this has become clear from comparative evaluations, which show similar performances for both types of methodologies. in fact, over recent years this field has grown into one of the most appealing fields in bioinformatics. evolutionary signals result from restrictions imposed by the need to optimize the features that affect a given interaction, and the nature of these features can differ from interaction to interaction. consequently, a number of different methods have been developed based on a range of different evolutionary signals. this section is devoted to a brief review of some of these methods. these techniques are based on the similarity of absence/presence profiles of interacting proteins. in its original formulation (gaasterland and ragan 1998; huynen and bork 1998; pellegrini et al. 1999; marcotte et al.
1999a) the phylogenetic profiles were codified as 0/1 vectors for each reference protein according to the absence/presence of proteins of the studied family in a set of fully sequenced organisms (see fig. 12a). the vectors for different reference sequences are compared by using the hamming distance (pellegrini et al. 1999) between vectors. this measure counts the number of differences between two binary vectors. the rationale for this method is that both interacting proteins must be present in an organism and that reductive evolution will remove unpaired proteins in the rest of the organisms. proposed improvements include the inclusion of quantitative measures of sequence divergence (marcotte et al. 1999b; date and marcotte 2003) and the ability to deal with biases in the taxonomic distribution of the organisms used (date and marcotte 2003; barker and pagel 2005). these biases are due to the intuitive fact that evolutionarily similar organisms will share a higher number of protein and genomic features (in this case the presence/absence of an orthologue). to reduce this problem, date et al. used the mutual information between sequence-divergence profiles to measure the amount of information shared by both vectors. mutual information is calculated as mi(p1,p2) = h(p1) + h(p2) − h(p1,p2), where h(p1) = −∑ p(p1) ln p(p1) is the marginal entropy of the probability distribution of protein p1 sequence distances and h(p1,p2) = −∑∑ p(p1,p2) ln p(p1,p2) is the joint entropy of the probability distributions of both protein p1 and p2 sequence distances. the corresponding probabilities are calculated from the whole distribution of orthologue distances for the organisms. in this way, the most likely evolutionary distances between orthologues from a pair of organisms will produce smaller entropies and consequently smaller values of mutual information. this formulation should implicitly reduce the effect of taxonomic biases. in an interesting recent work, barker et al. (2007) showed that detection of correlated gene-gain/gene-loss events improves the predictions by reducing the number of false positives due to taxonomic biases. the phylogenetic profiling approach has been shown to be quite powerful, because its simple formulation has allowed the exploration of a number of alternative interdependencies between proteins. this is the case for enzyme "displacement" in metabolic pathways detected as anti-correlated profiles (morett et al. 2003), and for complex dependence relations among triplets of proteins (bowers et al. 2004). phylogenetic profiles have also been correlated with bacterial traits to predict the genes related to particular phenotypes (korbel et al. 2005). the main drawbacks of these methods are the difficulty of dealing with essential proteins (where there is no absence information) and the requirement for the genomes under study to be complete (to establish the absence of a family member).
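to make the profile comparison concrete, the following minimal python sketch computes the hamming distance between two 0/1 profiles and the mutual information between (binned) profile values; it uses toy profiles rather than real genome data, and the continuous sequence-divergence profiles of date and marcotte would first have to be discretized:

from collections import Counter
from math import log

def hamming(profile_a, profile_b):
    # number of organisms in which exactly one of the two proteins is present
    return sum(a != b for a, b in zip(profile_a, profile_b))

def mutual_information(profile_a, profile_b):
    # mi(p1,p2) = h(p1) + h(p2) - h(p1,p2) over the (binned) profile values
    def entropy(counts, n):
        return -sum(c / n * log(c / n) for c in counts.values())
    n = len(profile_a)
    h_a = entropy(Counter(profile_a), n)
    h_b = entropy(Counter(profile_b), n)
    h_ab = entropy(Counter(zip(profile_a, profile_b)), n)
    return h_a + h_b - h_ab

# toy 0/1 presence/absence profiles over eight fully sequenced organisms
p1 = (1, 1, 0, 1, 0, 0, 1, 1)
p2 = (1, 1, 0, 1, 0, 1, 1, 1)
print(hamming(p1, p2), round(mutual_information(p1, p2), 3))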
fig. 12 prediction of protein interactions based on genomic and sequence features. information coming from the set of close homologs of the proteins p1 and p2 from organism 1 in other organisms can be used to predict an interaction between these proteins. (a) phylogenetic profiling: the presence/absence of a homolog of both proteins in different organisms is coded as two corresponding 1/0 profiles (simplest approach) and an interaction is predicted for very similar profiles. (b) similarity of phylogenetic trees: multiple sequence alignments are built for both sets of proteins and phylogenetic trees are derived from the proteins with a possible partner present in their organism; proteins with highly similar trees are predicted to interact. (c) gene neighbourhood conservation: genomic closeness is checked for those genes coding for both sets of homologous proteins; an interaction is predicted if gene pairs are recurrently close to each other in a number of organisms. (d) gene fusion: finding proteins that contain different sequence regions homologous to each of the two proteins is used to predict an interaction between them.
similarity in the topology of phylogenetic trees of interacting proteins has been qualitatively observed in a number of cases (fryxell 1996; pages et al. 1997; goh et al. 2000). the extension of this observation to a quantitative method for the prediction of protein interactions requires measuring the correlation between the similarity matrices of the explored pairs of protein families (goh et al. 2000). this formulation allows systematic evaluation of the validity of using the original observation as a signal of protein interaction (pazos and valencia 2001). the general protocol for these methods is illustrated in fig. 12b. it includes the building of the multiple sequence alignment for the set of orthologues (one per organism) related to every query sequence, the calculation of all protein pair evolutionary distances (derived from the corresponding phylogenetic trees) and finally the comparison of the evolutionary distance matrices of pairs of query proteins using pearson's correlation coefficient. protein pairs with highly correlated distance matrices are predicted to be more likely to interact. although this signal has been shown to be significant, the underlying process responsible for this similarity is still controversial (chen and dokholyan 2006). there are two main hypotheses for explaining this phenomenon. the first hypothesis suggests that this evolutionary similarity comes from the mutual adaptation (co-evolution) of interacting proteins and the need to retain interaction features while sequences diverge. the second hypothesis implicates external factors. in this scenario, the restrictions imposed by evolution on the functional process implicating both proteins would be responsible for the parallelism of their phylogenetic trees. although the relative importance of both factors is still not clear, the predictive power of similarities in phylogenetic trees is not affected. indeed, a number of developments have improved the original formulation (pazos et al. 2005; sato et al. 2005). the first advance involved managing the intrinsic similarity of the trees because of the common underlying taxonomic distribution (due to the speciation processes). this effect is analogous to the taxonomic biases discussed above. in these cases, the approach followed was to correct both trees by removing this common trend. for example, pazos et al. subtracted the distances of the 16s rrna phylogenetic tree from the corresponding distances for each protein tree. the correlations of the resulting distance matrices were used to predict protein interactions. additionally, some analyses have focused on the selection of the sequence regions used for the tree building (jothi et al. 2006; kann et al. 2007).
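the core of this protocol reduces to correlating the upper triangles of two evolutionary distance matrices; a minimal sketch, using toy matrices and omitting the tree corrections discussed above and the region-selection refinements exemplified next, could look as follows:

from math import sqrt

def upper_triangle(matrix):
    # flatten the pairwise distances above the diagonal into one vector
    n = len(matrix)
    return [matrix[p][q] for p in range(n) for q in range(p + 1, n)]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def mirror_tree_score(dist_a, dist_b):
    # correlation between the evolutionary distance matrices of two protein
    # families (same set and order of organisms in both matrices)
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))

# toy 4x4 symmetric distance matrices for orthologues in four organisms
d1 = [[0, 2, 4, 6], [2, 0, 3, 5], [4, 3, 0, 2], [6, 5, 2, 0]]
d2 = [[0, 1, 5, 7], [1, 0, 4, 6], [5, 4, 0, 3], [7, 6, 3, 0]]
print(round(mirror_tree_score(d1, d2), 3))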
for example, it has been shown that interacting regions, both defined as interacting residues (using structural data) and as the sequence domain involved in the interaction, show more clear tree similarities than the whole proteins (mintseris and weng 2005; jothi et al. 2006) . other interesting work showed that prediction performance can be improved by removing poorly conserved sequence regions ). finally, in a very recent work (juan et al. 2008 ) the authors have suggested a new method for removing noise in the detection of tree similarity signals and detecting different levels of evolutionary parallelism specificity. this method introduces the new strategy of using the global network of protein evolutionary similarity for a better calibration of the evolutionary parallelism between two proteins. for this purpose, they define a protein co-evolutionary profile as the vector containing the evolutionary correlations between a given protein tree and all the rest of the protein trees derived from sequences in the same organism. this co-evolutionary profile is a more robust and comparable representation of the evolution of a given protein (it involves hundreds of distances) and can be used to deploy a new level of evolutionary comparison. the authors compare these co-evolutionary profiles by calculating pearsons correlation coefficient for each pair. in this way, the method detects pairs of proteins for which high evolutionary similarities are supported by their similarities with the rest of proteins of the organism. this approach significantly improves the predictive performance of the tree similaritybased methods so that different degrees of co-evolutionary specificity are obtained according to the number of proteins that might be influencing the co-evolution of the studied pair. this is done by extending the approach of sato et al. (2006) , that uses partial correlations and a reduced set of proteins for determining specific evolutionary similarities. juan et al. calculated the partial correlation for each significant evolutionary similarity with respect to the remaining proteins in the organism and defined levels of co-evolutionary specificity according to the number of proteins that are considered to be co-evolving with each studied protein pair. with this strategy, its possible to detect a range of evolutionary parallelisms from the protein pairs (for very specific similarities) up to subsets of proteins (for more relaxed specificities) that are highly evolution dependent. interestingly, if specificity requirements are relaxed, protein relationships among components of macro-molecular complexes and proteins involved in the same metabolic process can be recovered. this can be considered as a first step in the application of higher orders of evolutionary parallelisms to decode the evolutionary impositions over the protein interaction network. this method exploits the well-known tendency of bacterial organisms to organize proteins involved in the same biochemical process by clustering them in the genome. this observation is obviously related to the operon concept and the mechanisms for the coordination of transcription regulation of the genes present in these modules. these mechanisms are widespread among bacterial genomes. therefore the significance of a given gene proximity can be established by its conservation in evolutionary distant species (dandekar et al. 1998; overbeek et al. 1999) . 
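a minimal sketch of the neighbourhood test follows; the gene coordinates are hypothetical, the 300-base threshold and the strand requirement anticipate the details given in the next paragraph, and no check of the evolutionary divergence of the genomes is made here:

def same_context(gene_a, gene_b, max_gap=300):
    # same strand and intergenic distance below the threshold (300 bases)
    if gene_a["strand"] != gene_b["strand"]:
        return False
    gap = max(gene_a["start"], gene_b["start"]) - min(gene_a["end"], gene_b["end"])
    return gap <= max_gap

def conserved_neighbourhood(orthologues_a, orthologues_b, min_genomes=3):
    # predict an interaction if the two families are genomic neighbours in at
    # least 'min_genomes' genomes
    hits = 0
    for genome in set(orthologues_a) & set(orthologues_b):
        if same_context(orthologues_a[genome], orthologues_b[genome]):
            hits += 1
    return hits >= min_genomes

# toy gene coordinates keyed by genome (hypothetical values)
fam_a = {"g1": {"strand": "+", "start": 100, "end": 1000},
         "g2": {"strand": "+", "start": 5000, "end": 5900}}
fam_b = {"g1": {"strand": "+", "start": 1150, "end": 2000},
         "g2": {"strand": "+", "start": 6050, "end": 6800}}
print(conserved_neighbourhood(fam_a, fam_b, min_genomes=2))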
the availability of fully sequenced organisms makes computing the intergenic distances between each pair of genes easy. genes with the same direction of transcription and closer than 300 bases are typically considered to be in the same genomic context (see fig. 12c). the conservation of this closeness must be found in more than two highly divergent organisms to be considered significant, because of the taxonomic biases. while this signal is strong in bacterial genomes, its relevance is unclear in eukaryotic genomes. this is the main drawback of these methodologies. in fact, this signal can only be exploited for eukaryotic organisms by extrapolating the genomic closeness of bacterial genes to their homologues in eukaryotes. obviously, this extrapolation leads to a considerable reduction in the confidence and number of obtained predictions for this evolutionary lineage. however, conserved gene pairs that are transcribed from a shared bidirectional promoter can be detected by similar methods and can be found in eukaryotes as well as prokaryotes (korbel et al. 2004). a further use of evolutionary signals in protein function and physical interaction prediction has been the tendency of interacting proteins to be involved in gene fusion events. sequences that appear as independently expressed orfs in one organism become fused as part of the same polypeptide sequence in another organism. these fusions are strong indicators of functional and structural interaction and have been suggested to increase the effective concentration of the interacting functional domains (enright et al. 1999; marcotte et al. 1999b). this hypothesis proposes that gene fusion could remove the need for diffusion and for the correct relative orientation of the proteins forming the original complex. these fusion events are typically detected when sequence searches for two non-homologous proteins obtain a significant hit in the same sequence. cases matching the same region of the hit sequence are removed (these cases are schematically represented in fig. 12d). in spite of the strength of this signal, gene fusion seems not to be a habitual event in bacterial organisms. the difficulty of distinguishing protein interactions belonging to large evolutionary families is the main drawback of the automatic application of these methodologies.
13 integration of experimentally determined and predicted interactions
as described above, there are many experimental techniques and computational methods for determining and predicting interactions. to obtain the most comprehensive interaction networks possible, as many as possible of these sources of interactions should be integrated. the integration of these resources is complicated by the fact that the different sources are not all equally reliable, and it is thus important to quantify the accuracy of the different evidence supporting an interaction. in addition to the quality issues, comparison of different interaction sets is further complicated by the different nature of the datasets: yeast two-hybrid experiments are inherently binary, whereas pull-down experiments tend to report larger complexes. to allow for comparisons, complexes are typically represented by binary interaction networks; however, it is important to realize that there is not a single, clear definition of a "binary interaction".
for complex pull-down experiments, two different representations have been proposed: the matrix representation, in which each complex is represented by the set of binary interactions corresponding to all pairs of proteins from the complex, and the spoke representation, in which only bait-prey interactions are included (von mering et al. 2002). the binary interactions obtained using either of these representations are somewhat artificial, as some interacting proteins might in reality never touch each other and others might have too low an affinity to interact except in the context of the entire complex bringing them together. even in the case of yeast two-hybrid assays, which inherently report binary interactions, not all interactions correspond to direct physical interactions. the database string ("search tool for the retrieval of interacting genes/proteins") (von mering et al. 2007) represents an effort to provide many of the different types of evidence for functional interactions under one common framework with an integrated scoring scheme. such an integrated approach offers several unique advantages: 1) various types of evidence are mapped onto a single, stable set of proteins, thereby facilitating comparative analysis; 2) known and predicted interactions often partially complement each other, leading to increased coverage; and 3) an integrated scoring scheme can provide higher confidence when independent evidence types agree. in addition to the many associations imported from the protein interaction databases mentioned above (bader et al. 2003; salwinski et al. 2004; guldener et al. 2006; mishra et al. 2006; stark et al. 2006; chatr-aryamontri et al. 2007), string also includes interactions from curated pathway databases (vastrik et al. 2007; kanehisa et al. 2008) and a large body of predicted associations that are produced de novo using many of the methods described in this chapter (dandekar et al. 1998; gaasterland and ragan 1998; pellegrini et al. 1999; marcotte et al. 1999c). these different types of evidence are obviously not directly comparable, and even for the individual types of evidence the reliability may vary. to address these two issues, string uses a two-stage approach. first, a separate scoring scheme is used for each evidence type to rank the interactions according to their reliability; these raw quality scores cannot be compared between different evidence types. second, the ranked interaction lists are benchmarked against a common reference to obtain probabilistic scores, which can subsequently be combined across evidence types. to exemplify how raw quality scores work, we will here explain the scoring scheme used for physical protein interactions from high-throughput screens. the two fundamentally different types of experimental interaction data sets, complex pull-downs and binary interactions, are evaluated using separate scoring schemes. for binary interaction experiments, e.g. yeast two-hybrid, the reliability of an interaction correlates well with the number of non-shared interaction partners for each interactor. string summarizes this in the following raw quality score: log((n1 + 1) · (n2 + 1)), where n1 and n2 are the numbers of non-shared interaction partners. this score is similar to the ig1 measure suggested by saito et al. (2002).
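a minimal sketch of this raw score, computing the non-shared partner counts directly from a toy binary interaction list rather than from any real screen, could look like this:

from collections import defaultdict
from math import log

def raw_y2h_score(network, a, b):
    # raw quality score log((n1 + 1) * (n2 + 1)) for an interaction a-b, where
    # n1 and n2 count interaction partners of a and b that are not shared
    partners = defaultdict(set)
    for x, y in network:
        partners[x].add(y)
        partners[y].add(x)
    shared = partners[a] & partners[b]
    n1 = len(partners[a] - shared - {b})
    n2 = len(partners[b] - shared - {a})
    return log((n1 + 1) * (n2 + 1))

# toy two-hybrid network
network = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "e")]
print(round(raw_y2h_score(network, "a", "b"), 3))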
in the case of complex pull-down experiments, the reliability of the inferred binary interactions correlates better with the number of times the interactors were co-purified relative to what would be expected at random; this is quantified in terms of n12, the number of purifications containing both proteins, n1 and n2, the numbers of purifications containing either protein 1 or protein 2, and n, the total number of purifications. for this purpose, the bait protein was counted twice to account for bait-prey interactions being more reliable than prey-prey interactions. these raw quality scores are calculated for each individual high-throughput screen. scores vary within one dataset, because they include additional, intrinsic information from the data itself, such as the frequency with which an interaction is detected. for medium-sized data sets that are not large enough to apply the topology-based scoring schemes, the same raw score is assigned to all interactions within a dataset. finally, very small data sets are pooled and considered jointly as a single interaction set. we similarly have different scoring schemes for predicted interactions based on coexpression in microarray expression studies, conserved gene neighborhood, gene fusion events and phylogenetic profiles. based on these raw quality scores, a confidence score is assigned to each predicted association by benchmarking the performance of the predictions against a common reference set of trusted, true associations. string uses as reference the functional grouping of proteins maintained at kegg (kyoto encyclopedia of genes and genomes) (kanehisa et al. 2008). any predicted association for which both proteins are assigned to the same "kegg pathway" is counted as a true positive. kegg pathways are particularly suitable as a reference because they are based on manual curation, are available for a number of organisms, and cover several functional areas. other benchmark sets could also be used, for example "biological process" terms from gene ontology (ashburner et al. 2000) or reactome pathways (vastrik et al. 2007). the benchmarked confidence scores in string generally correspond to the probability of finding the linked proteins within the same pathway or biological process. the assignment of probabilistic scores for all evidence types solves many of the issues of data integration. first, incomparable evidence types are made comparable by assigning a score that represents how well the evidence type can predict a certain type of interaction (the type being specified by the reference set used). second, the separate benchmarking of interactions from, for example, different high-throughput protein interaction screens accounts for any differences in reliability between different studies. third, the use of raw quality scores allows us to separate more reliable interactions from less reliable interactions even within a single dataset. the probabilistic nature of the scores also makes it easy to calculate the combined reliability of an interaction given multiple lines of evidence. it is computed under the assumption of independence for the various sources, in a naïve bayesian fashion. in addition to having a good scoring scheme, it is crucial to make the evidence for an interaction transparent to the end users. to achieve this, the string interaction network is made available via a user-friendly web interface (http://string.embl.de). when performing a query, the user will first be presented with a network view, which provides a first, simplified overview (fig. 13).
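the following sketch illustrates both ideas with made-up numbers: a log-ratio of observed to randomly expected co-purifications (one plausible reading of the description above) and the independence-based combination of benchmarked confidence scores; both are simplified sketches, and the exact functional forms used by string may differ:

from math import log

def copurification_score(n12, n1, n2, n):
    # log-ratio of observed co-purifications to the random expectation n1*n2/n
    expected = n1 * n2 / n
    return log(n12 / expected)

def combine_evidence(probabilities):
    # naive bayesian combination of independent benchmarked confidence scores:
    # p = 1 - prod(1 - p_i)
    p_no_interaction = 1.0
    for p in probabilities:
        p_no_interaction *= (1.0 - p)
    return 1.0 - p_no_interaction

print(round(copurification_score(n12=8, n1=20, n2=15, n=600), 3))
print(round(combine_evidence([0.6, 0.5, 0.3]), 3))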
from here the user has full control over parameters such as the number of proteins shown in the network (nodes) and the minimal reliability required for an interaction (edge) to be displayed. from the network, the user also has the ability to drill down on the evidence that underlies any given interaction using the dedicated viewer for each evidence type. for example, it is possible to inspect the publications that support a given interaction, the set of proteins that were co-purified in a particular experiment, and the phylogenetic profiles or genomic context based on which an interaction was predicted.
fig. 13 protein interaction network of the core cell-cycle regulation in human. the network was constructed by querying the string database (von mering et al. 2007) for very high confidence interactions (conf. score > 0.99) between four cyclin-dependent kinases, their associated cyclins, the wee1 kinase and the cdc25 phosphatases. the network correctly recapitulates that cdc2 interacts with cyclin-a/b, cdk2 with cyclin-a/e, and cdk4/6 with cyclin-d. it also shows that wee1 and the cdc25 phosphatases regulate cdc2 and cdk2 but not cdk4 and cdk6. moreover, the network suggests that the cdc25a phosphatase regulates cdc2 and cdk2, whereas cdc25b and cdc25c specifically regulate cdc2.
protein binding is commonly characterized by specific interactions of evolutionarily conserved domains (pawson and nash 2003). domains are fundamental units of protein structure and function, which are incorporated into different proteins by genetic duplications and rearrangements (vogel et al. 2004). globular domains are defined as structural units of fifty and more amino acids that usually fold independently of the remaining polypeptide chain to form stable, compact structures (orengo and thornton 2005). they often carry important functional sites and determine the specificity of protein interactions (fig. 14).
fig. 14 exemplary interaction between the two human proteins hhr23b and ataxin-3. each protein domain commonly adopts a particular 3d structure and may fulfill a specific molecular function. generally, the domains responsible for an observed protein-protein interaction need to be determined before further functional characterizations are possible. in the depicted protein-protein interaction, it is known from experiments that the ubiquitin-like domain ubl of hhr23b forms a complex with the de-ubiquitinating josephin domain of ataxin-3 (nicastro et al. 2005).
essential information on the cellular function of specific protein interactions and complexes can often be gained from the known functions of the interacting protein domains. domains may contain binding sites for proteins and ligands such as metabolites, dna/rna, and drug-like molecules (xia et al. 2004). widely spread domains that mediate molecular interactions can be found alone or combined in conjunction with other domains and intrinsically disordered, mainly unstructured, protein regions connecting globular domains (dunker et al. 2005). according to apic et al. (2001), multi-domain proteins constitute two thirds of unicellular and 80% of metazoan proteomes. one and the same domain can occur in different proteins, and many domains of different types are frequently found in the same amino acid chain. much effort is being invested in discovering, annotating, and classifying protein domains both from the functional (pfam (finn et al. 2006), smart (letunic et al. 2006), cdd (marchler-bauer et al. 2007), interpro (mulder et al.
2007 ) and structural (scop (andreeva et al. 2004) , cath (greene et al. 2007 )) perspective. notably, it may be confusing that the term domain is commonly used in two slightly different meanings. in the context of domain databases such as pfam and smart, a domain is basically defined by a set of homologous sequence regions, which constitute a domain family. in contrast, a specific protein may contain one or more domains, which are concrete sequence regions within its amino acid sequence corresponding to autonomously folding units. domain families are commonly represented by hidden markov models (hmms), and highly sensitive search tools like hmmer (eddy 1998 ) are used to identify domains in protein sequences. different sources of information about interacting domains with experimental evidence are available. experimentally determined interactions of single-domain proteins indicate domain-domain interactions. similarly, experiments using protein fragments help identifying interaction domains, but this knowledge is frequently hidden in the text of publications and not contained in any database. however, domain databases like pfam, smart, and interpro may contain some annotation obtained by manual literature curation. in the near future, high-throughput screening techniques will result in even larger amounts of protein fragment interaction data to delineate domain borders and interacting protein regions (colland and daviet 2004) . above all, three-dimensional structures of protein domain complexes are experimentally solved by x-ray crystallography or nmr and are deposited in the pdb database (berman et al. 2007) . structural contacts between two interacting proteins can be derived by mapping sequence positions of domains onto pdb structures. extensive investigations of domain combinations in proteins of known structures (apic et al. 2001 ) as well as of structurally resolved homo-or heterotypic domain interactions (park et al. 2001) revealed that the overlap between intra-and intermolecular domain interactions is rather limited. two databases, ipfam (finn et al. 2005 ) and 3did (stein et al. 2005) , provide pre-computed structural information about protein interactions at the level of pfam domains. analysis of structural complexes suggests that interactions between a given pair of proteins may be mediated by different domain pairs in different situations and in different organisms. nevertheless, many domain interactions, especially those involved in basic cellular processes such as dna metabolism and nucleotide binding, tend to be evolutionarily conserved within a wide range of species from prokaryotes to eukaryotes (itzhaki et al. 2006) . in yeast, pfam domain pairs are associated with over 60% of experimentally known protein interactions, but only 4.5% of them are covered by ipfam (schuster-bockler and bateman 2007) . domain interactions can be inferred from experimental data on protein interactions by identifying those domain pairs that are significantly overrepresented in interacting proteins compared to random protein pairs (deng et al. 2002; ng et al. 2003a; riley et al. 2005; sprinzak and margalit 2001) (fig. 15) . however, the predictive power of such an approach is strongly dependent on the quality of the data used as the source of information for protein interactions, and the coverage of protein sequences in terms of domain assignments. 
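a minimal sketch of the overrepresentation idea (which the maximum-likelihood estimators described next refine) is shown below; the protein names, domain labels and the tiny positive/negative pair lists are hypothetical:

from itertools import product

def domain_pair_frequencies(interactions, non_interactions, domains_of):
    # for every domain pair, count how often it occurs in interacting versus
    # non-interacting protein pairs; the ratio is a crude interaction propensity
    def count(pairs):
        counts = {}
        for prot_a, prot_b in pairs:
            for d_pair in {tuple(sorted(dp)) for dp in
                           product(domains_of[prot_a], domains_of[prot_b])}:
                counts[d_pair] = counts.get(d_pair, 0) + 1
        return counts

    pos, neg = count(interactions), count(non_interactions)
    return {dp: pos[dp] / (pos[dp] + neg.get(dp, 0)) for dp in pos}

# toy data: protein -> pfam-like domain annotations (hypothetical identifiers)
domains_of = {"p1": ["sh3"], "p2": ["pro_rich"], "p3": ["sh3", "kinase"], "p4": ["ubl"]}
interacting = [("p1", "p2"), ("p3", "p2")]
non_interacting = [("p1", "p4")]
print(domain_pair_frequencies(interacting, non_interacting, domains_of))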
basically, the likelihood of two domains, d_i and d_j, to interact can be estimated as the fraction of protein pairs known to interact among all protein pairs in the dataset containing this domain pair. this basic idea has been improved upon by using a maximum-likelihood (ml) approach based on the expectation-maximization (em) algorithm. this method finds the maximum likelihood estimator of the observed protein-protein interactions by an iterative cycle of computing the expected likelihood (e-step) and maximizing over the unobserved parameters (domain interaction propensities) in the m-step. when the algorithm converges (i.e. the total likelihood cannot be further improved by the algorithm), the ml estimate for the likelihood of the unobserved domain interactions is found (deng et al. 2002; riley et al. 2005). riley and colleagues further improved this method by excluding each potentially interacting domain pair from the dataset and recomputing the ml estimate to obtain an additional confidence value for the respective domain-domain interaction. this domain pair exclusion (dpea) method measures the contribution of each domain pair to the overall likelihood of the protein interaction network based on domain-domain interactions. in particular, this approach enables the prediction of specific domain-domain interactions between selected proteins which would have been missed by the basic ml method. another ml-based algorithm is insite, which takes differences in the reliability of the protein-protein interaction data into account (wang et al. 2007a). it also integrates external evidence such as functional annotation or domain fusion events. an alternative method for deriving domain interactions is through co-evolutionary analysis, which exploits the notion that mutations of residue pairs at the interaction interfaces are correlated to preserve favorable physico-chemical properties of the binding surface (jothi et al. 2006). the pair of domains mediating interactions between two proteins p1 and p2 may therefore be expected to display a higher similarity of their phylogenetic trees than other, non-interacting domains (fig. 16). the degree of agreement between the evolutionary histories of two domains, d_i and d_j, can be computed by pearson's correlation coefficient r_ij between the similarity matrices of the domain sequences in different organisms, r_ij = [∑_{p<q} (m^i_pq − mean(m^i))(m^j_pq − mean(m^j))] / sqrt(∑_{p<q} (m^i_pq − mean(m^i))^2 · ∑_{p<q} (m^j_pq − mean(m^j))^2), where n is the number of species, the sums run over all species pairs p < q ≤ n, m^i_pq and m^j_pq are the evolutionary distances between species p and q, and mean(m^i) and mean(m^j) are the mean values of the matrices, respectively. in fig. 16 the evolutionary tree of domain d2 is most similar to those of d5 and d6, corroborating the actual binding region. a well-known limitation of correlated mutation analysis is that it is very difficult to decide whether residue co-variation happens as a result of functional co-evolution directed at preserving interaction sites, or because of sequence divergence due to speciation. to address this problem, it has been suggested to distinguish the relative contributions of conserved and more variable regions in aligned sequences to the co-evolution signal, based on the hypothesis that functional co-evolution is more prominent in conserved regions. finally, interacting domains can be identified by phylogenetic profiling, as described above for full-chain proteins. as in the case of complete protein chains, the similarity of evolutionary patterns shared by two domains may indicate that they interact with each other directly or at least share a common functional role (pagel et al. 2004).
as illustrated in fig. 17, clustering protein domains with similar phylogenetic profiles allows researchers to build domain interaction networks which provide clues for describing molecular complexes. similarly, the domainteam method (pasek et al. 2005) considers chromosomal neighborhoods at the level of conserved domain groups. a number of resources provide and combine experimentally derived and predicted domain interaction data. interdom (http://interdom.i2r.a-star.edu.sg/) integrates domain-interaction predictions based on known protein interactions and complexes with domain fusion events (ng et al. 2003b). dima (http://mips.gsf.de/genre/proj/dima2) is another database of domain interactions, which integrates experimentally demonstrated domain interactions from ipfam and 3did with predictions based on the dpea algorithm and phylogenetic domain profiling.
fig. 16 co-evolutionary analysis of domain interactions. two orthologous proteins from different organisms known to interact with each other are shown. the first protein consists of two domains, d1 and d2, while the second protein includes the domains d3, d4, d5, and d6. evolutionary trees for each domain are shown; their similarity serves as an indication of interaction likelihood that is encoded in the interaction matrix.
recently, two new comprehensive resources, domine (http://domine.utdallas.edu) (raghavachari et al. 2008) and dasmi (http://www.dasmi.de) (blankenburg et al. 2008, submitted), were introduced and are available online. these resources contain ipfam and 3did data and predicted domain interactions taken from several other publications. the predictions are based on several methods for deriving domain interactions from protein interaction data, phylogenetic domain profiling data and domain coevolution. with the availability of an increasing number of predictions, the task of method weighting and quality assessment becomes crucial. a thorough analysis of the quality of domain interaction data can be found in schlicker et al. (2007). beyond domain-domain contacts, an alternative mechanism of mediating molecular recognition is through the binding of protein domains to short sequence regions (santonico et al. 2005), typically from three to eight residues in length (zarrinpar et al. 2003; neduva et al. 2005). such linear recognition motifs can be discovered from protein interaction data by identifying amino acid sequence patterns overrepresented in proteins that do not possess significant sequence similarity, but share the same interacting partner (yaffe 2006). web services like elm (http://elm.eu.org) (puntervoll et al. 2003) support the identification of linear motifs in protein sequences. as described above, specific adapter domains can mediate protein-protein interactions. while some of these interaction domains recognize small target peptides, others are involved in domain-domain interactions. as short binding motifs have a rather high probability of being found by chance and the exact mechanisms of binding specificity for this mode of interaction are not understood completely, the prediction of protein-protein interactions based on binding domains is currently limited to domain-domain interactions for which reliable data are available. predicting ppis from domain interactions may simply be achieved by reversing the ideas discussed above, that is, by using the domain composition of proteins to evaluate the interaction likelihood of proteins (bock and gough 2001; sprinzak and margalit 2001; wojcik and schachter 2001).
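a minimal sketch of this reversal, anticipating the naive matching approach elaborated in the next paragraph, is given below; the domain labels and the small set of trusted domain pairs are hypothetical:

def predict_interaction(domains_a, domains_b, interacting_domain_pairs):
    # naive prediction: two proteins are predicted to interact if any of their
    # domain combinations is a known (or confidently predicted) interacting pair
    for d_a in domains_a:
        for d_b in domains_b:
            if tuple(sorted((d_a, d_b))) in interacting_domain_pairs:
                return True
    return False

# toy set of trusted domain-domain interactions (ipfam/3did-like in spirit)
known_pairs = {tuple(sorted(p)) for p in [("sh3", "pro_rich"), ("ubl", "josephin")]}
print(predict_interaction(["sh3", "kinase"], ["pro_rich"], known_pairs))   # True
print(predict_interaction(["kinase"], ["pro_rich"], known_pairs))          # False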
in a naive approach, domain interactions are treated as independent, and all protein pairs with a matching pair of interacting domains are predicted to engage in an interaction. given that protein interactions may also be mediated by several domain interactions simultaneously, more advanced statistical methods take into account dependencies between domains and exploit domain combinations (han et al. 2004 ) and multiple interacting domain pairs (chen and liu 2005) . exercising and validating these prediction approaches revealed that the most influential factor for ppi prediction is the quality of the underlying data. this suggests that, as for most biological predictions in other fields, the future of prediction methods for protein and domain interactions may lie in the integration of different sources of evidence and weighting the individual contributions based on calibration to goldstandard data. further methodological improvements may include the explicit consideration of cooperative domains, that is, domain pairs that jointly interact with other domains (wang et al. 2007b ). basic interactions between two or up to a few biomolecules are the basic elements of the complex molecular interaction networks that enable the processes of life and, when thrown out of their intended equilibrium, manifest the molecular basis of diseases. such interactions are at the basis of the formation of metabolic, regulatory or signal transduction pathways. furthermore the search for drugs boils down to analyzing the interactions between the drug molecule and the molecular target to which it binds, which is often a protein. for the analysis of a single molecular interaction, we do not need complex biological screening data. thus it is not surprising that the analysis of the interactions between two molecules, one of them being a protein, has the longest tradition in computational biology of all problems involving molecular interactions, dating back over three decades. the basis for such analysis is the knowledge of the three-dimensional structure of the involved molecules. to date, such knowledge is based almost exclusively on experimental measurements, such as x-ray diffraction data or nmr spectra. there are also a few reported cases in which the analysis of molecular interactions based on structural models of protein has led to successes. the analysis of the interaction of two molecules based on their three-dimensional structure is called molecular docking. the input is composed of the three-dimensional structures of the participating molecules. (if the involved molecule is very flexible one admissible structure is provided.) the output consists of the three-dimensional structure of the molecular complex formed by the two molecules binding to each other. furthermore, usually an estimate of the differential free energy of binding is given, that is, the energy difference dg between the bound and the unbound conformation. for the binding event to be favorable that difference has to be negative. this slight misnomer describes the binding between a protein molecule and a small molecule. the small molecule can be a natural substrate such as a metabolite or a molecule to be designed to bind tightly to the protein such as a drug molecule. proteinligand docking is the most relevant version of the docking problem because it is a useful help in searching for new drugs. 
also, the problem lends itself especially well to computational analysis, because in pharmaceutical applications one is looking for small molecules that are binding very tightly to the target protein, and that do so in a conformation that is also a low-energy conformation in the unbound state. thus, subtle energy differences between competing ligands or binding modes are not of prime interest. for these reasons there is a developed commercial market for protein-ligand docking software. usually the small molecule has a molecular weight of up to several hundred daltons and can be quite flexible. typically, the small molecule is given by its 2d structure formula, e.g., in the form of a smiles string (weininger 1988) . if a starting 3d conformation is needed there is special software for generating such a conformation (see, e.g. (pearlman 1987; sadowski et al. 1994) ). challenges of the protein ligand problem are (i) finding the correct conformation of the usually highly flexible ligand in the binding site of the protein, (ii) determining the subtle conformational changes in the binding site of the protein upon binding of the ligand, which are termed induced fit, (iii) producing an accurate estimate of the differential energy of binding or at least ranking different conformations of the same ligand and conformations of different ligands correctly by their differential energy of binding. methods tackling problem (ii) can also be used to rectify smaller errors in structural models of proteins whose structure has not been resolved experimentally. the solution of problem (iii) provides the essential selection criterion for preferred ligands and binding modes, namely those with lowest differential energy of binding. challenge (i) has basically been conquered in the last decade as a number of docking programs have been developed that can efficiently sample the conformational space of the ligand and produce correct binding modes of the ligand within the protein, assuming that the protein is given in the correct structure for binding the ligand. several methods are applied here. the most brute-force method is to just try different (rigid) conformations of the ligand one after the other. if the program is fast enough one can run through a sizeable number of conformations per ligand (mcgann et al. 2003) . a more algorithmic and quite successful method is to build up the ligand from its molecular fragments inside the binding pocket of the protein (rarey et al. 1996 ). yet another class of methods sample ligand conformations inside the protein binding pocket by methods such as local search heuristics, monte carlo sampling or genetic algorithms (abagyan et al. 1994; jones et al. 1997; morris et al. 1998 ). there are also programs exercising combinations of different methods (friesner et al. 2004 ). the reported methods usually can compute the binding mode of a ligand inside a protein within fractions of a minute to several minutes. the resulting programs can be applied to screening through large databases of ligands involving hundreds of thousands to millions of compounds and are routinely used in pharmaceutical industry in the early stages of drug design and selection. they are also repeatedly compared on benchmark datasets (kellenberger et al. 2004; englebienne et al. 2007 ). 
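none of the cited docking engines is reproduced here, but the control flow of the sampling-based methods can be illustrated with a toy metropolis monte carlo search over ligand torsion angles; the scoring function below is a stand-in quadratic penalty, not a physical energy model:

import math
import random

def metropolis_torsion_search(score, n_torsions, steps=5000, temperature=1.0, seed=0):
    # toy conformational search: propose random changes of one torsion angle at
    # a time and accept them with the metropolis criterion on a stand-in score
    rng = random.Random(seed)
    angles = [rng.uniform(-math.pi, math.pi) for _ in range(n_torsions)]
    best, best_angles = score(angles), list(angles)
    current = best
    for _ in range(steps):
        candidate = list(angles)
        candidate[rng.randrange(n_torsions)] += rng.gauss(0.0, 0.3)
        new = score(candidate)
        if new < current or rng.random() < math.exp((current - new) / temperature):
            angles, current = candidate, new
            if new < best:
                best, best_angles = new, list(candidate)
    return best, best_angles

# stand-in "energy": squared deviation from an arbitrary target geometry
target = [0.5, -1.2, 2.0]
def dummy_score(angles):
    return sum((x - t) ** 2 for x, t in zip(angles, target))

print(round(metropolis_torsion_search(dummy_score, n_torsions=3)[0], 4))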
more complex methods from computational biophysics, such as molecular dynamics (md) simulations, which compute a trajectory of the molecular movement based on the forces exerted on the molecules, take hours on a single problem instance and can only be used for final refinement of the complex. challenges (ii) and (iii) have not been solved yet. concerning problem (ii), structural changes in the protein can involve redirections of side chains in or close to the binding pocket and more substantial changes involving backbone movement. while methods have recently been developed to optimize side-chain placement upon ligand binding (claußen et al. 2001; sherman et al. 2006), the problem of finding the correct structural change upon binding involving backbone and side-chain movement is open (carlson 2002). concerning problem (iii), there are no scoring functions to date that are able to sufficiently accurately estimate the differential energy of binding on a diverse set of protein-ligand complexes (huang and zou 2006). this is especially unfortunate, as an inaccurate estimate of the binding energy causes the docking program to disregard correct complex structures, even though they have been sampled by the docking program, because they are labeled with incorrect energies. this is the major problem in docking which limits the accuracy of the predictions. recent reviews on protein-ligand docking have been published in sousa et al. (2006) and rarey et al. (2007). one restriction with protein-ligand docking as it applies to drug design and selection is that the three-dimensional structure of the target protein needs to be known. many pharmaceutical targets are membrane-standing proteins for which we do not have the three-dimensional structure. for such proteins there is a version of drug screening that can be viewed as the negative imprint of docking: instead of docking the drug candidate into the binding site of the protein (which is not available), we superpose the drug candidate (which is here called the test molecule) onto another small molecule which is known to bind to the binding site of the protein. such a molecule can be the natural substrate for the target protein or another drug targeting that protein. let us call this small molecule the reference molecule. the suitability of the new drug candidate is then assessed on the basis of its structural and chemical similarity with the reference molecule. one problem is that now both the test molecule and the reference molecule can be highly flexible. but in many cases largely rigid reference molecules can be found, and in other cases it suffices to superpose the test molecule onto any low-energy conformation of the reference molecule. there are several classes of drug screening programs based on this molecular comparison, ranging from (i) programs that perform a detailed analysis of the three-dimensional structures of the molecules to be compared (e.g. lemmen et al. 1998; krämer et al. 2003), across (ii) programs that perform a topological analysis of the two molecules (rarey and dixon 1998; gillet et al. 2003), to (iii) programs that represent both molecules by binary or numerical property vectors which are compared with string methods (mcgregor and muskal 1999; xue et al. 2000). the first class of programs requires fractions of a second to fractions of a minute for a single comparison, the second can perform hundreds of comparisons per second, and the third up to several tens of thousands of comparisons per second.
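for class (iii), a common (though not the only) choice for comparing binary property vectors is the tanimoto coefficient; a minimal sketch with toy fingerprints follows:

def tanimoto(fp_a, fp_b):
    # tanimoto similarity of two binary property vectors (bit fingerprints):
    # shared on-bits divided by the union of on-bits
    on_a = {i for i, bit in enumerate(fp_a) if bit}
    on_b = {i for i, bit in enumerate(fp_b) if bit}
    if not on_a and not on_b:
        return 0.0
    return len(on_a & on_b) / len(on_a | on_b)

# toy 12-bit fingerprints for a reference molecule and two test molecules
reference = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
test_1    = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]
test_2    = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
print(round(tanimoto(reference, test_1), 2), round(tanimoto(reference, test_2), 2))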
reviews of methods for drug screening based on ligand comparison are given in (lengauer et al. 2004; kämper et al. 2007). in protein-protein docking, both binding partners are proteins. since drugs tend to be small molecules, this version of the docking problem is not of prime interest in drug design. also, the energy balance of protein-protein binding is much more involved than for protein-ligand binding. optimal binding modes tend not to form troughs in the energy landscape that are as pronounced as for protein-ligand docking. the binding mode is determined by subtle side-chain rearrangements of both binding partners that implement the induced fit along typically quite large binding interfaces. the energy balance is dominated by difficult-to-analyze entropic terms involving the desolvation of water within the binding interface. for these reasons, the software landscape for protein-protein docking is not as well developed as for protein-ligand docking, and there is no commercial market for protein-protein docking software. protein-protein docking approaches are based either on conformational sampling and md (which can naturally incorporate molecular flexibility but suffers from very high computing demands) or on combinatorial sampling with both proteins considered rigid, in which case handling of protein flexibility has to be incorporated through methodical extensions. for space reasons we do not detail methods for protein-protein docking. a recent review on the subject can be found in hildebrandt et al. (2007). a variant of protein-protein docking is protein-dna docking. this problem shares with protein-protein docking the character that both binding partners are macromolecules. however, entropic aspects of the energy balance are even more dominant in protein-dna docking than in protein-protein docking. furthermore, when binding to proteins, dna can assume nonstandard shapes that deviate much more from the familiar double helix than we are used to when considering induced-fit phenomena.
key: cord-285522-3gv6469y authors: bello-orgaz, gema; jung, jason j.; camacho, david title: social big data: recent achievements and new challenges date: 2015-08-28 journal: inf fusion doi: 10.1016/j.inffus.2015.08.005 sha: doc_id: 285522 cord_uid: 3gv6469y big data has become an important issue for a large number of research areas such as data mining, machine learning, computational intelligence, information fusion, the semantic web, and social networks. the rise of different big data frameworks such as apache hadoop and, more recently, spark, for massive data processing based on the mapreduce paradigm has allowed for the efficient utilisation of data mining methods and machine learning algorithms in different domains. a number of libraries such as mahout and spark mllib have been designed to develop new efficient applications based on machine learning algorithms. the combination of big data technologies and traditional machine learning algorithms has generated new and interesting challenges in other areas such as social media and social networks. these new challenges are focused mainly on problems such as data processing, data storage, data representation, and how data can be used for pattern mining, analysing user behaviours, and visualizing and tracking data, among others. in this paper, we present a review of the new methodologies that are designed to allow for efficient data mining and information fusion from social media, and of the new applications and frameworks that are currently appearing under the "umbrella" of the social networks, social media and big data paradigms. data volume and the multitude of sources have experienced exponential growth, creating new technical and application challenges; data generation has been estimated at 2.5 exabytes (1 exabyte = 1,000,000 terabytes) of data per day [1]. these data come from everywhere: sensors used to gather climate, traffic and flight information, posts to social media sites (twitter and facebook are popular examples), digital pictures and videos (youtube users upload 72 hours of new video content per minute [2]), transaction records, and cell phone gps signals, to name a few. the classic methods, algorithms, frameworks, and tools for data management have become both inadequate for processing this amount of data and unable to offer effective solutions for managing the data growth. the problem of managing and extracting useful knowledge from these data sources is currently one of the most popular topics in computing research [3, 4]. in this context, big data is a popular phenomenon that aims to provide an alternative to traditional solutions based on databases and data analysis.
big data is not just about storage or access to data; its solutions aim to analyse data in order to make sense of them and exploit their value. big data refers to datasets that are terabytes to petabytes in size, and such datasets are usually characterised by the following features:
• volume: refers to the very large and constantly growing size of the datasets, which generates a number of challenges in obtaining valuable knowledge for people and companies (see value feature).
• velocity: refers to the speed of data transfers. the data's contents are constantly changing through the absorption of complementary data collections, the introduction of previous data or legacy collections, and the different forms of streamed data from multiple sources. from this point of view, new algorithms and methods are needed to adequately process and analyse the online and streaming data.
• variety: refers to different types of data collected via sensors, smartphones or social networks, such as videos, images, text, audio, data logs, and so on. moreover, these data can be structured (such as data from relational databases) or unstructured in format.
• value: refers to the process of extracting valuable information from large sets of social data, and it is usually referred to as big data analytics. value is the most important characteristic of any big-data-based application, because it allows useful business information to be generated.
• veracity: refers to the correctness and accuracy of information. behind any information management practice lie the core doctrines of data quality, data governance, and metadata management, along with considerations of privacy and legal concerns.
some examples of potential big data sources are the open science data cloud [8], the european union open data portal, open data from the u.s. government, healthcare data, public datasets on amazon web services, etc. social media [9] has become one of the most representative and relevant data sources for big data. social media data are generated from a wide number of internet applications and web sites, with some of the most popular being facebook, twitter, linkedin, youtube, instagram, google, tumblr, flickr, and wordpress. the fast growth of these web sites allows users to be connected and has created a new generation of people (maybe a new kind of society [10]) who are enthusiastic about interacting, sharing, and collaborating using these sites [11]. this information has spread to many different areas such as everyday life [12] (e-commerce, e-business, e-tourism, hobbies, friendship, ...), education [13], health [14], and daily work. in this paper, we assume that social big data comes from joining the efforts of the two previous domains: social media and big data. therefore, social big data will be based on the analysis of vast amounts of data that could come from multiple distributed sources but with a strong focus on social media. hence, social big data analysis [15, 16] is inherently interdisciplinary and spans areas such as data mining, machine learning, statistics, graph mining, information retrieval, linguistics, natural language processing, the semantic web, ontologies, and big data computing, among others. its applications can be extended to a wide number of domains such as health and political trending and forecasting, hobbies, e-business, cybercrime, counterterrorism, time-evolving opinion mining, social network analysis, and human-machine interactions.
the concept of social big data can be defined as follows: "those processes and methods that are designed to provide sensitive and relevant knowledge to any user or company from social media data sources when data sources can be characterised by their different formats and contents, their very large size, and the online or streamed generation of information." the gathering, fusion, processing and analysing of big social media data from unstructured (or semi-structured) sources to extract valuable knowledge is an extremely difficult task which has not been completely solved. the classic methods, algorithms, frameworks and tools for data management have become inadequate for processing the vast amount of data. this issue has generated a large number of open problems and challenges in the social big data domain related to different aspects such as knowledge representation, data management, data processing, data analysis, and data visualisation [17]. some of these challenges include accessing very large quantities of unstructured data (management issues), determining how much data is enough and of sufficiently high quality (quality versus quantity), processing dynamically changing data streams, and ensuring enough privacy (ownership and security), among others. however, given the very large heterogeneous datasets from social media, one of the major challenges is to identify the valuable data and how to analyse them to discover useful knowledge that improves the decision making of individual users and companies [18]. in order to analyse social media data properly, the traditional analytic techniques and methods (data analysis) need to be adapted and integrated into the new big data paradigms that have emerged for massive data processing. different big data frameworks such as apache hadoop [19] and spark [20] have arisen to allow the efficient application of data mining methods and machine learning algorithms in different domains. based on these big data frameworks, several libraries such as mahout [21] and spark mllib [22] have been designed to develop new efficient versions of classical algorithms. this paper is focused on reviewing those new methodologies, frameworks, and algorithms that are currently appearing under the big data paradigm, and their applications to a wide number of domains such as e-commerce, marketing, security, and healthcare. finally, summarising the concepts mentioned previously, fig. 1 shows the conceptual representation of the three basic social big data areas: social media as a natural source for data analysis; big data as a parallel and massive processing paradigm; and data analysis as a set of algorithms and methods used to extract and analyse knowledge. the intersections between these clusters reflect the concept of mixing those areas. for example, the intersection between big data and data analysis shows some machine learning frameworks that have been designed on top of big data technologies (mahout [21], mlbase [23, 24], or spark mllib [22]). the intersection between data analysis and social media represents the concept of current web-based applications that intensively use social media information, such as applications related to marketing and e-health that are described in section 4. the intersection between big data and social media is reflected in some social media applications such as linkedin, facebook, and youtube that are currently using big data technologies (mongodb, cassandra, hadoop, and so on) to develop their web systems.
finally, the centre of this figure simply represents the main goal of any social big data application: knowledge extraction and exploitation. the rest of the paper is structured as follows: section 2 provides an introduction to the basics of the methodologies, frameworks, and software used to work with big data. section 3 provides a description of the current state of the art in the data mining and data analytic techniques that are used in social big data. section 4 describes a number of applications related to marketing, crime analysis, epidemic intelligence, and user experiences. finally, section 5 describes some of the current problems and challenges in social big data; this section also provides some conclusions about the recent achievements and future trends in this interesting research area. currently, the exponential growth of social media has created serious problems for traditional data analysis algorithms and techniques (such as data mining, statistics, machine learning, and so on) due to their high computational complexity for large datasets. these types of methods do not scale properly as the data size increases. for this reason, the methodologies and frameworks behind the big data concept are becoming very popular in a wide number of research and industrial areas. this section provides a short introduction to the methodology based on the mapreduce paradigm and a description of the most popular framework that implements this methodology, apache hadoop. afterwards, apache spark is described as an emerging big data framework that improves on the current performance of the hadoop framework. finally, some implementations and tools for the big data domain related to distributed data file systems, data analytics, and machine learning techniques are presented. mapreduce [25, 26] is presented as one of the most efficient big data solutions. this programming paradigm and its related algorithms [27] were developed to provide significant improvements in large-scale data-intensive applications in clusters [28]. the programming model implemented by mapreduce is based on the definition of two basic elements: mappers and reducers. the idea behind this programming model is to design map functions (or mappers) that are used to generate a set of intermediate key/value pairs, after which the reduce functions will merge (reduce can be used as a shuffling or combining function) all of the intermediate values that are associated with the same intermediate key. the key aspect of the mapreduce algorithm is that if every map and reduce is independent of all other ongoing maps and reduces, then the operations can be run in parallel on different keys and lists of data. although three functions, map(), combining()/shuffling(), and reduce(), are the basic processes in any mapreduce approach, usually they are decomposed as follows:
1. prepare the input: the mapreduce system designates map processors (or worker nodes), assigns the input key value k1 that each processor will work on, and provides each processor with all of the input data associated with that key value.
2. the map() step: each worker node applies the map() function to the local data and writes the output to a temporary storage space. the map() code is run exactly once for each k1 key value, generating output that is organised by key values k2. a master node arranges it so that for redundant copies of input data only one is processed.
3. the shuffle() step: the map output is sent to the reduce processors, which assign the k2 key value that each processor should work on, and provide that processor with all of the map-generated data associated with that key value, such that all data belonging to one key are located on the same worker node.
4. the reduce() step: worker nodes process each group of output data (per key) in parallel, executing the user-provided reduce() code; each function is run exactly once for each k2 key value produced by the map step.
5. produce the final output: the mapreduce system collects all of the reduce outputs and sorts them by k2 to produce the final outcome.
fig. 2 shows the classical "word count problem" using the mapreduce paradigm. as shown in fig. 2, a process initially splits the data into a subset of chunks that will later be processed by the mappers; in the example of the figure, the input text is split into the keyed chunks (01, "i thought i"), (02, "thought of thinking"), and (03, "of thanking you"). once the key/values are generated by the mappers, a shuffling process is used to mix (combine) these key values (combining the same keys in the same worker node). finally, the reduce functions are used to count the words, generating a common output as a result of the algorithm. as a result of the execution of the mappers/reducers, the output is a sorted list of word counts from the original text input. finally, and before the application of this paradigm, it is essential to understand whether the algorithms can be translated into mappers and reducers or whether the problem can be analysed using traditional strategies. mapreduce provides an excellent technique to work with large sets of data when the algorithm can work on small pieces of that dataset in parallel, but if the algorithm cannot be mapped into this methodology, it may be "trying to use a sledgehammer to crack a nut". any mapreduce system (or framework) is based on a mapreduce engine that allows for implementing the algorithms and distributing the parallel processes. apache hadoop [19] is an open-source software framework written in java for the distributed storage and distributed processing of very large datasets using the mapreduce paradigm. all of the modules in hadoop have been designed taking into account the assumption that hardware failures (of individual machines or of racks of machines) are commonplace and thus should be automatically managed in the software by the framework. the core of apache hadoop comprises a storage area, the hadoop distributed file system (hdfs), and a processing area (mapreduce). the hdfs (see section 2.4.1) spreads multiple copies of the data across different machines. this not only offers reliability without the need for raid-controlled disks but also allows for multiple locations to run the mapping. if a machine with one copy of the data is busy or offline, another machine can be used. a job scheduler (in hadoop, the jobtracker) keeps track of which mapreduce jobs are executing; schedules individual maps, reduces, or intermediate merging operations to specific machines; monitors the successes and failures of these individual tasks; and works to complete the entire batch job. the hdfs and the job scheduler can be accessed by the processes and programs that need to read and write data and to submit and monitor the mapreduce jobs. however, hadoop presents a number of limitations:
1. for maximum parallelism, you need the maps and reduces to be stateless, to not depend on any data generated in the same mapreduce job. you cannot control the order in which the maps or the reductions run.
2. hadoop is very inefficient (in both cpu time and power consumed) if similar searches are repeated over and over. a database with an index will always be faster than running a mapreduce job over un-indexed data. however, if that index needs to be regenerated whenever data are added, and data are being added continually, mapreduce jobs may have an edge.
3. in the hadoop implementation, reduce operations do not take place until all of the maps have been completed (or have failed and been skipped). as a result, you do not receive any data back until the entire mapping has finished.
4. there is a general assumption that the output of the reduce is smaller than the input to the map. that is, you are taking a large data source and generating smaller final values.
apache spark [20] is an open-source cluster computing framework that was originally developed in the amplab at the university of california, berkeley. spark had over 570 contributors in june 2015, making it a very high-activity project in the apache software foundation and one of the most active big data open-source projects. it provides high-level apis in java, scala, python, and r and an optimised engine that supports general execution graphs. it also supports a rich set of high-level tools including spark sql for sql and structured data processing, spark mllib for machine learning, graphx for graph processing, and spark streaming. the spark framework allows a working set of data to be reused across multiple parallel operations. this includes many iterative machine learning algorithms as well as interactive data analysis tools. therefore, this framework supports these applications while retaining the scalability and fault tolerance of mapreduce. to achieve these goals, spark introduces an abstraction called resilient distributed datasets (rdds). an rdd is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. in contrast to hadoop's two-stage disk-based mapreduce paradigm (mappers/reducers), spark's in-memory primitives provide performance up to 100 times faster for certain applications by allowing user programs to load data into a cluster's memory and to query it repeatedly. one of the many interesting features of spark is that this framework is particularly well suited to machine learning algorithms [29]. from a distributed computing perspective, spark requires a cluster manager and a distributed storage system. for cluster management, spark supports stand-alone (native spark cluster), hadoop yarn, and apache mesos. for distributed storage, spark can interface with a wide variety of systems, including the hadoop distributed file system, apache cassandra, openstack swift, and amazon s3. spark also supports a pseudo-distributed local mode that is usually used only for development or testing purposes, when distributed storage is not required and the local file system can be used instead; in this scenario, spark runs on a single machine with one executor per cpu core. a list of big data implementations and mapreduce-based applications was generated by mostosi [30]. although the author finds that "it is [the list] still incomplete and always will be", his "big-data ecosystem table" [31] contains more than 600 references related to different big data technologies, frameworks, and applications and, to the best of this author's knowledge, is one of the best (and most exhaustive) lists related to available big data technologies.
this list comprises 33 different topics related to big data, and a selection of those technologies and applications was chosen. these topics are related to: distributed programming, distributed file systems, a document data model, a key-value data model, a graph data model, machine learning, applications, business intelligence, and data analysis. this selection attempts to reflect some of the recent popular frameworks and software implementations that are commonly used to develop efficient mapreduce-based systems and applications.
• apache pig. pig provides an engine for executing data flows in parallel on hadoop. it includes a language, pig latin, for expressing these data flows. pig latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.
• apache storm. storm is a complex event processor and distributed computation framework written basically in the clojure programming language [32]. it is a distributed real-time computation system for rapidly processing large streams of data. storm is an architecture based on a master-workers paradigm, so that a storm cluster mainly consists of master and worker nodes, with coordination done by zookeeper [33].
• stratosphere [34]. stratosphere is a general-purpose cluster computing framework. it is compatible with the hadoop ecosystem, accessing data stored in the hdfs and running with hadoop's new cluster manager, yarn. the common input formats of hadoop are supported as well. stratosphere does not use hadoop's mapreduce implementation; it is a completely new system that brings its own runtime. the new runtime allows for defining more advanced operations that include more transformations than only map and reduce. additionally, stratosphere allows for expressing analysis jobs using advanced data flow graphs, which are able to resemble common data analysis tasks more naturally.
• apache hdfs. the most widely used and popular distributed file system for mapreduce frameworks and applications is the hadoop distributed file system. the hdfs offers a way to store large files across multiple machines. hadoop and hdfs were derived from the google file system (gfs) [35].
• apache cassandra. cassandra is a recent open-source fork of a stand-alone distributed non-sql dbms system that was initially coded by facebook, derived from what was known of the original google bigtable [36] and google file system designs [35]. cassandra uses a system inspired by amazon's dynamo for storing data, and mapreduce can retrieve data from cassandra. cassandra can run without the hdfs or on top of it (the datastax fork of cassandra).
• apache giraph. giraph is an iterative graph processing system built for high scalability. it is currently used at facebook to analyse the social graph formed by users and their connections. giraph originated as the open-source counterpart to pregel [37], the graph processing framework developed at google (see section 3.1 for a further description).
• mongodb. mongodb is an open-source document-oriented database system and is part of the nosql family of database systems [38]. it provides high performance, high availability, and automatic scaling. instead of storing data in tables as is done in a classical relational database, mongodb stores structured data as json-like documents, which are data structures composed of fields and value pairs.
its index system supports faster queries and can include keys from embedded documents and arrays. moreover, this database allows users to distribute data across a cluster of machines.
• apache mahout [21]. the mahout machine learning (ml) library is an apache project whose main goal is to build scalable libraries that contain the implementation of a number of the conventional ml algorithms (dimensionality reduction, classification, clustering, and topic models, among others). in addition, this library includes implementations of a set of recommender systems (user-based and item-based strategies). the first versions of mahout implemented the algorithms on top of the hadoop framework, but recent versions include many new implementations built on the mahout-samsara environment, which runs on spark and h2o. the new spark-item similarity implementations enable the next generation of co-occurrence recommenders that can use entire user click streams and contexts in making recommendations.
• spark mllib [22]. mllib is spark's scalable machine learning library, which consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives. it supports writing applications in java, scala, or python and can run on any hadoop 2/yarn cluster with no preinstallation. the first version of mllib was developed at uc berkeley by 11 contributors, and it provided a limited set of standard machine learning methods. however, mllib is currently experiencing dramatic growth, and it has over 140 contributors from over 50 organisations.
• mlbase [23]. the mlbase platform consists of three layers: ml optimizer, mllib, and mli. ml optimizer (currently under development) aims to automate the task of ml pipeline construction. the optimizer solves a search problem over the feature extractors and ml algorithms included in mli and mllib. mli [24] is an experimental api for feature extraction and algorithm development that introduces high-level ml programming abstractions. a prototype of mli has been implemented against spark and serves as a test bed for mllib. finally, mllib is apache spark's distributed ml library. mllib was initially developed as part of the mlbase project, and the library is currently supported by the spark community.
• pentaho. pentaho is an open-source data integration (kettle) tool that delivers powerful extraction, transformation, and loading capabilities using a groundbreaking, metadata-driven approach. it also provides analytics, reporting, visualisation, and a predictive analytics framework that is directly designed to work with hadoop nodes. it provides data integration and analytic platforms based on hadoop in which datasets can be streamed, blended, and then automatically published into one of the popular analytic databases.
• sparkr. there is a significant number of r-based applications for mapreduce and other big data applications. r [39] is a popular and extremely powerful programming language for statistics and data analysis. sparkr provides an r frontend for spark. it allows users to interactively run jobs from the r shell on a cluster, automatically serializes the variables needed to execute a function on the cluster, and also allows for easy use of existing r packages.
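to tie the word-count example of fig. 2 to the frameworks listed above, the following minimal pyspark sketch implements the same count with spark's rdd abstraction. the local[*] master and the in-memory input (the three chunks from fig. 2) are assumptions made so that the sketch runs on a single machine without an hdfs installation.

```python
# a minimal pyspark sketch of the word-count example of fig. 2 using the rdd api;
# the local[*] master and the in-memory input are assumptions so that the sketch
# runs on a single machine without an hdfs installation.
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

# the three input chunks of the fig. 2 example
lines = sc.parallelize(["i thought i", "thought of thinking", "of thanking you"])

counts = (lines.flatMap(lambda line: line.split())   # map: emit one token per word
               .map(lambda word: (word, 1))          # map: emit (word, 1) key/value pairs
               .reduceByKey(lambda a, b: a + b)      # shuffle + reduce: sum counts per key
               .sortByKey())                         # produce a sorted final output

counts.cache()                                       # rdds can be kept in memory for reuse
print(counts.collect())                              # e.g., [('i', 2), ('of', 2), ...]
sc.stop()
```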
social big data analytics can be seen as the set of algorithms and methods used to extract relevant knowledge from social media data sources that provide heterogeneous contents of very large size that are constantly changing (streamed or online data). it is inherently interdisciplinary and spans areas such as data mining, machine learning, statistics, graph mining, information retrieval, and natural language processing, among others. this section provides a description of the basic methods and algorithms related to network analytics, community detection, text analysis, information diffusion, and information fusion, which are the areas currently used to analyse and process information from social-based sources.
today, society lives in a connected world in which communication networks are intertwined with daily life. for example, social networks are one of the most important sources of social big data; specifically, twitter generates over 400 million tweets every day [40]. in social networks, individuals interact with one another and provide information on their preferences and relationships, and these networks have become important tools for collective intelligence extraction. these connected networks can be represented using graphs, and network analytic methods [41] can be applied to them for extracting useful knowledge. graphs are structures formed by a set of vertices (also called nodes) and a set of edges, which are connections between pairs of vertices. the information extracted from a social network can be easily represented as a graph in which the vertices or nodes represent the users and the edges represent the relationships among them (e.g., a re-tweet of a message or a favourite mark in twitter). a number of network metrics can be used to perform social analysis of these networks. usually, the importance, or influence, in a social network is analysed through centrality measures. these measures have high computational complexity in large-scale networks. to solve this problem, and focusing on large-scale graph analysis, a second generation of frameworks based on the mapreduce paradigm has appeared, including hama, giraph (based on pregel), and graphlab, among others [42]. pregel [37] is a graph-parallel system based on the bulk synchronous parallel (bsp) model [43]. a bsp abstract computer can be interpreted as a set of processors that can follow different threads of computation, in which each processor is equipped with fast local memory and the processors are interconnected by a communication network. according to this, a platform based on this model comprises three major components:
• components capable of processing and/or local memory transactions (i.e., processors).
• a network that routes messages between pairs of these components.
• a hardware facility that allows for the synchronisation of all or a subset of components.
taking this model into account, a bsp algorithm is a sequence of global supersteps, each of which consists of three components:
1. concurrent computation: every participating processor may perform local asynchronous computations.
2. communication: the processes exchange data from one processor to another, facilitating remote data storage capabilities.
3. barrier synchronisation: when a process reaches this point (the barrier), it waits until all other processes have reached the same barrier.
hama [44] and giraph are two distributed graph processing frameworks on hadoop that implement pregel. the main difference between the two frameworks lies in how matrix computation is handled using the mapreduce paradigm.
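the following single-process python sketch mimics the vertex-centric, superstep-based style described above; it is not giraph or hama code. in each superstep, every active vertex consumes the messages received in the previous superstep, updates its value, and sends messages to its neighbours, and a barrier separates consecutive supersteps. the algorithm (propagating the minimum vertex id to label connected components) and the toy graph are illustrative choices.

```python
# a minimal single-process sketch of the vertex-centric bsp style (not giraph/hama code):
# per superstep, every active vertex consumes its inbox, updates its value, and sends
# messages to its neighbours; a barrier separates consecutive supersteps.
# the example propagates the minimum vertex id to label connected components.

graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}   # adjacency lists of a toy graph
value = {v: v for v in graph}                          # each vertex starts with its own id
inbox = {v: [v] for v in graph}                        # initial "message to self"

superstep = 0
while any(inbox[v] for v in graph):                    # stop when no messages are in flight
    outbox = {v: [] for v in graph}
    for v in graph:                                    # the "compute" step of each vertex
        if not inbox[v]:
            continue                                   # vertex is inactive this superstep
        new_value = min(value[v], min(inbox[v]))
        if new_value < value[v] or superstep == 0:
            value[v] = new_value
            for neighbour in graph[v]:                 # send the value along outgoing edges
                outbox[neighbour].append(new_value)
    inbox = outbox                                     # barrier: messages delivered together
    superstep += 1

print(value)   # e.g., {1: 1, 2: 1, 3: 1, 4: 4, 5: 4} -> two connected components
```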
apache giraph is an iterative graph processing system in which the input is a graph composed of vertices and directed edges. computation proceeds as a sequence of iterations (supersteps). initially, every vertex is active, and in each superstep every active vertex invokes the "compute" method that implements the graph algorithm to be executed. this means that the algorithms implemented using giraph are vertex-oriented. apache hama not only allows users to work with pregel-like graph applications; this computing engine can also be used to perform compute-intensive general scientific applications and machine learning algorithms. moreover, it currently supports yarn, which is the resource management technology that lets multiple computing frameworks run on the same hadoop cluster using the same underlying storage. therefore, the same data could be analysed using mapreduce or spark. in contrast, graphlab is based on a different concept. whereas pregel is a vertex-centric model, this framework uses a vertex-to-node mapping in which each vertex can access the state of adjacent vertices. in pregel, the interval between two supersteps is defined by the run time of the vertex with the largest neighbourhood. the graphlab approach improves on this by splitting vertices with large neighbourhoods across different machines and synchronising them. finally, elser and montresor [42] present a study of these data frameworks and their application to graph algorithms. the k-core decomposition algorithm is adapted to each framework. the goal of this algorithm is to compute the centrality of each node in a given graph. the results obtained confirm the improvement achieved in terms of execution time for these frameworks based on hadoop. however, from a programming paradigm point of view, the authors recommend pregel-inspired (vertex-centric) frameworks, which are a better fit for graph-related problems. the community detection problem in complex networks has been the subject of many studies in the field of data mining and social network analysis. the goal of the community detection problem is similar to the idea of graph partitioning in graph theory [45, 46]. a cluster in a graph can be easily mapped into a community. despite the ambiguity of the community definition, numerous techniques have been used for detecting communities. random walks, spectral clustering, modularity maximization, and statistical mechanics have all been applied to detecting communities [46]. these algorithms are typically based on the topology information from the graph or network. related to graph connectivity, each cluster should be connected; that is, there should be multiple paths that connect each pair of vertices within the cluster. it is generally accepted that a subset of vertices forms a good cluster if the induced sub-graph is dense and there are few connections from the included vertices to the rest of the graph [47]. considering both connectivity and density, a possible definition of a graph cluster could be a connected component or a maximal clique [48]; the latter is a sub-graph into which no vertex can be added without losing the clique property. one of the most well-known algorithms for community detection was proposed by girvan and newman [49]. this method uses a similarity measure called "edge betweenness", based on the number of shortest paths between all vertex pairs that pass through an edge.
the proposed algorithm is based on identifying the edges that lie between communities and successively removing them, thereby isolating the communities. the main disadvantage of this algorithm is its high computational complexity on very large networks. modularity is the most used and best-known quality measure for graph clustering techniques, but its maximization is an np-complete problem. however, there are currently a number of algorithms based on good approximations of modularity that are able to detect communities in a reasonable time. the first greedy technique to maximize modularity was a method proposed by newman [50]. this was an agglomerative hierarchical clustering algorithm in which groups of vertices were successively joined to form larger communities such that modularity increased after the merging. the update of the matrix in the newman algorithm involved a large number of useless operations owing to the sparseness of the adjacency matrix. however, the algorithm was improved by clauset et al. [51], who used the matrix of modularity variations to make the algorithm perform more efficiently. despite the improvements to and modifications of the accuracy of these greedy algorithms, they perform poorly when compared against other techniques. for this reason, newman reformulated the modularity measure in terms of eigenvectors by replacing the laplacian matrix with the modularity matrix [52], an approach called the spectral optimization of modularity. this improvement must also be applied in order to improve the results of other optimization techniques [53, 54]. random walks can also be useful for finding communities. if a graph has a strong community structure, a random walker spends a long time inside a community because of the high density of internal edges and the consequent number of paths that could be followed. zhou and lipowsky [55], based on the fact that walkers move preferentially towards vertices that share a large number of neighbours, defined a proximity index that indicates how close a pair of vertices is to all other vertices. communities are detected with a procedure called netwalk, which is an agglomerative hierarchical clustering method in which the similarity between vertices is expressed by their proximity. a number of these techniques are focused on finding disjoint communities. the network is partitioned into dense regions in which nodes have more connections to each other than to the rest of the network, but it is interesting that in some domains a vertex could belong to several clusters. for instance, it is well known that people in a social network form natural memberships in multiple communities. therefore, overlap is a significant feature of many real-world networks. to solve this problem, fuzzy clustering algorithms applied to graphs [56] and overlapping approaches [57] have been proposed. xie et al. [58] reviewed the state of the art in overlapping community detection algorithms. this work noted that for networks with low overlapping density, slpa, oslom, game, and copra offer better performance. for networks with high overlapping density and high overlapping diversity, both slpa and game provide relatively stable performance. however, the test results also suggested that detection in such networks is still not fully resolved.
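the following sketch illustrates the two families of non-overlapping methods discussed above, the girvan-newman edge-betweenness algorithm and the clauset-newman-moore greedy modularity maximisation, on a small benchmark graph. networkx and the karate-club example graph are our own illustrative choices; they are not the implementations evaluated in the cited studies.

```python
# a minimal sketch (illustrative choices, not the cited implementations) of two
# non-overlapping community detection methods discussed above, applied to the
# small zachary karate-club benchmark graph shipped with networkx.
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()

# girvan-newman: iteratively remove the edge with the highest edge betweenness;
# here we only take the first split into two communities.
first_split = next(community.girvan_newman(G))
print("girvan-newman split:", [sorted(c) for c in first_split])

# clauset-newman-moore: greedy agglomerative modularity maximisation.
greedy_partition = community.greedy_modularity_communities(G)
print("greedy modularity communities:", [sorted(c) for c in greedy_partition])

# modularity as the quality measure of the resulting partition.
print("modularity:", community.modularity(G, greedy_partition))
```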
a common feature observed by various algorithms in real-world networks is the relatively small fraction of overlapping nodes (typically less than 30%), each of which belongs to only 2 or 3 communities. a significant portion of the unstructured content collected from social media is text. text mining techniques can be applied for the automatic organization, navigation, retrieval, and summarisation of huge volumes of text documents [59][60][61]. this concept covers a number of topics and algorithms for text analysis including natural language processing (nlp), information retrieval, data mining, and machine learning [62]. information extraction techniques attempt to extract entities and their relationships from texts, allowing for the inference of new meaningful knowledge. these kinds of techniques are the starting point for a number of text mining algorithms. a usual model for representing the content of documents or text is the vector space model. in this model, each document is represented by a vector of the frequencies of the terms that remain (after preprocessing) within the document [60]. the term frequency (tf) of a word is the number of occurrences of that word in the document divided by the total number of words in the document. another function that is currently used is the inverse document frequency (idf); typically, documents are represented as tf-idf feature vectors. using this data representation, a document represents a data point in n-dimensional space, where n is the size of the corpus vocabulary. text data tend to be sparse and high dimensional. a text document corpus can be represented as a large sparse tf-idf matrix, and applying dimensionality reduction methods to represent the data in compressed format [63] can be very useful. latent semantic indexing [64] is an automatic indexing method that projects both documents and terms into a low-dimensional space that attempts to represent the semantic concepts in the document. this method is based on the singular value decomposition of the term-document matrix, which constructs a low-rank approximation of the original matrix while preserving the similarity between the documents. another family of dimension reduction techniques is based on probabilistic topic models such as latent dirichlet allocation (lda) [65]. this technique provides a mechanism for identifying patterns of term co-occurrence and using those patterns to identify coherent topics. standard implementations of the lda algorithm read the documents of the training corpus numerous times and in a serial way. however, new, efficient, parallel implementations of this algorithm have appeared [66] in attempts to improve its efficiency. unsupervised machine learning methods can be applied to any text data without the need for a previous manual labelling process. specifically, clustering techniques are widely studied in this domain to find hidden information or patterns in text datasets. these techniques can automatically organise a document corpus into clusters or similar groups based on a blind search in an unlabelled data collection, grouping the data with similar properties into clusters without human supervision. generally, document clustering methods can be mainly categorized into two types [67]: partitioning algorithms, which divide a document corpus into a given number of disjoint clusters that are optimal in terms of some predefined criterion functions [68], and hierarchical algorithms, which group the data points into a hierarchical tree structure or a dendrogram [69].
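a minimal sketch of the pipeline just described, under the assumption that the scikit-learn library (not cited in the text) is an acceptable stand-in: a toy corpus is turned into sparse tf-idf vectors, reduced with a truncated svd (the lsi step), and then grouped with a partitioning (k-means) clustering algorithm. the corpus and the number of clusters are purely illustrative.

```python
# a minimal sketch (scikit-learn is an illustrative choice, not cited in the text):
# tf-idf vectorisation, lsi-style dimensionality reduction via truncated svd,
# and partitioning (k-means) clustering of a toy document corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

documents = [                                        # purely illustrative corpus
    "big data frameworks process social media streams",
    "hadoop and spark process large data volumes",
    "community detection finds groups in social networks",
    "graph clustering detects communities in networks",
]

tfidf = TfidfVectorizer(stop_words="english")        # sparse, high-dimensional tf-idf matrix
X = tfidf.fit_transform(documents)

lsi = TruncatedSVD(n_components=2, random_state=0)   # svd-based projection (lsi step)
X_reduced = lsi.fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_reduced)
print("cluster labels:", kmeans.labels_)             # e.g., documents 0-1 vs. documents 2-3
```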
both types of clustering algorithms have strengths and weaknesses depending on the structure and characteristics of the dataset used. in zhao and karypis [70], a comparative assessment of different clustering algorithms (partitioning and hierarchical) was performed using different similarity measures on high-dimensional text data. the study showed that partitioning algorithms perform better and can also be used to produce hierarchies of higher quality than those returned by the hierarchical ones. in contrast, the classification problem is one of the main topics in the supervised machine learning literature. nearly all of the well-known techniques for classification, such as decision trees, association rules, bayes methods, nearest neighbour classifiers, svm classifiers, and neural networks, have been extended for automated text categorisation [71]. sentiment classification has been studied extensively in the area of opinion mining research, and this problem can be formulated as a classification problem with three classes: positive, negative and neutral. therefore, most of the existing techniques designed for this purpose are based on classifiers [72]. however, the emergence of social networks has created massive and continuous streams of text data. therefore, new challenges have arisen in adapting classic machine learning methods, because of the need to process these data in the context of a one-pass constraint [73]. this means that it is necessary to perform data mining tasks online and in a single pass as the data arrive. for example, the online spherical k-means algorithm [74] is a segment-wise approach that was proposed for streaming text clustering. this technique splits the incoming text stream into small segments that can be processed effectively in memory. then, a set of k-means iterations is applied to each segment in order to cluster them. moreover, a decay factor is included so that older, less important documents carry less weight during the clustering process. one of the most important roles of social media is to spread information across social links. with the large amount of data and the complex structures of social networks, it has become even more difficult to understand how (and why) information is spread through social reactions (e.g., retweets on twitter and likes on facebook). such understanding can be applied to various applications, e.g., viral marketing, popular topic detection, and virus prevention [75]. as a result, many studies have been proposed for modelling the information diffusion patterns on social networks. the characteristics of the diffusion models are (i) the topological structure of the network (a sub-graph composed of a set of users to whom the information has been spread) and (ii) temporal dynamics (the evolution of the number of users whom the information has reached over time) [76]. according to the analytics, these diffusion models can be categorized into explanatory and predictive approaches [77]. • explanatory models: the aim of these models is to discover the hidden spreading cascades once the activation sequences are collected. these models can build a path that can help users to easily understand how the information has been diffused. the netint method [78] applied submodular-function-based iterative optimisation to discover the spreading cascade (path) that maximises the likelihood of the collected dataset. in particular, for working with missing data, a k-tree model [79] has been proposed to estimate the complete activation sequences.
• predictive models: these are based on learning from the observed diffusion patterns. depending on the previous diffusion patterns, there are two main categories of predictive models: (i) structure-based models (graph-based approaches) and (ii) content-analysis-based models (non-graph-based approaches). moreover, there are further approaches to understanding information diffusion patterns. the projected greedy approach for non-submodular problems [80] was recently proposed to seed influential nodes in social networks. this approach can identify partial optimisations for understanding information diffusion. additionally, an evolutionary dynamics model was presented in [81, 82] that attempted to understand the temporal dynamics of information diffusion over time. one of the relevant topics for analysing information diffusion patterns and models is the concept of time and how it can be represented and managed. one of the popular approaches is based on time series. any time series can be defined as a chronological collection of observations or events. the main characteristics of this type of data are large size, high dimensionality, and continuous change. in the context of data mining, the main problem is how to represent the data. an effective mechanism for compressing the vast amount of time series data is needed in the context of information diffusion. based on this representation, different data mining techniques can be applied, such as pattern discovery, classification, rule discovery, and summarisation [83]. in lin et al. [84], a new symbolic representation of time series is proposed that allows for dimensionality/numerosity reduction. this representation is tested using different classic data mining tasks such as clustering, classification, query by content, and anomaly detection. based on the mathematical models mentioned above, we can now examine a number of applications that support users in many different domains. one of the most promising applications is detecting meaningful social events and popular topics in society. such meaningful events and topics can be discovered by well-known text processing schemes (e.g., tf-idf) and simple statistical approaches (e.g., lda, gibbs sampling, and the tste method [85]). in particular, not only the time domain but also the frequency domain have been exploited to identify the most frequent events [86]. social big data from various sources needs to be fused to provide users with better services. this fusion can be done in different ways and affects different technologies, methods, and even research areas. two such areas are ontologies and social networks; how these areas could benefit from information fusion in social big data is briefly described next: • ontology-based fusion. semantic heterogeneity is an important issue in information fusion. social networks have inherently different semantics from other types of network. such semantic heterogeneity includes not only linguistic differences (e.g., between 'reference' and 'bibliography') but also mismatches between conceptual structures. to deal with these problems, [87] exploits ontologies from multiple social networks and, more importantly, semantic correspondences obtained by ontology matching methods. more practically, semantic mashup applications have been illustrated.
to remedy the data integration issues of traditional web mashups, semantic technologies use linked open data (lod), based on the rdf data model, as the unified data model for combining, aggregating, and transforming data from heterogeneous data resources to build linked data mashups [88]. • social network integration. the next issue is how to integrate distributed social networks. as many kinds of social networking services have been developed, users are joining multiple services for social interactions with other users and collecting a large amount of information (e.g., statuses on facebook and tweets on twitter). an interesting framework has been proposed for a social identity matching (sim) method across these multiple sns [89]. notably, the proposed approach can protect user privacy, because only public information (e.g., usernames and the social relationships of the users) is employed to find the best matches between social identities. in particular, a cloud-based platform has been applied to build a software infrastructure where social network information can be shared and exchanged [90]. social big data analysis can be applied to social media data sources to discover relevant knowledge that can be used to improve the decision making of individual users and companies [18]. in this context, business intelligence can be defined as those techniques, systems, methodologies, and applications that analyse critical business data to help an enterprise better understand its business and market and to support business decisions [91]. this field includes methodologies that can be applied to different areas such as e-commerce, marketing, security, and healthcare [18]; more recent methodologies have been applied to treat social big data. this section provides short descriptions of some applications of these methodologies in domains that intensively use social big data sources for business intelligence. marketing researchers believe that big social media analytics and cloud computing offer a unique opportunity for businesses to obtain opinions from a vast number of customers, improving traditional strategies. a significant market transformation has been accomplished by leading e-commerce enterprises such as amazon and ebay through their innovative and highly scalable e-commerce platforms and recommender systems. social network analysis extracts user intelligence and can provide firms with the opportunity to generate more targeted advertising and marketing campaigns. maurer and wiegmann [92] show an analysis of advertising effectiveness on social networks. in particular, they carried out a case study using facebook to determine users' perceptions regarding facebook ads. the authors found that most of the participants perceived the ads on facebook as annoying or not helpful for their purchase decisions. however, trattner and kappe [93] show how ads placed on users' social streams that have been generated by facebook tools and applications can increase the number of visitors and the profit and roi of a web-based platform. in addition, the authors present an analysis of real-time measures to detect the most valuable users on facebook. a study of microblogging (twitter) utilization as an ewom (electronic word-of-mouth) advertising mechanism is carried out in jansen et al. [94]. this work analyses the range, frequency, timing, and content of tweets in various corporate accounts.
the results obtained show that 19% of microblogs mention a brand. of the branding microblogs, nearly 20% contained some expression of brand sentiments. therefore, the authors conclude that microblogging reports what customers really feel about the brand and its competitors in real time, and it is a potential advantage to explore it as part of companies' overall marketing strategies. customers' brand perceptions and purchasing decisions are increasingly influenced by social media services, and these offer new opportunities to build brand relationships with potential customers. another approach that uses twitter data is presented in asur et al. [95] to forecast box-office revenues for movies. the authors show how a simple model built from the rate at which tweets are created about particular topics can outperform market-based predictors. moreover, sentiment extraction from twitter is used to improve the forecasting power of social media. because of the exponential growth in the use of social networks, researchers are actively attempting to model the dynamics of viral marketing based on the information diffusion process. ma et al. [96] proposed modelling social network marketing using heat diffusion processes. heat diffusion is a physical phenomenon in which heat always flows from a position with higher temperature to a position with lower temperature. the authors present three diffusion models along with three algorithms for selecting the best individuals to receive marketing samples. these models can diffuse both positive and negative comments on products or brands in order to simulate the real opinions within social networks. moreover, the authors' complexity analysis shows that the model is also scalable to large social networks. table 1 shows a brief summary of the previously described applications, including the basic functionalities for each and their basic methods.
table 1. basic features related to social big data applications in the marketing area:
• trattner and kappe [93]: targeted advertising on facebook; methods: real-time measures to detect the most valuable users
• jansen et al. [94]: twitter as an ewom advertising mechanism; methods: sentiment analysis
• asur et al. [95]: using twitter to forecast box-office revenues for movies; methods: topic detection, sentiment analysis
• ma et al. [96]: viral marketing in social networks; methods: social network analysis, information diffusion models
criminals tend to have repetitive pattern behaviours, and these behaviours are dependent upon situational factors. that is, crime will be concentrated in environments with features that facilitate criminal activities [97]. the purpose of crime data analysis is to identify these crime patterns, allowing for detecting and discovering crimes and their relationships with criminals. the knowledge extracted from applying data mining techniques can be very useful in supporting law enforcement agencies. communication between citizens and government agencies is mostly through telephones, face-to-face meetings, email, and other digital forms. most of these communications are saved or transformed into written text and then archived in a digital format, which has led to opportunities for automatic text analysis using nlp techniques to improve the effectiveness of law enforcement [98]. a decision support system that combines the use of nlp techniques, similarity measures, and classification approaches is proposed by ku and leroy [99] to automate and facilitate crime analysis.
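to make the similarity-based matching used by such decision support systems concrete, the following hedged sketch represents free-text reports as tf-idf vectors and compares them with cosine similarity; the report texts and the scikit-learn dependency are illustrative assumptions, not a description of the system of ku and leroy [99].

```python
# a hedged sketch of matching similar crime reports by text similarity;
# the reports are invented examples and scikit-learn is assumed available.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reports = [
    "burglary at night, rear window forced, electronics stolen",
    "rear window broken during night burglary, laptop taken",
    "vehicle theft from a parking lot in the afternoon",
]

vectors = TfidfVectorizer().fit_transform(reports)
sim = cosine_similarity(vectors)

# pairs with high similarity could be grouped as the same or similar crimes
for i in range(len(reports)):
    for j in range(i + 1, len(reports)):
        print(f"similarity(report {i}, report {j}) = {sim[i, j]:.2f}")
```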
filtering reports and identifying those that are related to the same or similar crimes can provide useful information to analyse crime trends, which allows for apprehending suspects and improving crime prevention. traditional crime data analysis techniques are typically designed to handle one particular type of dataset and often overlook geospatial distribution. geographic knowledge discovery can be used to discover patterns of criminal behaviour that may help in detecting where, when, and why particular crimes are likely to occur. based on this concept, phillips and lee [100] present a crime data analysis technique that allows for discovering co-distribution patterns between large, aggregated and heterogeneous datasets. in this approach, aggregated datasets are modelled as graphs that store the geospatial distribution of crime within given regions, and then these graphs are used to discover datasets that show similar geospatial distribution characteristics. the experimental results obtained in this work show that it is possible to discover geospatial co-distribution relationships among crime incidents and socio-economic, socio-demographic and spatial features. another analytical technique that is now in high use by law enforcement agencies to visually identify where crime tends to be highest is hotspot mapping. this technique is used to predict where crime may happen, using data from the past to inform future actions. each crime event is represented as a point, allowing for the analysis of the geographic distribution of these points. a number of mapping techniques can be used to identify crime hotspots, such as point mapping, thematic mapping of geographic areas, spatial ellipses, grid thematic mapping, and kernel density estimation (kde), among others. chainey et al. [101] conducted a comparative assessment of these techniques, and the results obtained showed that kde was the technique that consistently outperformed the others. moreover, the authors offered a benchmark to compare with the results of other techniques and other crime types, including comparisons between advanced spatial analysis techniques and prediction mapping methods. another novel approach, using spatio-temporally tagged tweets for crime prediction, is presented by gerber [102]. this work shows the use of twitter, applying linguistic analysis and statistical topic modelling to automatically identify discussion topics across a city in the united states. the experimental results showed that adding twitter data improved crime prediction performance versus a standard approach based on kde. finally, the use of data mining in fraud detection is very popular, and there are numerous studies in this area. atm phone scams are one well-known type of fraud. kirkos et al. [103] analysed the effectiveness of data mining classification techniques (decision trees, neural networks and bayesian belief networks) for identifying fraudulent financial statements, and the experimental results concluded that bayesian belief networks provided higher accuracy for fraud classification. another approach to detecting fraud in real-time credit card transactions was presented by quah and sriganesh [104]. the system these authors proposed uses a self-organizing map to filter and analyse customer behaviour to detect fraud. the main idea is to detect the patterns of the legitimate cardholder and of fraudulent transactions through neural network learning and then to develop rules for these two different behaviours.
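the behaviour-profiling idea behind such systems can be illustrated with a toy sketch that replaces the self-organizing map with simple per-cardholder statistics; the transaction history, the new transactions, and the 3-sigma rule are illustrative assumptions only.

```python
# a toy stand-in for cardholder behaviour profiling: amounts far from a
# card's historical mean are flagged; real systems use richer models such
# as the self-organizing map mentioned above.
import statistics

history = {  # cardholder id -> past transaction amounts
    "card-001": [12.5, 30.0, 25.4, 18.9, 22.0, 27.3],
    "card-002": [250.0, 310.5, 280.0, 295.9, 305.2],
}
new_transactions = [("card-001", 24.0), ("card-001", 480.0), ("card-002", 290.0)]

for card, amount in new_transactions:
    past = history[card]
    mu, sigma = statistics.mean(past), statistics.stdev(past)
    status = "suspicious" if abs(amount - mu) > 3 * sigma else "normal"
    print(f"{card}: amount {amount:.2f} looks {status}")
```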
one typical fraud in this area is the atm phone scam, which attempts to transfer a victim's money into fraudulent accounts. in order to identify the signs of fraudulent accounts and the patterns of fraudulent transactions, li et al. [105] applied bayesian classification and association rules. detection rules are developed based on the identified signs and applied to the design of a fraudulent account detection system. table 2 shows a brief summary of all of the applications that were previously mentioned, providing a description of the basic functionalities of each and their main methods.
table 2. basic features related to social big data applications in the crime data analysis area:
• ku and leroy [99]: automated crime report analysis and classification; methods: nlp techniques, similarity measures, classification
• phillips and lee [100]: technique to discover geospatial co-distribution relations among crime incidents; methods: network analysis
• chainey et al. [101]: comparative assessment of mapping techniques to predict where crimes may happen; methods: spatial analysis, mapping methods
• gerber [102]: identify discussion topics across a city in the united states to predict crimes; methods: linguistic analysis, statistical topic modelling
• kirkos et al. [103]: identification of fraudulent financial statements; methods: classification (decision trees, neural networks and bayesian belief networks)
• quah and sriganesh [104]: fraud detection in real-time credit card transactions; methods: neural network learning, association rules
• li et al. [105]: identify the signs of fraudulent accounts and the patterns of fraudulent transactions; methods: bayesian classification, association rules
epidemic intelligence can be defined as the early identification, assessment, and verification of potential public health risks [106] and the timely dissemination of the appropriate alerts. this discipline includes surveillance techniques for the automated and continuous analysis of unstructured free text or media information available on the web from social networks, blogs, digital news media, and official sources. text mining techniques have been applied to biomedical text corpora for named entity recognition, text classification, terminology extraction, and relationship extraction [107]. these methods are human language processing algorithms that aim to convert unstructured textual data from large-scale collections to a specific format, filtering them according to need. they can be used to detect words related to diseases or their symptoms in published texts [108]. however, this can be difficult because the same word can refer to different things depending upon context. furthermore, a specific disease can have multiple associated names and symptoms, which increases the complexity of the problem. ontologies can help to automate human understanding of key concepts and the relationships between them, and they allow for achieving a certain level of filtering accuracy. in the health domain, it is necessary to identify and link term classes such as diseases, symptoms, and species in order to detect the potential focus of disease outbreaks. currently, there are a number of available biomedical ontologies that contain all of the necessary terms. for example, the biocaster ontology [109] is based on the owl semantic web language, and it was designed to support automated reasoning across terms in 12 languages. the increasing popularity and use of microblogging services such as twitter have recently made them a valuable data source for web-based surveillance because of their message volume and frequency. twitter users may post about an illness, and their relationships in the network can give us information about whom they could be in contact with.
furthermore, user posts retrieved from the public twitter api can come with gps-based location tags, which can be used to locate the potential centre of disease outbreaks. a number of works have already appeared that show the potential of twitter messages to track and predict outbreaks. a document classifier to identify relevant messages was presented in culotta [110]. in this work, twitter messages related to the flu were gathered, and then a number of classification systems based on different regression models to correlate these messages with cdc statistics were compared; the study found that the best model had a correlation of 0.78 (simple regression model). aramaki [111] presented a comparative study of various machine-learning methods to classify tweets related to influenza into two categories: positive and negative. their experimental results showed that the svm model that used polynomial kernels achieved the highest accuracy (f-measure of 0.756) and the lowest training time. well-known regression models were evaluated on their ability to assess disease outbreaks from tweets in bodnar and salathé [112]. regression methods such as linear, multivariable, and svm were applied to the raw count of tweets that contained at least one of the keywords related to a specific disease, in this case "flu". the models also validated that even using irrelevant tweets and randomly generated datasets, regression methods were able to assess disease levels comparatively well. a new unsupervised machine learning approach to detect public health events was proposed in fisichella et al. [113] that can complement existing systems because it allows for identifying public health events even if no matching keywords or linguistic patterns can be found. this new approach defined a generative model for predictive event detection from documents by modelling the features based on trajectory distributions. in recent years, a number of surveillance systems have appeared that apply these social mining techniques and that have been widely used by public health organizations such as the world health organization (who) and the european centre for disease prevention and control [114]. tracking and monitoring mechanisms for early detection are critical in reducing the impact of epidemics through rapid responses. one of the earliest surveillance systems is the global public health intelligence network (gphin) [115], developed by the public health agency of canada in collaboration with the who. it is a secure, web-based, multilingual warning tool that continuously monitors and analyses global media data sources to identify information about disease outbreaks and other events related to public healthcare. the information is filtered for relevance by an automated process and is then analysed by public health agency of canada gphin officials. from 2002 to 2003, this surveillance system was able to detect the outbreak of sars (severe acute respiratory syndrome). the biocaster system [116] for monitoring online media data arose from the biocaster ontology in 2006. the system continuously analyses documents reported from over 1700 rss feeds, google news, who, promed-mail, and the european media monitor, among other data sources. the extracted text is classified based on its topical relevance and plotted onto a google map using geo-information. the system has four main stages: topic classification, named entity recognition, disease/location detection, and event recognition.
in the first stage, the texts are classified as relevant or non-relevant using a naive bayes classifier. then, within the relevant document corpora, entities of interest from 18 concept types based on the ontology (related to diseases, viruses, bacteria, locations, and symptoms) are searched for. the healthmap project [117] is a global disease alert map that uses data from different sources such as google news, expert-curated discussions such as promed-mail, and official organization reports such as those from the who or euro surveillance. it is an automated real-time system that monitors, organises, integrates, filters, visualises, and disseminates online information about emerging diseases. another system that collects news from the web related to human and animal health and that plots the data on google maps is epispider [118]. this tool automatically extracts information on infectious disease outbreaks from multiple sources including promed-mail and medical web sites, and it is used as a surveillance system by public healthcare organizations, a number of universities, and health research organizations. additionally, this system automatically converts the topic and location information from the reports into rss feeds. finally, lyon et al. [119] conducted a comparative assessment of these three systems (biocaster, epispider, and healthmap) related to their ability to gather and analyse information that is relevant to public health. epispider obtained more relevant documents in this study. however, depending on the language of each system, the ability to acquire relevant information from different countries differed significantly. for instance, biocaster gives special priority to languages from the asia-pacific region, and epispider only considers documents written in english. table 3 shows a summary of the previous applications and their related functionalities and methods.
table 3. basic features related to social big data applications in the health care area:
• culotta [110]: track and predict outbreaks using twitter; methods: classification (regression models)
• aramaki et al. [111]: classify tweets related to influenza; methods: classification
• bodnar and salathé [112]: assess disease outbreaks from tweets; methods: regression methods
• fisichella et al. [113]: detect public health events; methods: modelling trajectory distributions
• gphin [115]: identify information about disease outbreaks and other events related to public healthcare; methods: classification of documents for relevance
• biocaster [116]: monitoring online media data related to diseases, viruses, bacteria, locations and symptoms; methods: topic classification, named entity recognition, event recognition
• healthmap [117]: global disease alert map; methods: mapping techniques
• epispider [118]: human and animal disease alert map; methods: topic and location detection
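as a concrete illustration of the relevance-filtering stage described above for biocaster-like systems, the following hedged sketch trains a multinomial naive bayes classifier to separate outbreak-related texts from irrelevant ones; the tiny labelled corpus and the scikit-learn pipeline are illustrative assumptions, and a real system would be trained on curated, multilingual reports.

```python
# a hedged sketch of a naive bayes relevance filter for health-related
# texts; the training texts and labels are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "new cluster of influenza cases reported in the region",
    "hospital confirms outbreak of respiratory illness",
    "local team wins the championship final",
    "city council approves new park budget",
]
train_labels = ["relevant", "relevant", "non-relevant", "non-relevant"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

print(clf.predict(["several influenza cases confirmed at the hospital"]))
```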
big data from social media needs to be visualised for better user experiences and services. for example, the large volume of numerical data (usually in tabular form) can be transformed into different formats; consequently, user understandability can be increased. the capability of supporting timely decisions based on visualising such big data is essential to various domains, e.g., business success, clinical treatments, cyber and national security, and disaster management [120]. thus, user-experience-based visualisation has been regarded as important for supporting decision makers in making better decisions. more particularly, visualisation is also regarded as a crucial data analytic tool for social media [121]. it is important for understanding users' needs in social networking services. there have been many visualisation approaches to collecting (and improving) user experiences. one of the most well-known is interactive data analytics. based on a set of features of the given big data, users can interact with the visualisation-based analytics system. examples of such systems are r-based software packages [122] and ggobi [123]. moreover, some systems have been developed using statistical inference. a bayesian inference scheme-based multi-input/multi-output (mimo) system [124] has been developed for better visualisation. we can also consider life-logging services that record all user experiences [125], which is also known as the quantified self. various sensors can capture continuous physiological data (e.g., mood, arousal, and blood oxygen levels) together with user activities. in this context, life caching has been presented as a collaborative social action of storing and sharing users' life events in an open environment. more practically, this collaborative user experience has been applied to gaming to encourage users. systems such as insense [126] are based on wearable devices and can collect users' experiences into a continually growing and adapting multimedia diary. the insense system uses the patterns in sensor readings from a camera, a microphone, and accelerometers to classify the user's activities and automatically collect multimedia clips when the user is in an interesting situation. moreover, visualisation systems such as many eyes [127] have been designed to upload datasets and create visualisations in collaborative environments, allowing users to upload data, create visualisations of those data, and leave comments on both the visualisations and the data, providing a medium to foster discussion among users. many eyes is designed for ordinary people and does not require any extensive training or prior knowledge to take full advantage of its functionalities. other visual analytics tools provide graphical visualisations for supporting efficient analytics of the given big data. particularly, tweetpulse [128] builds social pulses by aggregating identical user experiences in social networks (e.g., twitter) and visualises the temporal dynamics of thematic events. finally, table 4 provides a summary of those applications related to the methods used for visualisation based on user experiences.
table 4. visualisation applications based on user experiences:
• insense [126]: collecting user experiences into a continually growing and adapting multimedia diary; methods: classification of patterns in sensor readings from a camera, microphone, and accelerometers
• many eyes [127]: creating visualisations in a collaborative environment from uploaded datasets; methods: visualisation layout algorithms
• tweetpulse [128]: building social pulses by aggregating identical user experiences; methods: visualising temporal dynamics of the thematic events
with the large number and rapid growth of social media systems and applications, social big data has become an important topic in a broad array of research areas. the aim of this study has been to provide a holistic view and insights for potentially helping to find the most relevant solutions that are currently available for managing knowledge in social media. as such, we have investigated the state-of-the-art technologies and applications for processing the big data from social media.
these technologies and applications were discussed in the following aspects: (i) what are the main methodologies and technologies that are available for gathering, storing, processing, and analysing big data from social media? (ii) how does one analyse social big data to discover meaningful patterns? and (iii) how can these patterns be exploited as smart, useful user services through the currently deployed examples in social-based applications? more practically, this survey paper shows and describes a number of existing systems (e.g., frameworks, libraries, software applications) that have been developed and that are currently being used in various domains and applications based on social media. the paper has avoided describing or analysing those straightforward applications such as facebook and twitter that currently intensively use big data technologies, instead focusing on other applications (such as those related to marketing, crime analysis, or epidemic intelligence) that could be of interest to potential readers. although it is extremely difficult to predict which of the different issues studied in this work will be the next "trending topic" in social big data research, from among all of the problems and topics that are currently under study in different areas, we selected some "open topics" related to privacy issues, streaming and online algorithms, and data fusion and visualisation, providing some insights and possible future trends. in the era of online big data and social media, protecting the privacy of the users on social media has been regarded as an important issue. ironically, as the analytics introduced in this paper become more advanced, the risk of privacy leakage is growing. as such, many privacy-preserving studies have been proposed to address privacy-related issues. we can note that there are two main well-known approaches. the first one is to exploit "k-anonymity", which is a property possessed by certain anonymised data [129]. given the private data and a set of specific fields, the system (or service) has to make the data practically useful without identifying the individuals who are the subjects of the data. the second approach is "differential privacy", which can provide an efficient way to maximise the accuracy of queries from statistical databases while minimising the chances of identifying individual records [130]. however, there are still open issues related to privacy. social identification is an important issue when social data are merged from the available sources, and secure data communication and graph matching are potential research areas [89]. the second issue is evaluation. it is not easy to evaluate and test privacy-preserving services with real data. therefore, it would be particularly interesting in the future to consider how to build useful benchmark datasets for evaluation. moreover, we have to consider these data privacy issues in many other research areas. in the context of (international) law enforcement, data privacy must be protected from any illegal usage, whereas governments tend to override user privacy for the purpose of national security. also, developing educational programs for technicians (and students) is important [131]. how (and what) to design the curriculum for data privacy is still an open issue.
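the differential privacy approach mentioned above can be illustrated with a minimal sketch of the laplace mechanism, in which noise calibrated to the query sensitivity and a privacy budget epsilon is added to a count query; the data, the sensitivity of 1, and epsilon = 0.5 are illustrative assumptions.

```python
# a minimal sketch of the laplace mechanism for a differentially private
# count query; numpy is assumed available.
import numpy as np

def private_count(true_count, sensitivity=1.0, epsilon=0.5):
    # laplace noise with scale = sensitivity / epsilon
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

ages = [23, 35, 41, 29, 52, 38, 47]
true_count = sum(1 for a in ages if a > 40)  # how many users are over 40?

for _ in range(3):
    print(f"noisy answer: {private_count(true_count):.2f} (true value: {true_count})")
```

smaller values of epsilon give stronger privacy but noisier answers, which reflects the accuracy trade-off mentioned above.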
one of the current main challenges in data mining related to big data problems is to find adequate approaches to analysing massive amounts of online data (or data streams). because classification methods require previous labelling, these methods also require great effort for real-time analysis. however, because unsupervised techniques do not need this previous process, clustering has become a promising field for real-time analysis, especially when these data come from social media sources. when data streams are analysed, it is important to consider the analysis goal in order to determine the best type of algorithm to be used. we were able to divide data stream analysis into two main categories: • offline analysis: we consider a portion of data (usually a large amount) and apply an offline clustering algorithm to analyse the data. • online analysis: the data are analysed in real time. these kinds of algorithms are constantly receiving new data and are not usually able to keep past information. a new generation of online [132, 133] and streaming [134, 135] algorithms is currently being developed in order to manage social big data challenges, and these algorithms require high scalability in both memory consumption [136] and computation time. new developments related to traditional clustering algorithms, such as k-means [137] and em [138], which have been modified to work with the mapreduce paradigm, and more sophisticated approaches based on graph computing (such as spectral clustering) [139] [140] [141], are currently being developed into more efficient versions of the state-of-the-art algorithms [142, 143]. finally, data fusion and data visualisation are two clear challenges in social big data. although both areas have been intensively studied with regard to large, distributed, heterogeneous, and streaming data fusion [144] and data visualisation and analysis [145], the current, rapid evolution of social media sources jointly with big data technologies creates some particularly interesting challenges related to: • obtaining more reliable methods for fusing the multiple features of multimedia objects for social media applications [146]. • studying the dynamics of individual and group behaviour, characterising patterns of information diffusion, and identifying influential individuals in social networks and other social media-based applications [147]. • identifying events [148] in social media documents via clustering and using similarity metric learning approaches to produce high-quality clustering results [149]. • the open problems and challenges related to visual analytics [145], especially related to the capacity to collect and store new data, are rapidly increasing in number, including the ability to analyse these data volumes [150], to record data about the movement of people and objects at a large scale [151], and to analyse spatio-temporal data and solve spatio-temporal problems in social media [152], among others.
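as a concrete illustration of the one-pass constraint behind the online and streaming clustering algorithms surveyed above, the following hedged sketch performs a single-pass, incremental k-means style update over a simulated stream; the synthetic data, k = 2, and the seeding strategy are illustrative assumptions and do not correspond to any specific published algorithm.

```python
# a hedged sketch of online (single-pass) k-means style clustering over a
# data stream; the stream is synthetic and the update rule is a simple
# incremental mean, not a specific published method.
import numpy as np

rng = np.random.default_rng(0)
stream = np.concatenate([rng.normal(0.0, 0.5, (200, 2)),
                         rng.normal(5.0, 0.5, (200, 2))])
rng.shuffle(stream)

k = 2
centroids = stream[:k].copy()   # seed with the first points seen
counts = np.ones(k)

for x in stream:                # each point is processed exactly once
    j = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
    counts[j] += 1
    centroids[j] += (x - centroids[j]) / counts[j]   # incremental mean

print("estimated centroids:\n", centroids)
```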
references
ibm, big data and analytics
the data explosion in 2014 minute by minute
data mining with big data
analytics over large-scale multidimensional data: the big data revolution!
ad949-3d-data-management-controlling-data-volume-velocity-and-variety.pdf
the importance of 'big data': a definition, gartner, stamford
the rise of big data on cloud computing: review and open research issues
an overview of the open science data cloud
big data: survey, technologies, opportunities, and challenges
media, society, world: social theory and digital media practice
who interacts on the web?: the intersection of users' personality and social media use
users of the world, unite! the challenges and opportunities of social media
the role of social media in higher education classes (real and virtual)-a literature review
the dynamics of health behavior sentiments on a large online social network
big social data analysis, big data comput
trending: the promises and the challenges of big social data
big data: issues and challenges moving forward
business intelligence and analytics: from big data to big impact
hadoop: the definitive
spark: cluster computing with working sets
mllib: machine learning in apache spark
mlbase: a distributed machine-learning system
mli: an api for distributed machine learning
mapreduce: simplified data processing on large clusters
mapreduce: simplified data processing on large clusters
mapreduce algorithms for big data analysis
improving mapreduce performance in heterogeneous environments
proceedings of the 2013 acm sigmod international conference on management of data, sigmod '13
useful stuff
the big-data ecosystem table
clojure programming, o'really
the chubby lock service for loosely-coupled distributed systems
the stratosphere platform for big data analytics
the google file system
bigtable: a distributed storage system for structured data
pregel: a system for large-scale graph processing
mongodb: the definitive guide
the r book, 1st
twitter now seeing 400 million tweets per day, increased mobile ad revenue, says ceo
an introduction to statistical methods and data analysis
an evaluation study of bigdata frameworks for graph processing
a bridging model for parallel computation
hama: an efficient matrix computation with the mapreduce framework
finding local community structure in networks
community detection in graphs
on clusterings-good, bad and spectral
the maximum clique problem, in: handbook of combinatorial optimization
community structure in social and biological networks
fast algorithm for detecting community structure in networks
finding community structure in very large networks
modularity and community structure in networks
spectral tri partitioning of networks
a vector partitioning approach to detecting community structure in complex networks
network brownian motion: a new method to measure vertex-vertex proximity and to identify communities and subcommunities
a hierarchical clustering algorithm based on fuzzy graph connectedness
adaptive k-means algorithm for overlapped graph clustering
overlapping community detection in networks: the state-of-the-art and comparative study
web document clustering: a feasibility demonstration
information retrieval: data structures & algorithms
introduction to information retrieval
text analytics in social media
principal component analysis
indexing by latent semantic analysis
latent dirichlet allocation
proceedings of the 15th acm sigkdd international conference on knowledge discovery and data mining
data clustering: a review
fast and effective text mining using linear-time document clustering
evaluation of hierarchical clustering algorithms for document datasets
empirical and theoretical comparisons of selected criterion functions for document clustering
machine learning in automated text categorization
thumbs up?: sentiment classification using machine learning techniques
data streams: models and algorithms
efficient online spherical k-means clustering
scalable influence maximization for prevalent viral marketing in large-scale social networks
real-time event detection on social data stream
information diffusion in online social networks: a survey
inferring networks of diffusion and influence
correcting for missing data in information cascades
seeding influential nodes in nonsubmodular models of information diffusion
graphical evolutionary game for information diffusion over social networks
evolutionary dynamics of information diffusion over social networks
a review on time series data mining
a symbolic representation of time series, with implications for streaming algorithms
emerging topic detection on twitter based on temporal and social terms evaluation
privacy-preserving discovery of topic-based events from social sensor signals: an experimental study on twitter
integrating social networks for context fusion in mobile service platforms
semantic information integration with linked data mashups approaches
privacy-aware framework for matching online social identities in multiple social networking services
a social compute cloud: allocating and sharing infrastructure resources via social networks
competing on analytics: the new science of winning
effectiveness of advertising on social network sites: a case study on facebook
social stream marketing on facebook: a case study
twitter power: tweets as electronic word of mouth
predicting the future with social media
mining social networks using heat diffusion processes for marketing candidates selection
opportunities for improving egovernment: using language technology in workflow management
a decision support system: automated crime report analysis and classification for e-government
mining co-distribution patterns for large crime datasets
the utility of hotspot mapping for predicting spatial patterns of crime
predicting crime using twitter and kernel density estimation
data mining techniques for the detection of fraudulent financial statements
real-time credit card fraud detection using computational intelligence
identifying the signs of fraudulent accounts using data mining techniques
epidemic intelligence: a new framework for strengthening disease surveillance in europe
a survey of current work in biomedical text mining
nowcasting events from the social web with statistical learning
an ontology-driven system for detecting global health events
towards detecting influenza epidemics by analyzing twitter messages
twitter catches the flu: detecting influenza epidemics using twitter
validating models for disease detection using twitter
detecting health events on the social web to enable epidemic intelligence
the landscape of international event-based biosurveillance
the global public health intelligence network and early warning outbreak detection
biocaster: detecting public health rumors with a web-based text mining system
surveillance sans frontieres: internet-based emerging infectious disease intelligence and the healthmap project
use of unstructured event-based reports for global infectious disease surveillance
comparison of web-based biosecurity intelligence systems: biocaster, epispider and healthmap
big-data visualization
visualization of entities within social media: toward understanding users' needs
parallelmcmccombine: an r package for bayesian methods for big data and analytics
ggobi: evolving from xgobi into an extensible framework for interactive data visualization
a visualization framework for real time decision making in a multi-input multi-output system
insense: interest-based life logging
manyeyes: a site for visualization at internet scale
social data visualization system for understanding diffusion patterns on twitter: a case study on korean enterprises
k-anonymity: a model for protecting privacy
proceedings of 5th international conference on theory and applications of models of computation
educating engineers: teaching privacy in a world of open doors
online algorithms: the state of the art
ultraconservative online algorithms for multiclass problems
better streaming algorithms for clustering problems
a survey on algorithms for mining frequent itemsets over data streams
a multi-objective genetic graph-based clustering algorithm with memory optimization
he, parallel k-means clustering based on mapreduce
map-reduce for machine learning on multicore
parallel spectral clustering in distributed systems
a co-evolutionary multi-objective approach for a k-adaptive graph-based clustering algorithm
gany: a genetic spectral-based clustering algorithm for large data analysis
on spectral clustering: analysis and an algorithm
learning spectral clustering, with application to speech separation
dfuse: a framework for distributed data fusion
visual analytics: definition, process, and challenges
multiple feature fusion for social media applications
the role of social networks in information diffusion
event identification in social media
learning similarity metrics for event identification in social media
visual analytics
visual analytics tools for analysis of movement data
space, time and visual analytics
this work has been supported by several research grants: co
key: cord-206872-t6lr3g1m title: a survey of state-of-the-art on blockchains: theories, modelings, and tools date: 2020-07-07 journal: nan doi: nan sha: doc_id: 206872 cord_uid: t6lr3g1m
to draw a roadmap of current research activities of the blockchain community, we first conduct a brief overview of state-of-the-art blockchain surveys published in the recent 5 years. we found that those surveys basically study blockchain-based applications, such as blockchain-assisted internet of things (iot), business applications, security-enabled solutions, and many other applications in diverse fields. however, we think that a comprehensive survey towards the essentials of blockchains, exploiting the state-of-the-art theoretical modelings, analytic models, and useful experiment tools, is still missing. to fill this gap, we perform a thorough survey by identifying and classifying the most recent high-quality research outputs that are closely related to the theoretical findings and essential mechanisms of blockchain systems and networks. several promising open issues are also summarized finally for future research directions. we hope this survey can serve as a useful guideline for researchers, engineers, and educators about the cutting-edge development of blockchains in the perspectives of theories, modelings, and tools. blockchain technologies have been deeply integrated into multiple applications that are closely related to every aspect of our daily life, such as cryptocurrencies, business applications, smart city, internet-of-things (iot) applications, etc. in the following, before discussing the motivation of this survey, we first conduct a brief exposition of the state-of-the-art blockchain survey articles published in the recent few years. to identify the position of our survey, we first collect 66 state-of-the-art blockchain-related survey articles. the number of surveys in each category is shown in fig. 1. we see that the top-three popular topics of blockchain-related surveys are iot & iiot, consensus protocols, and security & privacy.
we also classify those existing surveys and their chronological distribution in fig. 2, from which we discover that i) the number of surveys published each year increases dramatically, and ii) the diversity of topics also becomes greater following the chronological order. in detail, we summarize the publication years, topics, and other metadata of these surveys in table 1 and table 2. basically, those surveys can be classified into the following 7 groups, which are briefly reviewed as follows. 1.1.1 blockchain essentials. the first group is related to the essentials of the blockchain. a large number of consensus protocols, algorithms, and mechanisms have been reviewed and summarized in [1-8]. for example, motivated by the lack of a comprehensive literature review regarding the consensus protocols for blockchain networks, wang et al. [3] emphasized both the system design and the incentive mechanism behind those distributed blockchain consensus protocols, such as byzantine fault tolerant (bft)-based protocols and nakamoto protocols. from a game-theoretic viewpoint, the authors also studied how such consensus protocols affect the consensus participants in blockchain networks. among the surveys of smart contracts [9-11], atzei et al. [9] paid attention to the security vulnerabilities and programming pitfalls that could be incurred in ethereum smart contracts. dwivedi et al. [10] performed a systematic taxonomy of smart-contract languages, while zheng et al. [11] conducted a survey on the challenges, recent technical advances and typical platforms of smart contracts. sharding techniques are viewed as promising solutions to solving the scalability issue and low-performance problems of blockchains. several survey articles [12, 13] provide systematic reviews on sharding-based blockchain techniques. for example, wang et al. [12] focused on the general design flow and critical design challenges of sharding protocols. next, yu et al. [13] mainly discussed the intra-consensus security, atomicity of cross-shard transactions, and other advantages of sharding mechanisms. regarding scalability, chen et al. [14] analyzed the scalability technologies in terms of efficiency-improving and function-extension of blockchains, while zhou et al. [15] compared and classified the existing scalability solutions in ... roles for the performance, security, and healthy conditions of blockchain systems and blockchain networks. for example, salah et al. [26] studied how blockchain technologies benefit key problems of ai. zheng et al. [27] proposed the concept of blockchain intelligence and pointed out opportunities for these two fields to benefit each other. next, chen et al. [28] discussed the privacy-preserving and secure design of machine learning when blockchain techniques are introduced. liu et al. [29] identified the overview, opportunities, and applications when integrating blockchains and machine learning technologies in the context of communications and networking. recently, game theoretical solutions [30] have been reviewed as applied to blockchain security issues such as malicious attacks and selfish mining, as well as to resource allocation in the management of mining. both the advantages and disadvantages of game theoretical solutions and models were discussed. 1.1.4 cloud/edge computing & networking. first, park et al. [31] discussed how to take advantage of blockchains in cloud computing with respect to security solutions. xiong et al.
[32] then investigated how to facilitate blockchain applications in mobile iot and edge computing environments. yang et al. [33] identified various perspectives, including motivations, frameworks, and functionalities, when integrating blockchain with edge computing. nguyen et al. [34] presented a comprehensive survey on the case when blockchain meets 5g networks and beyond. the authors focused on the opportunities that blockchain can bring for 5g technologies, which include cloud computing, mobile edge computing, sdn/nfv, network slicing, d2d communications, 5g services, and 5g iot applications.
table 2. taxonomy of existing blockchain-related surveys (part 2):
• iot, iiot: christidis [35] (2016) blockchains and smart contracts for iot; ali [36] (2018) applications of blockchains in iot; fernandez [37] (2018) usage of blockchain for iot; kouicem [38] (2018) iot security; panarello [39] (2018) integration of blockchain and iot; dai [40] (2019) blockchain for iot; wang [41] (2019) blockchain for iot; nguyen [42] (2019) integration of blockchain and cloud of things; restuccia [43] (2019) blockchain technology for iot; cao [44] (2019) challenges in distributed consensus of iot; park [45] (2020) blockchain technology for green iot; lao [46] (2020) iot applications in blockchain systems; alladi [47] (2019) blockchain applications in industry 4.0 and iiot; zhang [48] (2019) 5g and beyond for iiot based on edge intelligence and blockchain
• uav: alladi [49] (2020) blockchain-based uav applications
• group-6: lu [50] (2018) functions, applications and open issues of blockchain; casino [51] (2019) current status, classification and open issues of blockchain apps
• agriculture: bermeo [52] (2018) blockchain technology in agriculture; ferrag [53] (2020) blockchain solutions to security and privacy for green agriculture
• sdn: alharbi [54] (2020) deployment of blockchains for software defined networks
• business apps: konst. [55] (2018) blockchain-based business applications
• smart city: xie [56] (2019) blockchain technology applied in smart cities
• smart grids: alladi [57] (2019) blockchain in use cases of smart grids; aderibole [58] (2020) smart grids based on blockchain technology
• file systems: huang [59] (2020) blockchain-based distributed file systems, ipfs, filecoin, etc.
• space industry: torky [60] (2020) blockchain in space industry
• covid-19: nguyen [61] (2020) combat covid-19 using blockchain and ai-based solutions
• general & outlook: yuan [62] (2016) the state of the art and future trends of blockchain; zheng [63] (2017) architecture, consensus, and future trends of blockchains; zheng [64] (2018) challenges and opportunities of blockchain; yuan [65] (2018) blockchain and cryptocurrencies; kolb [66] (2020) core concepts, challenges, and future directions in blockchains
1.1.5 iot & iiot. the blockchain-based applications for the internet of things (iot) [35-46] and the industrial internet of things (iiot) [47, 48] have received the largest amount of attention from both academia and industry. for example, as a pioneering work in this category, christidis et al. [35] provided a survey about how blockchains and smart contracts promote iot applications. later on, nguyen et al. [42] presented an investigation of the integration between blockchain technologies and the cloud of things, with in-depth discussion on backgrounds, motivations, concepts and architectures. recently, park et al. [45] emphasized the topic of introducing blockchain technologies to the sustainable ecosystem of green iot. for the iiot, zhang et al.
[48] discussed the integration of blockchain and edge intelligence to empower a secure iiot framework in the context of 5g and beyond. in addition, when applying blockchains to unmanned aerial vehicles (uavs), alladi et al. [49] reviewed numerous application scenarios covering both commercial and military domains such as network security, surveillance, etc. the research areas covered by the existing surveys on blockchain-based applications include general applications [50, 51], agriculture [52, 53], software-defined networking (sdn) [54], business applications [55], smart city [56], smart grids [57, 58], distributed file systems [59], space industry [60], and covid-19 [61]. some of those surveys are reviewed as follows. lu et al. [50] performed a literature review on the fundamental features of blockchain-enabled applications. through the review, the authors expect to outline the development route of blockchain technologies. then, casino et al. [51] presented a systematic survey of blockchain-enabled applications in the context of multiple sectors and industries. both the current status and the prospective characteristics of blockchain technologies were identified. summary of survey-article review: through the brief review of the state-of-the-art surveys, we have found that blockchain technologies have been adaptively integrated into a growing range of application sectors. blockchain theory and technology will bring substantial innovations, incentives, and a great number of application scenarios in diverse fields. based on the analysis of those survey articles, we believe that there will be more survey articles published in the near future, very likely in the areas of sharding techniques, scalability, interoperability, smart contracts, big data, ai technologies, 5g and beyond, edge computing, cloud computing, and many other fields. given the overview shown in table 1, table 2, fig. 1, and fig. 2, this article aims to fill the gap by emphasizing the cutting-edge theoretical studies, modelings, and useful tools for blockchains. particularly, we try to include the latest high-quality research outputs that have not been included by other existing survey articles. we believe that this survey can shed new light on the further development of blockchains. the survey presented in this article includes the following contributions. • we conduct a brief classification of existing blockchain surveys to highlight the significance of the literature review presented in this survey. • we then present a comprehensive investigation on the state-of-the-art theoretical modelings, analytic models, performance measurements, and useful experiment tools for blockchains, blockchain networks, and blockchain systems. • several promising directions and open issues for future studies are also envisioned. the structure of this survey is shown in fig. 3 and organized as follows. section 2 introduces the preliminaries of blockchains. section 3 summarizes the state-of-the-art theoretical studies that improve the performance of blockchains. in section 4, we then review various modelings and analytic models that help understand blockchains. diverse measurement approaches, datasets, and useful tools for blockchains are overviewed in section 5. we outlook the open issues in section 6. finally, section 7 concludes this article. blockchain is a promising paradigm for content distribution and distributed consensus over p2p networks.
in this section, we present the basic concepts, definitions and terminologies of blockchains as used in this article.
2.1 prime blockchain platforms
2.1.1 bitcoin. bitcoin is viewed as the blockchain system that runs the first cryptocurrency in the world. it builds upon two major techniques, i.e., the nakamoto consensus and the utxo model, which are introduced as follows.
nakamoto consensus. to reach agreement on blocks, bitcoin adopts the nakamoto consensus, in which miners generate new blocks by solving a puzzle. in this puzzle-solving process, also referred to as mining, miners need to find a nonce value that satisfies the required difficulty level [67]. by adjusting the difficulty, the bitcoin system maintains a stable block-generation rate of about one block per 10 minutes. when a miner generates a new block, it broadcasts this message to all the other miners in the network; miners that receive the new block append it to their local chain. if all the other miners receive the new block in time, the length of the main chain increases by one. however, because of network delays, not all miners always receive a new block in time; when a miner generates a block before receiving the previous one, a fork occurs. bitcoin addresses this issue by following the longest-chain rule.
utxo model. the unspent transaction output (utxo) model is adopted by cryptocurrencies like bitcoin and other popular blockchain systems [68, 69]. a utxo is an amount of digital money, and each utxo represents a chain of ownership between owners and receivers established through cryptography. in a blockchain, all utxos form a set in which each element denotes the unspent output of a transaction and can be used as an input to a future transaction. a client may own multiple utxos, and the total balance of the client is calculated by summing up all associated utxos. using this model, blockchains can efficiently prevent double-spend attacks [70].
2.1.2 ethereum. ethereum [71] is an open-source blockchain platform enabling the function of smart contracts. ether, the token of ethereum, is rewarded to the miners who perform computation to secure the consensus of the blockchain. ethereum executes on decentralized ethereum virtual machines (evms), in which scripts run on a network consisting of public ethereum nodes. compared with bitcoin, the evm's instruction set is turing-complete. ethereum also introduces an internal pricing mechanism called gas: a unit of gas measures the amount of computational effort needed to execute operations in a transaction. the gas mechanism is thus useful for restraining spam in smart contracts. ethereum 2.0 is an upgraded version of the original ethereum; the upgrades include a transition from pow to proof-of-stake (pos) and a throughput improvement based on sharding technologies.
2.1.3 eosio. eosio is another popular blockchain platform, released by the company block.one in 2018. different from bitcoin and ethereum, the smart contracts of eosio do not need to pay transaction fees, and its throughput is claimed to reach millions of transactions per second. furthermore, eosio also enables low block-confirmation latency, low-overhead bft finality, etc. these features have attracted a large number of users and developers, who can quickly and easily deploy decentralized applications in a governed blockchain. for example, in total 89,800,000 eosio blocks were generated within less than one and a half years of its first launch.
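to make the nakamoto-consensus puzzle described above concrete, the following minimal python sketch (our own simplification, not bitcoin's actual implementation) searches for a nonce whose double-sha256 header hash falls below a difficulty target; shrinking the target is how the protocol keeps the block interval near 10 minutes.

```python
import hashlib
import json
import time

def mine_block(prev_hash: str, transactions: list, difficulty_bits: int = 18):
    """Search for a nonce whose double-SHA256 block-header hash is below the target;
    a larger difficulty_bits value means a smaller target and a harder puzzle."""
    target = 2 ** (256 - difficulty_bits)
    tx_root = hashlib.sha256(json.dumps(transactions, sort_keys=True).encode()).hexdigest()
    timestamp = int(time.time())
    nonce = 0
    while True:
        header = f"{prev_hash}|{tx_root}|{timestamp}|{nonce}".encode()
        digest = hashlib.sha256(hashlib.sha256(header).digest()).hexdigest()
        if int(digest, 16) < target:        # puzzle solved: hash falls below the target
            return nonce, digest
        nonce += 1

if __name__ == "__main__":
    nonce, block_hash = mine_block("00" * 32, [{"from": "alice", "to": "bob", "amount": 1}])
    print(f"nonce={nonce}  hash={block_hash}")
```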
the consensus mechanism in blockchains is a fault-tolerance mechanism for reaching agreement on the same state of the blockchain network, such as a single state of all transactions in a cryptocurrency blockchain. popular proof-based consensus protocols include pow and pos. in pow, miners compete with each other to solve a puzzle whose solution is difficult to produce but easy for others to verify. once a miner obtains a required nonce value through a huge number of attempts, it is rewarded with a certain amount of cryptocurrency for creating a new block. in contrast, pos does not have miners; instead, a new block is forged by validators selected randomly from a committee, and the probability of being chosen as a validator is linearly related to the size of the validator's stake. pow and pos are both adopted as consensus protocols for the security of cryptocurrencies: the former is based on cpu power, and the latter on coin age. therefore, pos has a lower energy cost and is less likely to suffer from a 51% attack.
blockchain, as a distributed and public database of transactions, has become a platform for decentralized applications. despite its increasing popularity, blockchain technology faces a scalability problem: throughput does not scale with the increasing network size. thus, scalable blockchain protocols that can solve these scalability issues are still urgently needed. many different directions, such as off-chain, dag, and sharding techniques, have been explored to address the scalability of blockchains. here, we present several representative terms related to scalability.
dag. mathematically, a dag is a finite directed graph in which no directed cycles exist. in the context of blockchain, the dag is viewed as a revolutionary technology that can upgrade blockchains to a new generation. this is because a dag is blockless: all transactions link to multiple other transactions following a topological order on the dag network, so data can move directly between network participants. this results in a faster, cheaper and more scalable solution for blockchains. in fact, the bottleneck of blockchains mainly lies in the structure of blocks; thus, the blockless dag could be a promising way to improve the scalability of blockchains substantially.
sharding technique. the consensus protocol of bitcoin, i.e., the nakamoto consensus, has significant drawbacks in transaction throughput and network scalability. to address these issues, the sharding technique is one of the outstanding approaches: it improves throughput and scalability by partitioning the blockchain network into several small shards such that each shard can process a batch of unconfirmed transactions in parallel to generate medium blocks, which are then merged into a final block. basically, sharding includes network sharding, transaction sharding and state sharding. one shortcoming of the sharding technique is that malicious network nodes residing in the same shard may collude with each other, resulting in security issues. therefore, sharding-based protocols exploit a reshuffling strategy to address such security threats. however, reshuffling brings cross-shard data migration; thus, how to efficiently handle cross-shard transactions becomes an emerging topic in the context of sharding blockchains.
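as a quick illustration of why cross-shard handling matters, the short python sketch below (our own toy calculation, not taken from any cited protocol) estimates the fraction of two-party transactions that end up cross-shard when accounts are assigned to shards uniformly at random; with k shards the fraction equals roughly 1 - 1/k, so nearly every transaction is cross-shard once k is large.

```python
import random

def cross_shard_fraction(num_shards: int, n_tx: int = 100_000, seed: int = 0) -> float:
    """Estimate the fraction of 2-party transactions whose sender and receiver fall
    into different shards when accounts are placed uniformly at random."""
    rng = random.Random(seed)
    cross = sum(rng.randrange(num_shards) != rng.randrange(num_shards) for _ in range(n_tx))
    return cross / n_tx

for k in (2, 4, 16, 64):
    print(k, round(cross_shard_fraction(k), 3))   # close to 1 - 1/k
```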
3.1.1 throughput & latency. aiming to reduce the confirmation latency of transactions to milliseconds, hari et al. [72] proposed a high-throughput, low-latency, deterministic confirmation mechanism called accel for accelerating bitcoin's block confirmation. the key findings of this paper include how to identify singular blocks and how to use singular blocks to reduce the confirmation delay; once the confirmation delay is reduced, throughput increases accordingly. two obstacles have hindered the scalability of cryptocurrency systems: the first is low throughput, and the second is the requirement for every node to duplicate the communication, storage, and state representation of the entire blockchain network. wang et al. [73] studied how to overcome these obstacles. without weakening decentralization and security, the proposed monoxide technique offers linear scale-out ability by partitioning the workload, while preserving the simplicity of the blockchain system and amplifying its capacity. the authors also proposed a novel chu-ko-nu mining mechanism, which ensures the cross-zone atomicity, efficiency and security of the system. in [74], the authors proposed prism, a new blockchain protocol aiming to achieve scalable throughput with the full security of bitcoin. however, the authors also admitted that although the proposed prism achieves high throughput, its confirmation latency remains as large as 10 seconds, since there is only a single voter chain in prism. a promising solution is to introduce a large number of such voter chains, each of which need not be individually secure: even if every voter chain is attacked with a probability as high as 30%, the probability of successfully attacking half of all voter chains is still theoretically very low. thus, the authors believed that using multiple voter chains would be a good way to reduce the confirmation latency without sacrificing system security. considering that ethereum simply allocates transactions to shards according to their account addresses rather than the workload or complexity of the transactions, the resource consumption of transactions in each shard is unbalanced; as a consequence, the network transaction throughput is reduced. to solve this problem, woo et al. [75] proposed a heuristic algorithm named garet, a gas consumption-aware relocation mechanism for improving throughput in sharding-based ethereum environments. in particular, garet relocates the transaction workloads of each shard according to gas consumption. the experimental results show that garet achieves higher transaction throughput and lower transaction latency than existing techniques. transactions generated in real time keep increasing the size of blockchains. for example, the storage efficiency of the original bitcoin protocol has received much criticism, since it requires each bitcoin peer to store the full transaction history. although some revised protocols require only full-size nodes to store the entire copy of the whole ledger, transactions still consume a large amount of storage space in those full-size nodes. to alleviate this problem, several pioneering studies proposed storage-efficient solutions for blockchain networks. for example, by exploiting an erasure code-based approach, perard et al. [76] proposed a low-storage blockchain mechanism aiming at a low storage requirement for blockchains. the new low-storage nodes only have to store linearly encoded fragments of each block.
the original blockchain data can easily be recovered by retrieving fragments from other nodes under the erasure-code framework. this type of blockchain node thus allows blockchain clients to reduce their storage requirements. table 3 summarizes the latest theories for improving the performance of blockchains.
table 3. latest theories of improving the performance of blockchains.
• throughput & latency: [72] reduce confirmation delay — a high-throughput, low-latency, deterministic confirmation mechanism aiming to accelerate bitcoin's block confirmation; [73] monoxide — offers linear scale-out by partitioning workloads; in particular, the chu-ko-nu mining mechanism enables the cross-zone atomicity, efficiency and security of the system; [74] prism — a new blockchain protocol aiming to achieve scalable throughput with the full security of bitcoin; [75] garet — a gas consumption-aware relocation mechanism for improving throughput in sharding-based ethereum.
• storage efficiency: [76] erasure code-based — a new type of low-storage blockchain node using erasure-code theory to reduce the storage space of blockchains; [77] jidar, a data-reduction strategy — each node only has to store the transactions of interest and the related merkle branches from the complete blocks; [78] segment blockchain — a data-reduced storage mechanism in which each node only has to store a segment of the blockchain.
• reliability analysis: [79] availability of blockchains — a study of the availability of blockchain-based systems, in which read and write availability conflict with each other; [80] reliability prediction — h-brp predicts the reliability of blockchain peers by extracting their reliability parameters.
the authors also tested their system on low-configuration raspberry pi devices to show its effectiveness, which demonstrates the possibility of running blockchains on iot devices. then, dai et al. [77] proposed jidar, a data-reduction strategy for bitcoin. in jidar, each node only has to store the transactions of interest and the related merkle branches from the complete blocks, and all nodes verify transactions collaboratively through a query mechanism. this approach seems very promising for the storage efficiency of bitcoin; however, the experiments show that jidar reduces the storage overhead of each peer by only about 1% compared with the original bitcoin. following a similar idea, xu et al. [78] reduced the storage of blockchains using a segment blockchain mechanism, in which each node only needs to store a segment of the blockchain. the authors also proved that the proposed mechanism endures a failure probability of (ϕ/n)^m if an adversary colludes with fewer than ϕ of the n nodes and each segment is stored by m nodes. this theoretical result is useful for the storage design of blockchains when developing a particular segment mechanism for data-heavy distributed applications.
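the (ϕ/n)^m bound from [78] is easy to evaluate numerically; the sketch below plugs in illustrative parameters of our own choosing (not the paper's) to show how quickly replication drives the failure probability down.

```python
def segment_failure_probability(phi: int, n: int, m: int) -> float:
    """Upper bound (phi/n)**m from the segment-blockchain analysis [78]: a segment
    replicated on m nodes is lost only if all m holders belong to an adversary
    colluding with fewer than phi of the n nodes."""
    return (phi / n) ** m

# illustrative parameters (not from the paper): 25% colluding nodes, 10 replicas per segment
print(segment_failure_probability(phi=250, n=1000, m=10))   # about 9.5e-07
```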
in public blockchains, system clients typically join the blockchain network through a third-party peer; thus, the reliability of the selected blockchain peer is critical to the security of clients in terms of both resource efficiency and monetary issues. to enable clients to evaluate and choose reliable blockchain peers, zheng et al. [80] proposed a hybrid reliability prediction model for blockchains named h-brp, which is able to predict the reliability of blockchain peers by extracting their reliability parameters. the most recent studies on sharding and cross-chain interoperability are summarized as follows.
sharding blockchains:
• [81] rapidchain — a new sharding-based protocol for public blockchains in which intra-committee communication increases non-linearly with the number of committee members.
• [82] sharper — a permissioned blockchain system that adopts sharding techniques to improve the scalability of cross-shard transactions.
• [83] d-gas — a dynamic load-balancing mechanism for ethereum shards; it reallocates transaction accounts according to their gas consumption on each shard.
• [84] nrss — a node-rating based sharding scheme for blockchains, aiming to improve the throughput of committees.
• [85] optchain — a new sharding paradigm mainly used for optimizing the placement of transactions into shards.
• [86] sharding-based scaling system — an efficient shard-formation protocol that assigns nodes to shards securely, and a distributed transaction protocol that guards against malicious byzantine-fault coordinators.
• [87] sschain — a non-reshuffling structure that supports both transaction sharding and state sharding while eliminating huge data migration across shards.
• [88] eumonia — a permissionless parallel-chain protocol for realizing a global ordering of blocks.
• [89] vulnerability to sybil attacks — a systematic analysis of the vulnerability of the elastico protocol to sybil attacks.
• [90] n/2 bft sharding approach — a new blockchain sharding approach that can tolerate up to 1/2 byzantine nodes within a shard.
• [91] cycledger — a protocol paving the way towards scalability, security and incentives for sharding blockchains.
interoperability of multiple-chain systems:
• [92] interoperability architecture — a novel interoperability architecture that supports cross-chain cooperation among multiple blockchains, and a novel monitor multiplexing reading (mmr) method for passive cross-chain communication.
• [93] hyperservice — a programming platform that provides interoperability and programmability over multiple heterogeneous blockchains.
• [94] protocol move — a programming model enabling smart-contract developers to create dapps that can interoperate and scale in a multiple-chain environment.
• [95] cross-cryptocurrency tx protocol — a decentralized cryptocurrency exchange protocol enabling cross-cryptocurrency transactions based on smart contracts deployed on ethereum.
• [16] cross-chain communication — a systematic classification of cross-chain communication protocols.
one of the critical bottlenecks of today's blockchain systems is scalability; for example, the throughput of a blockchain does not scale when the network size grows. to address this dilemma, a number of scalability approaches have been proposed. in this part, we overview the most recent solutions with respect to sharding techniques, interoperability among multiple blockchains, and other solutions. some early-stage sharding blockchain protocols (e.g., elastico) improve scalability by having multiple groups of committees work in parallel.
however, this manner still requires a large amount of communication to verify every transaction, increasing linearly with the number of nodes within a committee; thus, the benefit of the sharding policy was not fully exploited. as an improved solution, zamani et al. [81] proposed a byzantine-resilient sharding-based protocol, namely rapidchain, for permissionless blockchains. taking advantage of block pipelining, rapidchain improves throughput by using a sound intra-committee consensus. the authors also developed an efficient cross-shard verification method to avoid flooding the whole network with broadcast messages. to make throughput scale with the network size, gao et al. [96] proposed a scalable blockchain protocol that leverages both sharding and proof-of-stake consensus techniques. their experiments were performed on an amazon ec2-based simulation network. although the results showed that the throughput of the proposed protocol increases with the network size, the performance was still not very high; for example, the maximum throughput was 36 transactions per second and the transaction latency was around 27 seconds. aiming to improve the efficiency of cross-shard transactions, amiri et al. [82] proposed a permissioned blockchain system named sharper, which strives for the scalability of blockchains by dividing and reallocating different data shards to various network clusters. the major contributions are the algorithm and protocol associated with the sharper model. in the authors' previous work, they had already proposed a permissioned blockchain; in this paper they extend it by introducing a consensus protocol for processing both intra-shard and cross-shard transactions, and sharper is devised by adopting sharding techniques. one important contribution is that sharper can be used in networks with a high percentage of non-faulty nodes; furthermore, the paper also contributes a flattened consensus protocol w.r.t the ordering of cross-shard transactions among all involved clusters. considering that ethereum places each group of transactions on a shard by their account addresses, the workloads and complexity of transactions across shards are clearly unbalanced, which further damages the network throughput. to address this imbalance, kim et al. [83] proposed d-gas, a dynamic load-balancing mechanism for ethereum shards. using d-gas, the transaction workloads of accounts on each shard are reallocated according to their gas consumption, with the goal of maximizing transaction throughput. the evaluation results showed that d-gas achieved up to 12% higher transaction throughput and 74% lower transaction latency compared with other existing techniques. the random sharding strategy causes imbalanced performance gaps among different committees in a blockchain network, and those gaps become a bottleneck for transaction throughput. thus, wang et al. [84] proposed a new sharding policy for blockchains named nrss, which exploits node rating to assess network nodes according to their performance in transaction verification. after this evaluation, all network nodes are reallocated to different committees to close the previous performance gaps. experiments conducted on a local blockchain system showed that nrss improves throughput by around 32% under sharding.
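the load imbalance that motivates garet, d-gas and nrss is easy to reproduce; the python sketch below (a toy of our own, not any of those mechanisms) assigns accounts to shards purely by address hash and sums gas per shard, showing how a few heavy contracts overload whichever shards they happen to hash into.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4

def address_shard(address: str) -> int:
    """Static placement: the shard id is derived from the account address alone,
    ignoring how much work the account actually generates."""
    return int(hashlib.sha256(address.encode()).hexdigest(), 16) % NUM_SHARDS

def gas_per_shard(account_gas: dict) -> dict:
    """Total gas consumption per shard under address-based placement."""
    load = defaultdict(int)
    for address, gas in account_gas.items():
        load[address_shard(address)] += gas
    return dict(load)

# toy workload: a few heavy contracts dominate gas usage (hypothetical numbers)
account_gas = {f"0x{i:040x}": (1_000_000 if i < 3 else 1_000) for i in range(100)}
print(gas_per_shard(account_gas))   # shards hosting the heavy contracts are overloaded
```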
sharding has been proposed mainly to improve the scalability and throughput performance of blockchains. a good sharding policy should minimize cross-shard communication as much as possible. a classic design is transaction sharding; however, transaction sharding usually adopts a random sharding policy, which leads to a dilemma in which most transactions are cross-shard. to this end, nguyen et al. [85] proposed a new sharding paradigm, called optchain, that differs from random sharding and can minimize the number of cross-shard transactions. the authors achieved this goal in two steps. first, they designed two metrics, the t2s-score (transaction-to-shard) and the l2s-score (latency-to-shard). the t2s-score measures how likely a transaction should be placed into a shard, while the l2s-score measures the confirmation latency when placing a transaction into a shard. next, they utilized the well-known pagerank analysis to calculate the t2s-score and proposed a mathematical model to estimate the l2s-score. finally, how does optchain place transactions into shards based on the combination of the t2s and l2s scores? in brief, they introduced another metric composed of both, called the temporal fitness score. for a given transaction u and a shard s_i, optchain computes the temporal fitness score for the pair ⟨u, s_i⟩ and then simply puts transaction u into the shard with the highest temporal fitness score. similar to [85], dang et al. [86] proposed a new shard-formation protocol in which the nodes of different shards are re-assigned into different committees to reach a certain safety degree. in addition, they proposed a coordination protocol to handle cross-shard transactions while guarding against byzantine-fault malicious coordinators. the experimental results showed that the throughput reaches a few thousand tps both in a local cluster with 100 nodes and in a large-scale google cloud platform testbed. considering that reshuffling operations lead to huge data migration in sharding-based protocols, chen et al. [87] proposed sschain, a non-reshuffling structure that supports both transaction sharding and state sharding while eliminating huge data migration across shards. although existing sharding-based protocols, e.g., elastico, omniledger and rapidchain, have gained a lot of attention, they still have some drawbacks. for example, maintaining mutual connections among all honest nodes requires a large amount of communication resources, and there is no incentive mechanism driving nodes to participate actively in the sharding protocol. to solve those problems, zhang et al. [91] proposed cycledger, a protocol designed for sharding-based distributed ledgers that targets scalability, reliable security, and incentives. cycledger selects a leader and a subset of nodes for each committee to handle intra-shard consensus and synchronization with other committees. a semi-commitment strategy and a recovery scheme were also proposed to deal with system crashes. in addition, the authors proposed a reputation-based incentive policy to encourage nodes to behave honestly.
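to illustrate the kind of score-based placement optchain [85] performs, the sketch below picks the shard with the highest "temporal fitness" for a transaction; the alpha-weighted combination of the two scores is our own illustrative stand-in, since the paper defines its own scoring, and the numbers are hypothetical.

```python
def place_transaction(t2s: dict, l2s: dict, alpha: float = 0.5) -> int:
    """Pick the shard with the highest temporal fitness score.

    t2s[s]: how strongly the transaction connects to shard s (higher is better)
    l2s[s]: estimated confirmation latency if placed in shard s (lower is better)
    The combination below (alpha-weighted, latency inverted) is an illustrative
    choice; optchain's actual scoring follows the definitions in [85].
    """
    def fitness(s):
        return alpha * t2s[s] + (1 - alpha) / (1.0 + l2s[s])
    return max(t2s, key=fitness)

t2s = {0: 0.7, 1: 0.2, 2: 0.1}        # hypothetical transaction-to-shard scores
l2s = {0: 2.5, 1: 0.8, 2: 1.1}        # hypothetical latency estimates (seconds)
print(place_transaction(t2s, l2s))    # shard 0 wins on connectivity despite higher latency
```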
following the widespread adoption of smart contracts, the role of blockchains has been upgraded from token exchange to programmable state machines; thus, blockchain interoperability must evolve accordingly. to help realize this new type of interoperability among multiple heterogeneous blockchains, liu et al. [93] proposed hyperservice, which includes two major components: a programming framework allowing developers to create cross-chain applications, and a universal interoperability protocol for the secure implementation of dapps on blockchains. the authors implemented a 35,000-line prototype to demonstrate the practicality of hyperservice; using the prototype, the end-to-end delays of cross-chain dapps and the aggregated platform throughput can be measured conveniently. in an ecosystem consisting of multiple blockchains, interoperability among those different blockchains is an essential issue. to help smart-contract developers build dapps, fynn et al. [94] proposed a practical move protocol that works across multiple blockchains; its basic idea is to support a move operation for moving objects and smart contracts from one blockchain to another. recently, to enable cross-cryptocurrency transactions, tian et al. [95] proposed a decentralized cryptocurrency exchange strategy implemented on ethereum through smart contracts. additionally, a great number of studies of cross-chain communication are covered in [16], where readers can find a systematic classification of cross-chain communication protocols. recent consensus-level proposals are summarized as follows.
new protocols:
• [97] ouroboros praos — a new secure proof-of-stake protocol, proved secure in the semi-synchronous adversarial setting.
• [98] tendermint — a new bft consensus protocol for wide-area networks organized by a gossip-based p2p network under adversarial conditions.
• [73] chu-ko-nu mining — a novel proof-of-work scheme that incentivizes miners to create multiple blocks in different zones with only a single pow mining effort.
• [99] proof-of-trust (pot) — a novel proof-of-trust consensus for the online services of crowdsourcing.
new paradigms:
• [100] streamchain — proposes shifting block-based distributed ledgers to a paradigm of stream transaction processing to achieve low end-to-end latency without much affecting throughput.
in monoxide [73], the authors devised a novel proof-of-work scheme, named chu-ko-nu mining, which encourages a miner to create multiple blocks in different zones simultaneously with a single pow solving effort. this mechanism makes the effective mining power in each zone almost equal to the total physical mining power in the entire network; thus, chu-ko-nu mining increases the attack threshold for each zone to 50%. furthermore, chu-ko-nu mining improves the energy efficiency of mining new blocks, because many more blocks can be produced in each round of normal pow mining. the online services of crowdsourcing face a challenge in finding a suitable consensus protocol. by leveraging advantages of the blockchain such as the traceability of service contracts, zou et al. [99] proposed a new consensus protocol, named proof-of-trust (pot) consensus, for crowdsourcing and the general online service industries. the pot consensus protocol leverages trust management of all service participants, and it works as a hybrid blockchain architecture in which a consortium blockchain integrates with a public service network. conventionally, a block-based data structure is adopted by permissionless blockchain systems because blocks can efficiently amortize the cost of cryptography.
however, the benefits of blocks are diminished in today's permissioned blockchains, since block processing introduces large batching latencies. for distributed ledgers that are neither geo-distributed nor pow-required, istván et al. [100] proposed shifting the traditional block-based data structure to a paradigm of stream-like transaction processing. the premier advantage of this paradigm shift is that it largely shrinks the end-to-end latencies of permissioned blockchains. the authors developed a prototype of their concept based on hyperledger fabric; the results showed end-to-end latencies below 10 ms with throughput close to 1500 tps. permissioned blockchains have a number of limitations, such as poor performance, privacy leakage, and inefficient handling of cross-application transactions. to address those issues, amiri et al. [101] proposed caper, a permissioned blockchain that deals well with cross-application transactions for distributed applications. in particular, caper constructs its blockchain ledger as a dag and handles cross-application transactions by adopting three specific consensus protocols: a global consensus using a separate set of orderers, a hierarchical consensus protocol, and a one-level consensus protocol. then, chang et al. [102] proposed an edge computing-based blockchain architecture [105], in which edge-computing providers supply computational resources for blockchain miners. the authors then formulated a two-phase stackelberg game for the proposed architecture, aiming to find the stackelberg equilibrium of the theoretically optimal mining scheme. next, zheng et al. [103] proposed a new infrastructure for practical pow blockchains called axechain, which aims to exploit the precious computing power of miners to solve arbitrary practical problems submitted by system users. the authors also analyzed the trade-off between energy consumption and the security guarantees of axechain. this study opens up a new direction for pursuing high energy efficiency with meaningful pow protocols. with the non-linear (e.g., graphical) structures adopted by blockchain networks, researchers are becoming interested in the performance improvements brought by new data structures. to gain insights into such non-linear blockchain systems, chen et al. [104] performed a systematic analysis by taking three critical metrics into account: full verification, scalability, and finality duration. the authors revealed that it is impossible for a blockchain to achieve all three metrics at the same time; any blockchain designer must consider the trade-off among these three properties. graphs are widely used in blockchain networks. for example, the merkle tree has been adopted by bitcoin, and several blockchain protocols, such as ghost [106], phantom [107], and conflux [108], construct their blocks using the directed acyclic graph (dag) technique. different from those generalized graph structures, in this part we review the most recent studies that exploit graph theory for a better understanding of blockchains. since transactions in blockchains are easily structured into graphs, graph theory and graph-based data-mining techniques are viewed as good tools to discover interesting findings hidden in the graphs of blockchain networks. some representative recent studies are reviewed as follows.
leveraging the techniques of graph analysis, chen et al. [109] characterized three major activities on ethereum: money transfer, the creation of smart contracts, and the invocation of smart contracts. the major contribution of this paper is that it performed the first systematic investigation and proposed new approaches based on cross-graph analysis, which can address two security issues in ethereum: attack forensics and anomaly detection. with respect to graph theory, the authors concentrated on two aspects: (1) graph construction: they identified four types of transactions that are not related to money transfer, smart contract creation, or smart contract invocation; (2) graph analysis: they divided the remaining transactions into three groups according to the activities they trigger, i.e., the money flow graph (mfg), the smart contract creation graph (ccg) and the contract invocation graph (cig). in this manner, the authors delivered many useful insights into transactions that help address the security issues of ethereum. similarly, by processing the bitcoin transaction history, akcora et al. [110] and dixon et al. [111] modeled the transfer network as an extreme transaction graph. through the analysis of chainlet activities [112] in the constructed graph, they proposed garch-based forecasting models to identify the financial risk of the bitcoin market for cryptocurrency users. an emerging research direction associated with blockchain-based cryptocurrencies is to understand the network dynamics behind the graphs of those blockchains, such as the transaction graph, because people wonder what the connection is between the price of a cryptocurrency and the dynamics of the overlying transaction graph. to answer this question, abay et al. [113] proposed chainnet, a computationally lightweight method for learning the graph features of blockchains. the authors also disclosed several insightful findings; for example, it is the topological feature of the transaction graph, rather than its degree distribution, that impacts the prediction of bitcoin price dynamics. furthermore, utilizing the mt. gox transaction history, chen et al. [114] exploited a graph-based data-mining approach to uncover market manipulation of bitcoin. the authors constructed three graphs, the extreme high graph (ehg), the extreme low graph (elg), and the normal graph (nmg), based on initial processing of the transaction dataset, and then discovered many correlations between market manipulation patterns and the price of bitcoin. in another direction, based on address graphs, victor et al. [115] studied erc20 token networks by analyzing smart contracts on the ethereum blockchain. different from other graph-based approaches, the authors focused their attention on address graphs, i.e., token networks. considering all network addresses, each token network is viewed as an overlay graph of the entire set of ethereum network addresses. similar to [109], the authors presented the relationships between transactions by exploiting graph-based analysis, in which arrows can denote invoking functions between transactions and smart contracts, as well as token transfers between transactions. the findings presented by this study help us gain a good understanding of token networks in terms of time-varying characteristics, such as the usage patterns of the blockchain system.
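a money-flow-style graph of the kind used in [109] and [115] can be assembled in a few lines; the sketch below (illustrative data and metrics of our own, not the papers' pipelines) builds a weighted directed graph from transfer records with networkx and computes simple structural statistics.

```python
import networkx as nx

# hypothetical transfer records: (sender, receiver, amount); real studies parse
# these from raw chain data or token-transfer event logs
transfers = [
    ("0xa1", "0xb2", 5.0), ("0xa1", "0xc3", 2.5), ("0xb2", "0xc3", 1.0),
    ("0xc3", "0xd4", 7.0), ("0xd4", "0xa1", 0.5), ("0xb2", "0xd4", 3.0),
]

g = nx.DiGraph()
for sender, receiver, amount in transfers:
    # accumulate amounts on repeated transfers between the same pair of addresses
    if g.has_edge(sender, receiver):
        g[sender][receiver]["weight"] += amount
    else:
        g.add_edge(sender, receiver, weight=amount)

print("nodes:", g.number_of_nodes(), "edges:", g.number_of_edges())
print("out-degree:", dict(g.out_degree()))
print("pagerank:", {n: round(v, 3) for n, v in nx.pagerank(g, weight="weight").items()})
```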
an interesting finding is that around 90% of all transfers stem from the top 1,000 token contracts; that is to say, fewer than 10% of token recipients have transferred their tokens. this finding is contrary to the viewpoint of [116], where somin et al. showed that the full set of transfers seems to obey a power-law distribution; the study [115] indicates that transfers in token networks likely do not follow a power law. the authors attributed these observations to three possible reasons: 1) most token users have no incentive to transfer their tokens and simply hold them; 2) the majority of inactive tokens are treated as something like unwanted spam; 3) only a small portion of users, approximately 8%, intended to sell their tokens on a market exchange. recently, zhao et al. [117] explored the account creation, account voting, money transfer and contract authorization activities of early-stage eosio transactions through graph-based metric analysis. their study revealed abnormal transactions such as voting gangs and frauds. latencies of block transfer and processing generally exist in blockchain networks because the large number of miner nodes are geographically distributed. such delays increase the probability of forking and the vulnerability to malicious attacks. thus, it is critical to know how the network dynamics caused by block propagation latencies and the fluctuation of miners' hashing power affect blockchain performance, such as the block generation rate. to find the connection between these factors, papadis et al. [118] developed stochastic models to derive the evolution of a blockchain in a wide-area network. their results provide practical insights for design issues of blockchains, for example, how to change the mining difficulty in the pow consensus while guaranteeing an expected block generation rate or an immunity level against adversarial attacks. the authors then performed analytical studies and simulations to evaluate the accuracy of their models. this stochastic analysis opens a door for us to gain a deeper understanding of the dynamics in a blockchain network. towards the stability and scalability of blockchain systems, gopalan et al. [119] also proposed a stochastic model for a blockchain system. during their modeling, a structural asymptotic property called one-endedness was identified, and the authors proved that a blockchain system is one-ended if it is stochastically stable. the upper and lower bounds of the stability region were also studied, and the authors found that the stability bounds are closely related to the conductance of the p2p blockchain network. these findings are insightful because they allow researchers to assess the scalability of blockchain systems deployed on large-scale p2p networks. although the sharding protocol is viewed as a very promising solution to the scalability of blockchains and is adopted by multiple well-known blockchains such as rapidchain [81], omniledger [69], and monoxide [73], the failure probability of a committee under a sharding protocol is still unknown. to fill this gap, hafid et al. [120-122] proposed a stochastic model to capture the security analysis of sharding-based blockchains using a probabilistic approach. with the proposed mathematical model, an upper bound of the failure probability was derived for a committee; in particular, three probability inequalities were used in their model: chebyshev, hoeffding, and chvátal. the authors claim that the proposed stochastic model can be used to analyze the security of any sharding-based protocol.
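the committee-failure question studied in [120-122] can be illustrated with a direct hypergeometric tail computation; the sketch below uses illustrative parameters of our own (not the papers' numbers) to compute the probability that a uniformly sampled committee contains at least a 1/3 fraction of malicious nodes.

```python
from math import comb, ceil

def committee_failure_prob(n: int, t: int, c: int, resiliency: float = 1/3) -> float:
    """Hypergeometric tail: probability that a uniformly sampled committee of
    size c drawn from n nodes (t of them malicious) contains at least
    ceil(resiliency * c) malicious members."""
    threshold = ceil(resiliency * c)
    return sum(comb(t, k) * comb(n - t, c - k) for k in range(threshold, c + 1)) / comb(n, c)

# illustrative numbers: 2000 nodes, 25% malicious, committees of 200
print(committee_failure_prob(n=2000, t=500, c=200))
```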
in blockchain networks, several stages of the mining process and the generation of new blocks can be formulated as queueing systems, such as the transaction-arrival queue, the transaction-confirmation queue, and the block-verification queue. thus, a growing number of studies exploit queueing theory to disclose the mining and consensus mechanisms of blockchains. some recent representative works are reviewed as follows. to develop a queueing theory of blockchain systems, li et al. [123, 124] devised a batch-service queueing system to describe the mining and creation of new blocks in the miners' pool. for this blockchain queueing system, the authors exploited a gi/m/1-type continuous-time markov process and derived the stability condition and the stationary probability matrix of the queueing system using matrix-geometric techniques. then, observing that the confirmation delay of bitcoin transactions is larger than in conventional credit card systems, ricci et al. [125] proposed a theoretical framework integrating queueing theory and machine learning techniques to gain a deep understanding of transaction confirmation times. the authors chose queueing theory because a queueing model is suitable for seeing how different blockchain parameters affect transaction latencies. their measurement results showed that bitcoin users experience a delay that is slightly larger than the residual time of a block confirmation. frolkova et al. [126] formulated the synchronization process of the bitcoin network as an infinite-server model; the authors derived a closed form for the model that can be used to capture the stationary queue distribution, and furthermore proposed a random-style fluid limit under service latencies. on the other hand, to evaluate and optimize the performance of blockchain-based systems, memon et al. [128] proposed a queueing theory-based model to capture critical statistics of blockchain networks, such as the number of transactions per new block, the mining interval of each block, the transaction throughput, and the waiting time in the memory pool. the modeling and analysis studies reviewed in this section are summarized as follows.
graph-based mining theories:
• [109] (money transfer, smart contract creation and invocation graphs) — via graph analysis, the authors extracted three major activities, i.e., money transfer, smart contract creation, and smart contract invocation.
• [113] (features of transaction graphs) — an extendable and computationally efficient method for graph representation learning on blockchains.
• [114] (market manipulation patterns) — a graph-based data-mining approach revealing market manipulation evidence in bitcoin.
• [117] (clustering coefficient, assortativity of the transaction graph) — graph-based analysis revealing abnormal transactions of eosio.
token networks:
• [115] (token-transfer distributions) — a study of token networks through analyzing smart contracts of the ethereum blockchain based on graph analysis.
• [110, 111] (extreme chainlet activity) — graph-based analysis models for assessing the financial investment risk of bitcoin.
blockchain network analysis:
• [118] (block completion rates, and the probability of a successful adversarial attack) — stochastic models capturing critical blockchain properties and evaluating the impact of blockchain propagation latency on key performance metrics; this study provides useful insights for the design of blockchain networks.
stability analysis:
• [119] (time to consistency, cycle length, consistency fraction, age of information) — a network model that can identify the stochastic stability of blockchain systems.
failure probability analysis:
• [120-122] (failure probability of a committee; sums of upper-bounded hypergeometric and binomial distributions for each epoch) — a probabilistic model for the security analysis of sharding blockchain protocols; this study shows how to keep the failure probability below a defined threshold for a specific sharding protocol.
mining procedure and block generation:
• [123, 124] (average number of transactions in the arrival queue and in a block, and average transaction confirmation time) — a markovian batch-service queueing system expressing the mining process and the generation of new blocks in the miners' pool.
block-confirmation time:
• [125] (residual lifetime of a block until the next block is confirmed) — a theoretical framework for deeply understanding transaction confirmation time, integrating queueing theory and machine learning techniques.
synchronization process of the bitcoin network:
• [126] (stationary queue-length distribution) — an infinite-server model with a random fluid limit for the bitcoin network.
mining resources allocation:
• [127] (mining resources for miners; queueing stability) — a lyapunov optimization-based queueing analytical model studying the allocation of mining resources in pow-based blockchain networks.
blockchain's theoretical working principles:
• [128] (number of transactions per block, mining interval of each block, memory pool size, waiting time, number of unconfirmed transactions) — a queueing theory-based model for better understanding the theoretical working principles of blockchain networks.
exploration of ethereum transactions:
• [130] (temporal information and the multiplicity features of ethereum transactions) — an analytical model based on multiplex network theory for understanding ethereum transactions.
• [131] (pending time of ethereum transactions) — a characterization study of ethereum focusing on pending time, attempting to find the correlation between pending time and fee-related parameters of ethereum.
next, fang et al. [127] proposed a queueing analytical model to allocate mining resources for general pow-based blockchain networks. the authors formulated the queueing model using lyapunov optimization techniques; based on this stochastic theory, a dynamic allocation algorithm was designed to find a trade-off between mining energy and queueing delay. different from the aforementioned works [123-125], the proposed lyapunov-based algorithm does not need to make any statistical assumptions on arrivals and services. for people considering whether a blockchain system is needed for their business, a notable fact is that blockchain is not always applicable to all real-life use cases. to help analyze whether blockchain is appropriate for a specific application scenario, wust et al. [129] provided the first structured analytical methodology, which can help decide whether a particular application system indeed needs a blockchain, either permissioned or permissionless, as its technical solution; they applied it to three representative scenarios, i.e., supply chain management, interbank payments, and decentralized autonomous organizations.
modeling the competition over multiple miners:
• [132] (competing mining resources of miners of a cryptocurrency blockchain) — game theory is exploited to find nash equilibria while peers compete for mining resources.
a neat bound of consistency latency:
• [133] (consistency of a pow blockchain) — a neat bound of mining latencies that helps understand the consistency of nakamoto's blockchain consensus in asynchronous networks.
network connectivity:
• [134] (consensus security) — an analytical model to evaluate the impact of network connectivity on the consensus security of pow blockchains under different adversary models.
how ethereum responds to sharding:
• [135] (balance among shards; number of transactions that would involve multiple shards; amount of data relocated across shards) — a study of how sharding affects ethereum, first modeling ethereum as a graph and then assessing the three metrics when partitioning the graph.
required properties of sharding protocols:
• [136] (consistency and scalability) — an analytical model to evaluate whether a protocol for sharded distributed ledgers fulfills the necessary properties.
vulnerability to forking attacks:
• [137] (hashrate power; net cost of an attack) — a fine-grained vulnerability analysis of blockchain networks incurred by intentional forking attacks, taking advantage of large deviation theory.
counterattack to double-spend attacks:
• [70] (robustness parameter; vulnerability probability) — a study of how to defend against and even counterattack double-spend attacks in pow blockchains.
limitations of pbft-based blockchains:
• [138] (performance of blockchain applications; persistence; possibility of forks) — a study identifying several misalignments between the requirements of permissioned blockchains and classic bft protocols.
although ethereum has gained much popularity since its debut in 2014, the systematic analysis of ethereum transactions still suffers from insufficient exploration. therefore, lin et al. [130] proposed to model the transactions using multiplex network techniques, and devised several random-walk strategies for graph representation of the transaction network. this study can help us better understand the temporal data and the multiplicity features of ethereum transactions. to better understand the network features of an ethereum transaction, sousa et al. [131] focused on the pending time, defined as the latency from the time a transaction is observed to the time it is packed into the blockchain. the authors tried to find correlations between this pending time and fee-related parameters such as gas and gas price. surprisingly, their data-driven empirical analysis showed no clear correlation between those two factors, which is counterintuitive. to achieve consensus about the state of a blockchain, miners have to compete with each other by invoking a certain proof mechanism, say pow; such competition among miners is the key module of public blockchains such as bitcoin. to model the competition over multiple miners of a cryptocurrency blockchain, altman et al. [132] exploited game theory to find nash equilibria while peers compete for mining resources. the proposed approach helps researchers understand this competition well. however, the authors also mentioned that they did not study punishment and cooperation between miners in repeated games; those open topics will be very interesting for future studies.
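the game-theoretic setting of [132] can be illustrated with a standard proportional-reward (tullock-style) mining contest, which is our own simplified stand-in rather than the exact model of that paper: each miner chooses hash power x_i at unit cost c and receives the block reward R in proportion to its share of total power, and the symmetric nash equilibrium is x* = R(n-1)/(c n^2), verified numerically below.

```python
def payoff(x_i: float, others: float, reward: float, cost: float) -> float:
    """Miner i's expected payoff in a proportional-reward contest:
    share of the block reward minus the cost of hash power."""
    total = x_i + others
    return reward * (x_i / total) - cost * x_i if total > 0 else 0.0

def symmetric_equilibrium(n: int, reward: float, cost: float) -> float:
    """Closed-form symmetric Nash equilibrium of the Tullock-style contest:
    x* = R (n - 1) / (c n^2)."""
    return reward * (n - 1) / (cost * n ** 2)

n, reward, cost = 10, 12.5, 2.0            # hypothetical numbers of miners, reward, unit cost
x_star = symmetric_equilibrium(n, reward, cost)

# numerical best-response check: no profitable unilateral deviation from x_star
others = (n - 1) * x_star
best = max((payoff(x, others, reward, cost), x) for x in
           [x_star * k / 1000 for k in range(1, 3001)])
print(round(x_star, 4), round(best[1], 4))  # the best response coincides with x_star
```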
to ensure the consistency of a pow blockchain in an asynchronous network, zhao et al. [133] performed an analysis and derived a neat bound around 2µ ln(µ/ν), where µ + ν = 1, with µ and ν denoting the fractions of computation power held by the honest and adversarial miners, respectively. such a neat bound on mining latencies helps us understand the consistency of nakamoto's blockchain consensus in asynchronous networks. bitcoin's consensus security is built upon the assumption of an honest majority: the blockchain system is considered secure only if the majority of miners are honest while voting towards a global consensus. recent research suggests that network connectivity, blockchain forks, and mining strategies are major factors that impact the consensus security of the bitcoin blockchain. to provide pioneering concrete modeling and analysis, xiao et al. [134] proposed an analytical model to evaluate the impact of network connectivity on the consensus security of pow blockchains. to validate the effectiveness of the proposed analytical model, the authors applied it to two adversary scenarios, honest-but-potentially-colluding and selfish mining. although sharding is viewed as a prevalent technique for improving the scalability of blockchain systems, several essential questions remain: what can we expect from, and what price must be paid for, introducing the sharding technique to ethereum? to answer these questions, fynn et al. [135] studied how sharding works for ethereum by modeling ethereum as a graph. by partitioning the graph, they evaluated the trade-off between edge-cut and balance. several practical insights were disclosed; for example, three major components, e.g., computation, storage and bandwidth, play a critical role when partitioning ethereum, and a good design of incentives is also necessary for adopting a sharding mechanism. as mentioned multiple times, the sharding technique is viewed as a promising solution for improving the scalability of blockchains; however, the properties of a sharded blockchain under a fully adaptive adversary are still unknown. to this end, avarikioti et al. [136] defined consistency and scalability for sharded blockchain protocols, and the limitations of security and efficiency of sharding protocols were also derived. they then analyzed these two properties in the context of multiple popular sharding-based protocols such as omniledger, rapidchain, elastico, and monoxide. several interesting conclusions were drawn; for example, the authors argue that elastico and monoxide fail to guarantee the balance between the consistency and scalability properties, while omniledger and rapidchain fulfill all requirements of a robust sharded blockchain protocol.
forking attacks have become a common threat faced by the blockchain market. the related existing studies mainly focus on detecting such attacks through transactions; however, this manner cannot prevent forking attacks from happening. to resist forking attacks, wang et al. [137] studied the fine-grained vulnerability of blockchain networks caused by intentional forks using large deviation theory. this study can help set the robustness parameters of a blockchain network, since the vulnerability analysis provides the correlation between the robustness level and the vulnerability probability. in detail, the authors found that it is much more cost-efficient to set the robustness-level parameters than to spend computational capability on lowering the attack probability. an existing economic analysis [139] reported that attacks on pow mining-based blockchain systems can be cheap under a specific condition, namely when sufficient hashrate capability is rented. moroz et al. [70] studied how to defend against double-spend attacks from an interesting reverse direction. the authors found that the counterattack of victims can lead to a classic game-theoretic war-of-attrition model. this study showed that double-spend attacks on some pow-based blockchains are actually cheap; however, defending against or even counterattacking such double-spend attacks is possible when victims own the same capacity as the attacker. although bft protocols have attracted a lot of attention, a number of fundamental limitations remain unaddressed when running blockchain applications on top of classical bft protocols. those limitations include one related to low performance and two related to the gaps between the state machine replication and blockchain models, i.e., the lack of strong persistence guarantees and the occurrence of forks. to identify those limitations, bessani et al. [138] first studied them using a digital coin blockchain app called smartcoin and a popular bft replication library called bft-smart, and then discussed how to tackle these limitations in a protocol-agnostic manner. the authors also implemented an experimental permissioned blockchain platform, namely smartchain; their evaluation results showed that smartchain can address the aforementioned limitations and significantly improve the performance of a blockchain application. the following studies apply data analytics and machine learning to blockchains.
cryptojacking detection:
• [140] (hardware performance counters) — a machine learning-based solution to prevent cryptojacking attacks.
• [141] (various system resource utilization) — an in-browser cryptojacking detection approach (capjack), based on the latest capsnet.
market-manipulation mining:
• [114] (various graph characteristics of the transaction graph) — a mining approach using the exchanges collected from the transaction networks.
predicting volatility of bitcoin price:
• [111] (various graph characteristics of extreme chainlets) — a graph-based analytic model to predict the intraday financial risk of the bitcoin market.
money-laundering detection:
• [142] (various graph characteristics of the transaction graph) — machine learning models to detect potential money laundering activities from bitcoin transactions.
ponzi-scheme detection:
• [143] (factors that affect scam persistence) — an analysis of the demand and supply perspectives of ponzi schemes in the bitcoin ecosystem.
• [144, 145] (account and code features of smart contracts) — detection of ponzi schemes on ethereum based on data mining and machine learning approaches.
design problem of cryptoeconomic systems:
• [146] (price of xns token; subsidy of app developers) — a practical evidence-based example showing how data science and stochastic modeling can be applied to designing cryptoeconomic blockchains.
pricing mining hardware:
• [147] (miner revenue; asic value) — a study of the correlation between the price of mining hardware (asic) and the value volatility of the underlying cryptocurrency.
web resources can be secretly exploited for cryptocurrency mining; thus, web users face severe risks from cryptocurrency-hungry hackers. for example, cryptojacking attacks [148] have raised growing attention. in this type of attack, a mining script is secretly embedded by a hacker without the user noticing; when the script is loaded, mining begins in the background of the system and a large portion of hardware resources is requisitioned for mining. to tackle cryptojacking attacks, tahir et al. [140] proposed a machine learning-based solution that leverages hardware performance counters as the critical features and achieves high accuracy in classifying parasitic miners. the authors also built their approach into a browser extension towards widespread real-time protection for web users. similarly, ning et al. [141] proposed capjack, an in-browser cryptojacking detector based on the deep capsule network (capsnet) [149] technology. as mentioned previously, to detect potential manipulation of the bitcoin market, chen et al. [114] proposed a graph-based mining approach to study the evidence from the transaction network built from the mt. gox transaction history; the findings of this study suggest that the cryptocurrency market requires regulation. to predict drastic price fluctuations of bitcoin, dixon et al. [111] studied the impact of extreme transaction graph (etg) activity on the intraday dynamics of bitcoin prices; the authors utilized chainlets [112] (subgraphs of the transaction graph) to develop their predictive models. the authors of [151] mentioned that money laundering conducted in underground markets can be detected using bitcoin mixing services; however, they did not present an essential anti-money-laundering strategy in their paper. in contrast, utilizing a transaction dataset collected over three years, hu et al. [142] performed an in-depth detection study for discovering money laundering activities on the bitcoin network. to distinguish money laundering transactions from regular ones, the authors proposed four types of classifiers based on graph features of the transaction graph: immediate neighbors, deepwalk embeddings, node2vec embeddings, and decision tree-based features. it is not common to introduce data science and stochastic simulation modeling into the design problems of cryptoeconomic engineering; laskowski et al. [146] presented a practical evidence-based example showing how this approach can be applied to designing cryptoeconomic blockchains. yaish et al. [147] discussed the relationship between cryptocurrency mining and the market price of the specialized hardware (asics) that supports pow consensus. the authors showed that the decreasing volatility of bitcoin's price has a counterintuitive negative impact on the value of mining hardware, because miners are not financially incentivized to participate in mining when bitcoin becomes widely adopted and its volatility thus decreases. this study also revealed that a mining hardware asic can be imitated by bonds and the underlying cryptocurrencies, such as bitcoins. although diverse blockchains have been proposed in recent years, very few efforts have been devoted to measuring the performance of different blockchain systems; this part therefore reviews representative studies of performance measurements for blockchains.
the measurement metrics include throughput, security, scalability, etc. as a pioneering work in this direction, gervais et al. [152] proposed a quantitative framework, using which they studied the security and performance of several pow blockchains, such as bitcoin, litecoin, dogecoin and ethereum. the authors focused on multiple metrics of the security model, e.g., stale block rate, mining power, mining costs, the number of block confirmations, propagation ability, and the impact of eclipse attacks. they also conducted extensive simulations for the four aforementioned blockchains with respect to the impact of block interval, the impact of block size, and throughput. via the evaluation of network parameters related to the security of pow blockchains, researchers can compare the security performance objectively, which helps them devise optimal adversarial strategies and appropriate security provisions for pow blockchains.

ref. | target blockchains | evaluated metrics | experimental setup
[73] | general mining-based blockchains, e.g., bitcoin and ethereum | tps, the overheads of cross-zone transactions, the confirmation latency of transactions, etc. | monoxide was implemented in c++; rocksdb was used to store blocks and transactions. the real-world testing system was deployed on a distributed configuration of 1200 virtual machines, each with 8 cores and 32 gb memory; in total 48,000 blockchain nodes were used in the testbed.
[74] | general blockchains | throughput and confirmation latency, scalability under different numbers of clients, forking rate, and resource utilization (cpu, network bandwidth) | the prism testbed was deployed on amazon ec2 instances, each with 16 cpu cores, 16 gb ram, a 400 gb nvme ssd, and a 10 gbps network interface; in total 100 prism client instances were connected into a random 4-regular graph topology.
[75] | ethereum | |

nasir et al. [153] conducted performance measurements and a discussion of two versions of hyperledger fabric. the authors focused on metrics including execution time, transaction latency, throughput and scalability versus the number of nodes in blockchain platforms. several useful insights have been revealed for the two versions of hyperledger fabric. as already mentioned, in [73] the authors evaluated their proposed monoxide w.r.t. metrics including the scalability of tps as the number of network zones increases, the overhead of both cross-zone transactions and storage size, the confirmation latency of transactions, and the orphan rate of blocks. in [74], the authors performed rich measurements for their proposed new blockchain protocol prism under limited network bandwidth and cpu resources. the evaluated performance includes the distribution of block propagation delays, the relationship between block size and mining rate, block size versus assembly time, the expected time to reach consensus on a block hash, the expected time to reach consensus on blocks, etc. later, zheng et al. [154] proposed a scalable framework for monitoring the real-time performance of blockchain systems. this work evaluated four popular blockchain systems, i.e., ethereum, parity [158], cryptape inter-enterprise trust automation (cita) [159] and hyperledger fabric [160]. further directions in this line of blockchain data analytics include i) data analysis based on off-chain data to provide off-chain user behavior for blockchain developers, ii) exploring new features of eosio data that differ from those of ethereum, and iii) conducting a joint analysis of eosio with other blockchains.
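as a rough, self-contained illustration of the throughput and confirmation-latency metrics discussed above, the following sketch derives them from a list of observed blocks; the data layout and the example values are assumptions for illustration and are not taken from any of the cited measurement frameworks.

# sketch: derive simple performance metrics from observed blocks
# each block is (timestamp in seconds, number of transactions); values are illustrative
blocks = [(0.0, 120), (14.2, 95), (27.9, 180), (41.5, 150), (55.0, 130)]

span = blocks[-1][0] - blocks[0][0]
tps = sum(n for _, n in blocks[1:]) / span            # transactions confirmed after the first block per second
avg_interval = span / (len(blocks) - 1)               # mean block interval
confirmations = 6                                     # e.g. a bitcoin-style 6-block rule
confirmation_latency = confirmations * avg_interval   # rough expected confirmation latency

print(f"tps={tps:.1f}, block interval={avg_interval:.1f}s, "
      f"latency for {confirmations} confirmations={confirmation_latency:.0f}s")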
kalodner et al. [164] proposed blocksci, which is designed as an open-source software platform for blockchain analysis. under the architecture of blocksci, the raw blockchain data is parsed to produce the core blockchain data, including the transaction graph, indexes and scripts, which are then provided to the analysis library. together with auxiliary data including p2p data, price data and user tags, a client can either query directly or read the data through a jupyter notebook interface. to evaluate the performance of private blockchains, dinh et al. [155] proposed a benchmarking framework, named blockbench, which can measure the data processing capability and the performance of the various layers of a blockchain system. using blockbench, the authors then performed detailed measurements and analysis of three blockchains, i.e., ethereum, parity and hyperledger. the results disclosed some useful lessons about those three blockchain systems. for example, today's blockchains are not scalable w.r.t. data processing workloads, and several bottlenecks should be considered when designing the different layers of a blockchain from a software engineering perspective. ethereum has received enormous attention regarding mining challenges, the analytics of smart contracts, and the management of block mining. however, not so many efforts have been spent on information dissemination in the underlying peer-to-peer network; one related study provides a lightweight network simulator for performance evaluation, and the authors also made this simulator open-source on github.

in this section, we envision the open issues and promising directions for future studies.

6.1.3 cross-shard performance. although a number of committee-based sharding protocols [69, 73, 81, 165] have been proposed, those protocols can only tolerate at most 1/3 adversarial nodes. thus, more robust byzantine agreement protocols need to be devised. furthermore, all sharding-based protocols incur additional cross-shard traffic and latencies because of cross-shard transactions. therefore, the cross-shard performance in terms of throughput, latency and other metrics has to be well guaranteed in future studies. on the other hand, cross-shard transactions are inherent to cross-shard protocols. thus, the pros and cons of such correlations between different shards are worth investigating using suitable modelings and theories such as graph-based analysis.

6.1.4 cross-chain transaction accelerating mechanisms. on cross-chain operations, [92] is essentially a pioneering step towards practical blockchain-based ecosystems. following the roadmap paved by [92], we anticipate that subsequent related investigations will appear in the near future. for example, although the inter-chain transaction experiments achieve an initial success, we believe that secure cross-chain transaction accelerating mechanisms are still on the way. in addition, further improvements are still required for the interoperability among multiple blockchains, such as decentralized load-balancing smart contracts for sharded blockchains.

6.1.5 ordering blocks for multiple-chain protocols. although multiple-chain techniques can improve the throughput by exploiting the parallel mining of multiple chain instances, how to construct and manage the blocks in all chains in a globally consistent order is still a challenge for multiple-chain based scalability protocols and solutions.

6.1.6 hardware-assisted accelerating solutions for blockchain networks.
to improve the performance of blockchains, for example to reduce the latency of transaction confirmation, some advanced network technologies, such as rdma (remote direct memory access) and high-speed network cards, can be exploited to accelerate data access among miners in blockchain networks.

6.1.7 performance optimization in different blockchain network layers. the blockchain network is built over p2p networks, which include several typical layers, such as the mac layer, routing layer, network layer, and application layer. the bft-based protocols essentially work at the network layer. in fact, performance improvements can also be achieved by proposing various protocols, algorithms, and theoretical models for the other layers of the blockchain network.

6.1.8 blockchain-assisted bigdata networks. big data and blockchain have several characteristics that are contrary to each other. for example, big data is a centralized management technology with an emphasis on privacy preservation across diverse computing environments. the data processed by big data technology should ensure non-redundancy and an unstructured architecture in a large-scale computing network. in contrast, blockchain technology builds on a decentralized, transparent and immutable architecture, in which the data type is simple and data is structured and highly redundant. furthermore, the performance of blockchains requires scalability and the off-chain computing paradigm. thus, how to integrate those two technologies and pursue mutual benefits is an open issue that is worth in-depth study. for example, potential research topics include how to design a suitable new blockchain architecture for big data technologies, and how to break the isolated data islands using blockchains while addressing the privacy issues of big data.

promising directions for theoretical modeling include:
• exploiting more general queueing theories to capture the real-world arrival process of transactions, the mining of new blocks, and other queueing-related blockchain phases.
• performing priority-based service policies when dealing with transactions and new blocks, to meet a predefined security or regulation level.
• developing more general probabilistic models to characterize the correlations among the multiple performance parameters of blockchain systems.

6.2.2 privacy-preserving for blockchains. from the previous overview, we observe that most of the existing works under this category discuss blockchain-based security and privacy-preserving applications. the fact is that security and privacy are also critical issues of the blockchain itself. for example, the privacy of transactions could be compromised by attackers. however, dedicated studies focusing on those issues are still insufficient.

6.2.3 countermeasure mechanisms for malicious miners. cryptojacking miners are reportedly present in web browsers according to [140]. this type of malicious code commandeers hardware resources such as the computational capability and memory of web users. thus, anti-cryptojacking mechanisms and strategies need to be developed to protect normal browser users.

6.2.4 security issues of cryptocurrency blockchains. the security issues of cryptocurrency blockchains, such as double-spend attacks and frauds in smart contracts, have attracted growing attention from both industrial and academic fields. however, little effort has been committed to theoretical investigations of the security issues of cryptocurrency blockchains.
for example, the exploration of punishment and cooperation between miners over multiple chains is an interesting topic for cryptocurrency blockchains. thus, we expect to see broader perspectives on modeling the behaviors of both attackers and counterattackers in the context of monetary blockchain attacks. most beginners in the blockchain field face a lack of powerful simulation/emulation tools for verifying their new ideas or protocols. therefore, powerful simulation/emulation platforms that make it easy to deploy scalable testbeds for experiments would be very helpful to the research community.

through a brief review of state-of-the-art blockchain surveys, we found that a dedicated survey focusing on theoretical modelings, analytical models and useful experiment tools for blockchains was still missing. to fill this gap, we conducted a comprehensive survey of the state of the art on blockchains, particularly from the perspectives of theories, modelings, and measurement/evaluation tools. the taxonomy of each topic presented in this survey tries to convey the new protocols, ideas, and solutions that can improve the performance of blockchains, and to help people understand blockchains at a deeper level. we believe our survey provides timely guidance on the theoretical insights of blockchains for researchers, engineers, educators, and general readers.

references
survey of consensus protocols on blockchain applications
blockchain consensus algorithms: the state of the art and future trends
a survey on consensus mechanisms and mining management in blockchain networks
sok: a consensus taxonomy in the blockchain era
a survey about consensus algorithms used in blockchain
a survey on consensus mechanisms and mining strategy management in blockchain networks
sok: consensus in the age of blockchains
a survey of distributed consensus protocols for blockchain networks
a survey of attacks on ethereum smart contracts
blockchain-based smart-contract languages: a systematic literature review
an overview on smart contracts: challenges, advances and platforms
sok: sharding on blockchain
survey: sharding in blockchains
research on scalability of blockchain technology: problems and methods
solutions to scalability of blockchain: a survey
sok: communication across distributed ledgers
a systematic literature review of blockchain cyber security
a survey of blockchain from security perspective
a survey of blockchain technology on security, privacy, and trust in crowdsourcing services
the security of big data in fog-enabled iot applications including blockchain: a survey
a survey on privacy protection in blockchain system
a comprehensive survey on blockchain: working, security analysis, privacy threats and potential applications
blockchain data analysis: a review of status, trends and challenges
dissecting ponzi schemes on ethereum: identification, analysis, and impact
blockchain for cloud exchange: a survey
blockchain for ai: review and open research challenges
blockchain intelligence: when blockchain meets artificial intelligence
when machine learning meets blockchain: a decentralized, privacy-preserving and secure design
blockchain and machine learning for communications and networking systems
a survey on blockchain: a game theoretical perspective
blockchain security in cloud computing: use cases, challenges, and solutions
when mobile blockchain meets edge computing
integrated blockchain and edge computing systems: a survey, some research issues and challenges
blockchain for 5g and beyond networks: a state of the art survey
blockchains and smart contracts for the internet of things
applications of blockchains in the internet of things: a comprehensive survey
a review on the use of blockchain for the internet of things
internet of things security: a top-down survey
blockchain and iot integration: a systematic survey
blockchain for internet of things: a survey
survey on blockchain for internet of things
integration of blockchain and cloud of things: architecture, applications and challenges
blockchain for the internet of things: present and future
when internet of things meets blockchain: challenges in distributed consensus
blockchain technology toward green iot: opportunities and challenges
a survey of iot applications in blockchain systems: architecture, consensus, and traffic modeling
blockchain applications for industry 4.0 and industrial iot: a review
edge intelligence and blockchain empowered 5g beyond for the industrial internet of things
applications of blockchain in unmanned aerial vehicles: a review
blockchain: a survey on functions, applications and open issues
a systematic literature review of blockchain-based applications: current status, classification and open issues
blockchain in agriculture: a systematic literature review
security and privacy for green iot based agriculture: review, blockchain solutions and challenges
deployment of blockchain technology in software defined networks: a survey
blockchain for business applications: a systematic literature review
a survey of blockchain technology applied to smart cities: research issues and challenges
blockchain in smart grids: a review on different use cases
blockchain technology for smart grids: decentralized nist conceptual model
when blockchain meets distributed file systems: an overview, challenges, and open issues
blockchain in space industry: challenges and solutions
blockchain and ai-based solutions to combat coronavirus (covid-19)-like epidemics: a survey
blockchain: the state of the art and future trends
an overview of blockchain technology: architecture, consensus, and future trends
blockchain challenges and opportunities: a survey
blockchain and cryptocurrencies: model, techniques, and applications
core concepts, challenges, and future directions in blockchain: a centralized tutorial
bitcoin: a peer-to-peer electronic cash system
a secure sharding protocol for open blockchains
omniledger: a secure, scale-out, decentralized ledger via sharding
double-spend counterattacks: threat of retaliation in proof-of-work systems
ethereum: a secure decentralised generalised transaction ledger
accel: accelerating the bitcoin blockchain for high-throughput, low-latency applications
monoxide: scale out blockchains with asynchronous consensus zones
prism: scaling bitcoin by 10,000 x
garet: improving throughput using gas consumption-aware relocation in ethereum sharding environments
erasure code-based low storage blockchain node
jidar: a jigsaw-like data reduction approach without trust assumptions for bitcoin system
segment blockchain: a size reduced storage mechanism for blockchain
on availability for blockchain-based systems
selecting reliable blockchain peers via hybrid blockchain reliability prediction
rapidchain: scaling blockchain via full sharding
sharper: sharding permissioned blockchains over network clusters
gas consumption-aware dynamic load balancing in ethereum sharding environments
a node rating based sharding scheme for blockchain
optchain: optimal transactions placement for scalable blockchain sharding
towards scaling blockchain systems via sharding
sschain: a full sharding protocol for public blockchain without data migration overhead
eunomia: a permissionless parallel chain protocol based on logical clock
on the feasibility of sybil attacks in shard-based permissionless blockchains
an n/2 byzantine node tolerate blockchain sharding approach
cycledger: a scalable and secure parallel protocol for distributed ledger via sharding
towards a novel architecture for enabling interoperability amongst multiple blockchains
hyperservice: interoperability and programmability across heterogeneous blockchains
smart contracts on the move
enabling cross-chain transactions: a decentralized cryptocurrency exchange protocol
scalable blockchain protocol based on proof of stake and sharding
ouroboros praos: an adaptively-secure, semi-synchronous proof-of-stake blockchain
the latest gossip on bft consensus
a proof-of-trust consensus protocol for enhancing accountability in crowdsourcing services
streamchain: do blockchains need blocks
caper: a cross-application permissioned blockchain
incentive mechanism for edge computing-based blockchain
axechain: a secure and decentralized blockchain for solving easily-verifiable problems
nonlinear blockchain scalability: a game-theoretic perspective
credit-based payments for fast computing resource trading in edge-assisted internet of things
secure high-rate transaction processing in bitcoin
phantom: a scalable blockdag protocol
scaling nakamoto consensus to thousands of transactions per second
understanding ethereum via graph analysis
bitcoin risk modeling with blockchain graphs
blockchain analytics for intraday financial risk modeling
forecasting bitcoin price with graph chainlets
chainnet: learning on blockchain graphs with topological features
market manipulation of bitcoin: evidence from mining the mt. gox transaction network
measuring ethereum-based erc20 token networks
network analysis of erc20 tokens trading on ethereum blockchain
exploring eosio via graph characterization
stochastic models and wide-area network measurements for blockchain design and analysis
stability and scalability of blockchain systems
a probabilistic security analysis of sharding-based blockchain protocols
a methodology for a probabilistic security analysis of sharding-based blockchain protocols
new mathematical model to analyze security of sharding-based blockchain protocols
blockchain queue theory
markov processes in blockchain systems
learning blockchain delays: a queueing theory approach
a bitcoin-inspired infinite-server model with a random fluid limit
toward low-cost and stable blockchain networks
simulation model for blockchain systems using queuing theory
do you need a blockchain?
modeling and understanding ethereum transaction records via a complex network approach
an analysis of the fees and pending time correlation in ethereum blockchain
competition between miners: a game theoretic perspective
an analysis of blockchain consistency in asynchronous networks: deriving a neat bound
modeling the impact of network connectivity on consensus security of proof-of-work blockchain
challenges and pitfalls of partitioning blockchains
divide and scale: formalization of distributed ledger sharding protocols
corking by forking: vulnerability analysis of blockchain
from byzantine replication to blockchain: consensus is only the beginning
the economic limits of bitcoin and the blockchain
the browsers strike back: countering cryptojacking and parasitic miners on the web
capjack: capture in-browser crypto-jacking by deep capsule network through behavioral analysis
characterizing and detecting money laundering activities on the bitcoin network
analyzing the bitcoin ponzi scheme ecosystem
detecting ponzi schemes on ethereum: towards healthier blockchain technology
exploiting blockchain data to detect smart ponzi schemes on ethereum
evidence based decision making in blockchain economic systems: from theory to practice
pricing asics for cryptocurrency mining
a first look at browser-based cryptojacking
dynamic routing between capsules
data mining for detecting bitcoin ponzi schemes
money laundering in the bitcoin network: perspective of mixing services
on the security and performance of proof of work blockchains
performance analysis of hyperledger fabric platforms
a detailed and real-time performance monitoring framework for blockchain systems
blockbench: a framework for analyzing private blockchains
measuring ethereum network peers
local bitcoin network simulator for performance evaluation using lightweight virtualization
parity documentation
cita technical whitepaper
hyperledger fabric: a distributed operating system for permissioned blockchains
performance monitoring
xblock-eth: extracting and exploring blockchain data from ethereum
xblock-eos: extracting and exploring blockchain data from eosio
blocksci: design and applications of a blockchain analysis platform
the honey badger of bft protocols

key: cord-133273-kvyzuayp title: artificial intelligence: research impact on key industries; the upper-rhine artificial intelligence symposium (ur-ai 2020) date: 2020-10-05 journal: nan doi: nan sha: doc_id: 133273 cord_uid: kvyzuayp

the trirhenatech alliance presents a collection of accepted papers of the cancelled tri-national 'upper-rhine artificial intelligence symposium' planned for 13th may 2020 in karlsruhe. the trirhenatech alliance is a network of universities in the upper-rhine trinational metropolitan region comprising the german universities of applied sciences in furtwangen, kaiserslautern, karlsruhe, and offenburg, the baden-wuerttemberg cooperative state university loerrach, the french university network alsace tech (comprised of 14 'grandes écoles' in the fields of engineering, architecture and management) and the university of applied sciences and arts northwestern switzerland. the alliance's common goal is to reinforce the transfer of knowledge, research, and technology, as well as the cross-border mobility of students.

in the area of privacy-preserving machine learning, many organisations could potentially benefit from sharing data with other, similar organisations to train good models.
health insurers could, for instance, work together on solving the automated processing of unstructured paperwork such as insurers' claim receipts. the issue here is that organisations cannot share their data with each other for confidentiality and privacy reasons, which is why secure collaborative machine learning, where a common model is trained on distributed data without allowing information from the participants to be reconstructed, is gaining traction. this shows that the biggest problem in the area of privacy-preserving machine learning is not technical implementation, but how much the entities involved (decision makers, legal departments, etc.) trust the technologies. as a result, the degree to which ai can be explained, and the amount of trust people have in it, will be an issue requiring attention in the years to come.

the representation of language has undergone enormous development of late: new models and variants, which can be used for a range of natural language processing (nlp) tasks, seem to pop up almost monthly. such tasks include machine translation, extracting information from documents, text summarisation and generation, document classification, bots, and so forth. the new generation of language models, for instance, is advanced enough to be used to generate completely realistic texts. these examples reveal the rapid development currently taking place in the ai landscape, so much so that the coming year may well witness major advances or even a breakthrough in the following areas:

• healthcare sector (reinforced by the covid-19 pandemic): ai facilitates the analysis of huge amounts of personal information, diagnoses, treatments and medical data, as well as the identification of patterns and the early identification and/or cure of disorders.
• privacy concerns: how civil society should respond to the fast increasing use of ai remains a major challenge in terms of safeguarding privacy. the sector will need to explain ai to civil society in ways that can be understood, so that people can have confidence in these technologies.
• ai in retail: increasing reliance on online shopping (especially in the current situation) will change the way traditional (food) shops function. we are already seeing signs of new approaches with self-scanning checkouts, but this is only the beginning. going forward, food retailers will (have to) increasingly rely on a combination of staff and automated technologies to ensure cost-effective, frictionless shopping.
• process automation: an ever greater proportion of production is being automated or performed by robotic methods.
• bots: progress in the field of language (especially in natural language processing, outlined above) is expected to lead to major advances in the take-up of bots, such as in customer service, marketing, help desk services, healthcare/diagnosis, consultancy and many other areas.

the rapid pace of development means it is almost impossible to predict either the challenges we will face in the future or the solutions destined to simplify our lives. one thing we can say is that there is enormous potential here. the universities in the trirhenatech alliance are actively contributing interdisciplinary solutions to the development of ai and its associated technical, societal and psychological research questions.

utilizing toes of a humanoid robot is difficult for various reasons, one of which is that inverse kinematics is overdetermined with the introduction of toe joints.
nevertheless, a number of robots with either passive toe joints like the monroe or hrp-2 robots [1, 2] or active toe joints like lola, the toyota robot or toni [3, 4, 5] have been developed. recent work shows considerable progress on learning model-free behaviors using genetic learning [6] for kicking with toes and deep reinforcement learning [7, 8, 9] for walking without toe joints. in this work, we show that toe joints can significantly improve the walking behavior of a simulated nao robot and can be learned model-free. the remainder of this paper is organized as follows: section 2 gives an overview of the domain in which learning took place. section 3 explains the approach for model-free learning with toes. section 4 contains empirical results for the various behaviors trained before we conclude in section 5.

the robots used in this work are robots of the robocup 3d soccer simulation, which is based on simspark and was initiated by [10]. it uses the ode physics engine and runs at an update speed of 50hz. the simulator provides variations of aldebaran nao robots with 22 dof for the robot types without toes and 24 dof for the type with toes, naotoe henceforth. more specifically, the robot has 6 (7) dof in each leg, 4 in each arm and 2 in its neck. there are several simplifications in the simulation compared to the real nao:
- all motors of the simulated nao are of equal strength, whereas the real nao has weaker motors in the arms and different gears in the leg pitch motors
- joints do not experience extensive backlash
- the rotation axes of the hip yaw joints are identical in both robots, but the simulated robot can move hip yaw for each leg independently, whereas for the real nao, left and right hip yaw are coupled
- the simulated naos do not have hands
- the touch model of the ground is softer and therefore more forgiving to stronger ground touches in the simulation
- energy consumption and heat are not simulated
- masses are assumed to be point masses in the center of each body part

the feet of naotoe are modeled as rectangular body parts of size 8cm x 12cm x 2cm for the foot and 8cm x 4cm x 1cm for the toes (see figure 1). the two body parts are connected with a hinge joint that can move from -1 degrees (downward) to 70 degrees. all joints can move at an angular speed of at most 7.02 degrees per 20ms. the simulation server expects to get the desired speed at 50 hz for each joint. if no speeds are sent to the server, it will continue the movement of the joint with the last speed received. joint angles are noiselessly perceived at 50hz, but with a delay of 40ms compared to sent actions. so only after two cycles does the robot know the result of a triggered action. a controller provided for each joint inside the server tries to achieve the requested speed, but is subject to maximum torque, maximum angular speed and maximum joint angles. the simulator is able to run 22 simulated naos in real-time on reasonable cpus. it is used as the competition platform for the robocup 3d soccer simulation league. in this context, only a single agent was running in the simulator.

the following subsections describe how we approached the learning problem. this includes a description of the design of the behavior parameters used, what the fitness functions for the genetic algorithm look like, which hyperparameters were used and how the fitness calculation in the simspark simulation environment works exactly. the guiding goal behind our approach is to learn a model-free walk behavior.
with model-free we denote an approach that does not make any assumptions about a robot's architecture nor the task to be performed. thus, from the viewpoint of learning, our model consists of a set of flat parameters. these parameters are later grounded inside the domain. the server requires 50 values per second for each joint. to reduce the search space, we make use of the fact that the output values of a joint over time are not independent. therefore, we learn keyframes, i.e. all joint angles for discrete phases of movement together with the duration of the phase from keyframe to keyframe. the experiments described in this paper used four to eight such phases. the number of phases is variable between learning runs, but not subject to learning for now, except for skipping phases by learning a zero duration for them. the robocup server requires robots to send the angular speed of each joint as a command. when only leg joints are included, this would require learning 15 parameters per phase (14 joints + 1 for the duration of the phase), resulting in 60, 90 and 120 parameters for the 4, 6 and 8 phases worked with. the disadvantage of this approach is that the speed during a particular phase is constant, making it unable to adapt to discrepancies between the desired and the actual motor movement. therefore, a combination of the angular value and the maximum amount of angular speed each joint should have is used. the direction and final value of the movement are entirely encoded in the angular values, but the speed can be controlled separately. it follows that:
- if the amount of angular speed does not allow reaching the angular value, the joint behaves like in the first variant.
- if the amount of angular speed is bigger, the joint stops moving even if the phase is not over.

this almost doubles the number of parameters to learn, but the co-domain of the speed values is half the size, since here we only require an absolute amount of angular speed. with these parameters, the robot learns a single step and mirrors the movement to get a double step.

feedback from the domain is provided by a fitness function that defines the utility of a robot. the fitness function subtracts a penalty for falling from the walked distance in x-direction in meters. there is also a penalty for the maximum deviation in y-direction reached during an episode, weighted by a constant factor:

fitness_walk = distance_x - fallenPenalty - f * maxY   (1)

where fallenPenalty is only applied if the robot fell during the episode. in practice, the values chosen for fallenPenalty and the factor f were usually 3 and 2 respectively. this same fitness function can be used without modification for forward, backward and sideward walk learning, simply by adjusting the initial orientation of the agent. the turn behavior, which was also trained, requires a different fitness function:

fitness_turn = g * totalTurn - distance   (2)

where totalTurn refers to the cumulative rotation performed in degrees, weighted by a constant factor g (typically 1/100). we penalize any deviation from the initial starting x/y position (distance) as an incentive to turn in place. it is noteworthy that, other than swapping out the fitness function and a few more minor adjustments mentioned in section 3.3, everything else about the learning setup remained the same thanks to the model-free approach. naturally, the fitness calculation for an individual requires connecting an agent to the simspark simulation server and having it execute the behavior defined by the learned parameters.
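before the detailed episode procedure below, here is a minimal, hedged sketch of how the fitness functions (1) and (2) could be computed from episode statistics; the function and variable names as well as the example values are placeholders and not taken from the original implementation.

def fitness_walk(distance_x, max_y_deviation, fallen, fallen_penalty=3.0, f=2.0):
    # walked distance in x-direction minus penalties for falling and for lateral deviation,
    # following equation (1); the falling penalty is only applied if the robot fell
    return distance_x - (fallen_penalty if fallen else 0.0) - f * abs(max_y_deviation)

def fitness_turn(total_turn_deg, distance_from_start, g=0.01):
    # cumulative rotation in degrees (weighted by g) minus drift away from the start position,
    # following equation (2)
    return g * total_turn_deg - distance_from_start

# illustrative episode statistics
print(fitness_walk(distance_x=4.8, max_y_deviation=0.3, fallen=False))
print(fitness_turn(total_turn_deg=540.0, distance_from_start=0.2))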
in detail, the fitness calculation works as follows: at the start of each "episode", the agent starts walking with the old model-based walk engine at full speed. once 80 simulation cycles (roughly 1.5 seconds) have elapsed, the robot starts checking the foot force sensors. as soon as the left foot touches the ground, it switches to the learned behavior. this ensures that the learned walk has comparable starting conditions each time. if this does not occur within 70 cycles (which sometimes happens due to non-determinism in the domain and noise in the foot force perception), the robot switches anyway. from that point on, the robot keeps performing the learned behavior that represents a single step, alternating between the original learned parameters and a mirrored version (right step and left step). an episode ends once the agent has fallen or 8 seconds have elapsed. to train different walk directions (forward, backward, sideward), the initial orientation of the player is simply changed accordingly. in addition, the robot uses a different walk direction of the model-based walk engine for the initial steps that are not subject to learning. in case of training a morphing behavior (see section 4.5), the episode duration is extended to 12 seconds. when a morphing behavior is to be trained, the step behavior from another learning run is used. this also means that a morphing behavior is always trained for a specific set of walk parameters. after 6 seconds, the morphing behavior is triggered once the foot force sensors detect that the left foot has just touched the ground. unlike the step/walk behavior, this behavior is just executed once and not mirrored or repeated. then the robot switches back to walking at full speed with the model-based walk engine. to maximize the reward, the agent has to learn a morphing behavior that makes the transition between the learned model-free and the old model-based walk work as reliably as possible. finally, for the turn behavior, the robot keeps repeating the learned behavior without alternating with a mirrored version. in any case, if the robot falls, a training run is over. the overall runtime of each such learning run is 2.5 days on our hardware. learning is done using plain genetic algorithms; details on the hyperparameters used and on the approach can be found in [11].

this section presents the results for each kind of behavior trained. this includes three different walk directions, a turn behavior and a behavior for morphing. the main focus of this work has been on training a forward walk movement. figure 2 shows a sequence of images for a learned step. the best result reaches a speed of 1.3 m/s compared to the 1.0 m/s of our model-based walk and 0.96 m/s for a walk behavior learned on the nao robot without toes. the learned walk with toes is less stable, however, and shows a fall rate of 30% compared to 2% of the model-based walk. regarding the characteristics of this walk, it utilizes remarkably long steps. table 1 shows an in-depth comparison of various properties, including step duration, length and height, which are all considerably bigger compared to our previous model-based walk. the forward leaning of the agent has increased by 80.4%, while 28.1% more time is spent with both legs off the ground. however, the maximum deviation from the intended path (maxY) has also increased by 137.8%.
table 1: comparison of the previously fastest and the fastest learned forward walk

once a working forward walk was achieved, it was natural to try to train a backward walk behavior as well, since this only requires a minor modification in the learning environment (changing the initial rotation of the agent and the model-based walk direction to start with). the best backward walk learned reaches a speed of 1.03 m/s, which is significantly faster than the 0.67 m/s of its model-based counterpart. unfortunately, the agent also falls 15% more frequently. it is interesting just how backward-leaning the agent is during this walk behavior. it could almost be described as "controlled falling" (see figure 3).

sideward walk learning was the least successful of the three walk directions. like with all directions, the agent starts out using the old walk engine and then switches to the learned behavior after a short time. in this case however, instead of continuing to walk sideward, the agent has learned to turn around and walk forward instead, see figure 4. the resulting forward walk is not very fast and usually causes the agent to fall within a few meters, but it is still remarkable that the learned behavior manages to both turn the agent around and make it walk forward with the same repeating step movement. it is also remarkable that the robot learned that, with the given legs, it is quicker at least over long distances to turn and run forward than to keep making sidesteps.

with the alternate fitness function presented in section 3, the agent managed to learn a turn behavior that is comparable in speed to that of the existing walk engine. despite this, the approach is actually different: while the old walk engine uses small, angled steps, the learned behavior uses the left leg as a "pivot", creating angular momentum with the right leg. figure 5 shows the movement sequence in detail. unfortunately, despite the comparable speed, the learned turn behavior suffers from much worse stability. with the old turn behavior, the agent only falls in roughly 3% of cases, with the learned behavior it falls in roughly 55% of the attempts.

one of the major hurdles for using the learned walk behaviors in a robocup competition is the smooth transition between them and other existing behaviors such as kicks. the initial transition to the learned walk is already built into the learning setup described in section 3 by switching mid-walk, so it does not have to be given special consideration. more problematic is switching to another behavior afterwards without falling. to handle this, we simply attempted to train a "morphing" behavior using the same model-free learning setup. the result is something that could be described as a "lunge" (see figure 6) that reduces the forward momentum sufficiently to allow the transition to the slower model-based walk when successful.
we think this is an inherent limitation of the approach: we train a static behavior that is unable to adapt to changing circumstances in the environment, which is common in simspark's non-deterministic simulation with perception noise. deep reinforcement learning seems more promising in this regard, as the neural network can dynamically react to the environment since sensor data serves as input. it is also arguably even less restrictive than the keyframe-based behavior parameterization we presented in this paper, as a neural network can output raw joint actions each simulation cycle. at least two other robocup 3d simulation league teams, fc portugal [8] and itandroids [9] , have had great success with this approach, everything points towards this becoming the state-of-the-art approach in robocup 3d soccer simulation in the near future, so we want to concentrate our future efforts here as well. retail companies dealing in alcoholic beverages are faced with a constant flux of products. apart from general product changes like modified bottle designs and sizes or new packaging units two factors are responsible for this development. the first is the natural wine cycle with new vintages arriving at the market and old ones cycling out each year. the second is the impact of the rapidly growing craft beer trend which has also motivated established breweries to add to their range. the management of the corresponding product data is a challenge for most retail companies. the reason lies in the large amount of data and its complexity. data entry and maintenance processes are linked with considerable manual effort resulting in high data management costs. product data attributes like dimensions, weights and supplier information are often entered manually into the data base and are often afflicted with errors. another widely used source of product data is the import from commercial data pools. a means of checking the data thus acquired for plausibility is necessary. sometimes product data is incomplete due to different reasons and a method to fill the missing values is required. all these possible product data errors lead to complications in the downstream automated purchase and logistics processes. we propose a machine learning model which involves domain specific knowledge and compare it a heuristic approach by applying both to real world data of a retail company. in this paper we address the problem of predicting the gross weight of product items in the merchandise category alcoholic beverages. to this end we introduce two levels of additional features. the first level consists of engineered features which can be determined by the basic features alone or by domain specific expert knowledge like which type of bottle is usually used for which grape variety. in the next step an advanced second level feature is computed from these first level features. adding these two levels of engineered features increases the prediction quality of the suggestion values we are looking for. the results emphasize the importance of careful feature engineering using expert knowledge about the data domain. feature engineering is the process of extracting features from the data in order to train a prediction model. it is a crucial step in the machine learning pipeline, because the quality of the prediction is based on the choice of features used to training. the majority of time and effort in building a machine learning pipeline is spent on data cleaning and feature engineering [domingos 2012] . 
a first overview of basic feature engineering principles can be found in [zheng 2018]. the main problem is the dependency of the feature choice on the data set and the prediction algorithm. what works best for one combination does not necessarily work for another. a systematic approach to feature engineering without expert knowledge about the data is given in [heaton 2016]. the authors present a study of whether different machine learning algorithms are able to synthesize engineered features on their own. logarithms, ratios, powers and other simple mathematical functions of the original features are used as engineered features. in [anderson 2017] a framework for automated feature engineering is described.

the data set is provided by a major german retail company and consists of 3659 beers and 10212 wines. each product is characterized by the seven features shown in table 1. the product name follows only a loosely standardized format. depending on the user generating the product entry in the company data base, abbreviation style and other editing may vary. the product group is a company-specific number which encodes the product category - dairy products, vegetables or soft drinks for example. in our case it allows a differentiation of the products into beer and wine. additionally, wines are grouped by country of origin and, for germany, also into wine-growing regions. note that the product group is not an inherent feature like length, width, height and volume, but depends on the product classification system a company uses. the dimensions length, width, height and the volume derived by multiplying them are given as float values. the feature (gross) weight, also given as a float value, is what we want to predict.

as is often the case with real-world data, a pre-processing step has to be performed prior to the actual machine learning in order to reduce data errors and inconsistencies. for our data we first removed all articles missing one or more of the required attributes of table 1. then all articles with dummy values were identified and discarded. dummy values are often introduced due to internal process requirements but do not add any relevant information to the data. if, for example, the attribute weight has to be filled for an article during article generation in order to proceed to the next step but the actual value is not known, often a dummy value of 1 or 999 is entered. these values distort the prediction model when used as training data in the machine learning step. the product name is subjected to lower casing and substitution of special german characters like umlauts. special symbolic characters like #, ! or separators are also deleted. following this formal data cleaning we perform an additional content-focused pre-processing. the feature weight is discretized by binning it with bin width 10g. volume is likewise treated with bin size 10ml. this simplifies the value distribution without rendering it too coarse. all articles where length is not equal to width are removed, because these are not single items but packages of items. with this preprocessing done, the data is ready to be used for feature engineering.

often the data at hand is not sufficient to train a meaningful prediction model. in these cases feature engineering is a promising option. identifying and engineering new features depends heavily on expert knowledge of the application domain. the first level consists of engineered features which can be determined by the original features alone.
in the next step, advanced second-level features are computed from these first-level and the original features. for our data set the original features are product name and group as well as the dimensions length, width, height and volume. we see that the volume is computed in the most general way by multiplication of the dimensions. geometrically this corresponds to all products being modelled as cuboids. since angular beer or wine bottles are very much the exception in the real world, a sensible new feature would be a more appropriate modelling of the bottle shape. since weight is closely correlated to volume, the better the volume estimate, the better the weight estimate. to this end we propose four first-level engineered features: capacity, wine bottle type, beer packaging type and beer bottle type, which are in turn used to compute a second-level engineered feature, namely the packaging-specific volume. figure 1 shows all discussed features and their interdependencies.

let us have a closer look at the first-level engineered features. the capacity of a beverage states the amount of liquid contained and is usually limited to a few discrete values. 0.33l and 0.5l are typical values for beer cans and bottles, while wines are almost exclusively sold in 0.75l bottles and sometimes in 0.375l bottles. the capacity can be estimated from the given volume with sufficient certainty using appropriate threshold values. outliers were removed from the data set. there are three main beer packaging types in retail: cans, bottles and kegs. while kegs are mainly of interest to pubs and restaurants and are not considered in this paper, cans and bottles target the typical supermarket shopper and come in a greater variety. in our data set, the product name in the case of beers is preceded by a prefix denoting whether the product is packaged in a can or a bottle. extracting the relevant information is done using regular expressions. note, though, that the prefix is not always correct and needs to be checked against the dimensions. the shapes of cans are the same for all practical purposes, no matter the capacity. the only difference is in their wall thickness, which depends on the material, aluminium and tinplate being the two common ones. the difference in weight is small and the actual material used is impossible to extract from the data. a further distinction of cans into different types, e.g. for beer and wine, is therefore unnecessary. regarding the german beer market, the five bottle types shown in figure 2 cover the vast majority of products. the engineered feature beer packaging type assigns each article identified as beer by its product group to one of the classes bottle or can. the feature beer bottle type contains the most probable member of the five main beer bottle types. packages containing more than one bottle or can, like crates or six-packs, are not considered in this paper and were removed from the data set.

compared to beer, the variety of commercially sold wine packagings is limited to bottles only. a corresponding packaging type attribute to distinguish between cans and bottles is not necessary. again there are a few bottle types which are used for the majority of wines, namely schlegel, bordeaux and burgunder (figure 3). deciding which product is filled in which bottle type is a question of domain knowledge. the original data set does not contain a corresponding feature. from the product group the country of origin and, in the case of german wines, the region can be determined via a mapping table.
this depends on the type of product classification system the respective company uses and need not be valid for all companies. our data set uses a customer-specific classification with focus on germany. a more general one would be the global product classification (gpc) standard, for example. to determine wine-growing regions in non-german countries like france, the product name has to be analyzed using regular expressions. the type of grape is likewise to be deduced from the product name if possible. using the country and specifically the region of origin and type of grape of the wine in question is the only way to assign a bottle type with acceptable certainty. there are countries and regions in which a certain bottle type is used predominantly, sometimes also depending on the color of the wine. the schlegel bottle, for example, is almost exclusively used for german and alsatian white wines and almost nowhere else. bordeaux and burgunder bottles on the other hand are used throughout the world. some countries and regions like california or chile use a mix of bottle types for their wines, which poses an additional challenge. with expert knowledge one can assign regions and grape types to the different bottle types. as with beer bottles, this categorization is by no means comprehensive or free of exceptions but serves as a first step.

the standard volume computation by multiplying the product dimensions length, width and height is a rather coarse cuboid approximation to the real shape of alcoholic beverage packagings. since the volume is intrinsically linked to the weight which we want to predict, a packaging-type-specific volume computation is required for cans and especially bottles. the modelling of a can is straightforward using a cylinder with the given height h and a diameter equal to the given width w (which equals the length). thus the packaging-type-specific volume is:

V_can = π * (w/2)^2 * h   (1)

a bottle on the other hand needs to be modelled piecewise. its height can be divided into three parts: base, shoulders and neck, as shown in figure 4. base and neck can be modeled by a cylinder. the shoulders are approximated by a truncated cone. with the help of the corresponding partial heights h_base, h_shoulder and h_neck we can compute coefficients b, s and n as fractions of the overall height h of the bottle. the diameters of the bottle base and the neck opening are likewise used to compute their ratio d. since bottles have circular bases, the values for width and length in the original data have to be the same and either one may be used as the base diameter w. these four coefficients are characteristic for each bottle type, be it beer or wine (table 3). with their help, a bottle-type-specific volume can be computed from the original data length, width and height, which is a much better approximation to the true volume than the former cuboid model. the bottle base can be modelled as a cylinder as follows:

V_base = π * (w/2)^2 * b * h   (2)

the bottle shoulders have the form of a truncated cone and are described by formula 3:

V_shoulder = (π * s * h / 3) * ((w/2)^2 + (w/2) * (d*w/2) + (d*w/2)^2)   (3)

the bottle neck again is a simple cylinder:

V_neck = π * (d*w/2)^2 * n * h   (4)

summing up all three sections yields the packaging-type-specific volume for bottles:

V_bottle = V_base + V_shoulder + V_neck   (5)

the experiments follow the multi-level feature engineering scheme as shown in figure 1. first, we use only the original features product group and dimensions. then we add the first-level engineered features capacity and bottle type to the basic features. next, the second-level engineered feature packaging-type-specific volume is used along with the basic features. finally, all features from every level are used for the prediction.
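as a small, hedged illustration of the packaging-type-specific volume computation described by formulas (1)-(5), the following sketch implements it in python; the coefficient values used in the example call are placeholders and are not the values from table 3.

import math

def can_volume(width, height):
    # cylinder model of a can, formula (1); width is the outer diameter
    return math.pi * (width / 2.0) ** 2 * height

def bottle_volume(width, height, b, s, n, d):
    # piecewise bottle model, formulas (2)-(5):
    # b, s, n are the base/shoulder/neck heights as fractions of the total height,
    # d is the ratio of the neck diameter to the base diameter
    r_base, r_neck = width / 2.0, d * width / 2.0
    v_base = math.pi * r_base ** 2 * b * height                                                # (2)
    v_shoulder = (math.pi * s * height / 3.0) * (r_base ** 2 + r_base * r_neck + r_neck ** 2)  # (3)
    v_neck = math.pi * r_neck ** 2 * n * height                                                # (4)
    return v_base + v_shoulder + v_neck                                                        # (5)

# illustrative call for a longneck-style beer bottle (coefficients are placeholders)
print(bottle_volume(width=6.1, height=29.0, b=0.55, s=0.20, n=0.25, d=0.42))
print(can_volume(width=6.7, height=11.5))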
after pre-processing and feature engineering, the data set size is reduced from 3659 to 3380 beers and from 10212 to 8946 wines. for the prediction of the continuous-valued attribute gross weight, we use and compare several regression algorithms. both the decision-tree based random forests algorithm (breiman, 2001) and support vector machines (svm) (cortes, 1995) are available in regression mode (smola, 1997). linear regression (lai, 1979) and stochastic gradient descent (sgd) (taddy, 2019) are also employed as examples of more traditional statistics-based methods. our baseline is a heuristic approach taking the median of the attribute gross weight for each product group and using this value as a prediction for all products of the same product group. practical experience has shown this to be a surprisingly good strategy. the implementation was done in python 3.6 using the standard libraries scikit-learn and pandas. all numeric features were logarithmized prior to training the models. the non-numeric feature bottle type was converted to numbers. the final results were obtained using tenfold cross validation (kohavi, 1995). for model training 80% of the data was used while the remaining 20% constituted the test data. we used the root mean square error (rmse)

rmse = sqrt( (1/n) * Σ_i (w_i - ŵ_i)^2 )   (6)

as well as the mean and variance of the absolute percentage error

δ_i = |w_i - ŵ_i| / w_i   (7)

as metrics for the evaluation of the performance of the algorithms, where w_i denotes the true gross weight of article i and ŵ_i the predicted one.

all machine learning algorithms deliver significant improvements regarding the observed metrics compared to the heuristic median approach. the best results for each feature combination are highlighted in bold script. the results for the beer data set in table 4 show that the rmse can be more than halved, the mean of δ reduced to almost a third and the variance of δ quartered compared to the baseline approach. the random forest regressor achieves the best results in terms of rmse and δ for almost all feature combinations, except for the basic features and the basic features combined with the packaging-type-specific volume, in which cases support vector machines prove superior. linear regression and sgd are still better than the baseline approach but not on par with the other algorithms. linear regression shows the tendency to improved results when successively adding features. sgd on the other hand exhibits no clear relation between the number and level of features and the corresponding prediction quality. a possible cause could be the choice of hyperparameters. sgd is very sensitive in this regard and depends more heavily upon a higher number of correctly adjusted hyperparameters than the other algorithms we used. random forests is a method which is very well suited to problems where there is no easily discernible relation between the features. it is prone to overfitting, though, which we tried to avoid by using 20% of all data as test data. adding more engineered features leads to increasingly better results using random forests, with an outlier for the packaging-type-specific volume feature. svm are not affected by adding only first-level engineered features but profit from using the bottle-type-specific volume. regarding the wine data set, the results depicted in table 5 are not as good as for the beer data set, though still much better than the baseline approach. a reduction of the rmse by over 29% and of the mean of δ by almost 50% compared to the baseline were achieved. the variance of δ could even be limited to under 10% of the baseline value. again random forests is the algorithm with the best metrics.
linear regression and svm are comparable in terms of the mean absolute percentage error, while sgd is worse but shows good rmse values. in conclusion, the general results for the wine data set show little improvement when applying the additional engineered features.

6 discussion and conclusion

the experiments show a much better prediction quality for beer than for wine. a possible cause could be the higher weight variance among wine bottle types compared to beer bottles and cans. it is also more difficult to correctly determine the bottle type for wine, since the higher overlap in dimensions does not allow the bottle type to be computed from idealized bottle dimensions. using expert knowledge to assign the bottle type by region and grape variety seems not to be as reliable either. especially with regard to the lack of a predominant bottle type in the region with the most bottles (red wine from baden, for example), this approach should be improved. bordeaux bottles in particular often sport an indentation in the bottom, called a 'culot de bouteille'. the size and thickness of this indentation cannot be inferred from the bottle's dimensions. this means that the relation between bottle volume and weight is skewed compared to other bottles without such indentations, which in turn decreases prediction quality. predicting gross weights with machine learning and domain-specifically engineered features leads to smaller discrepancies than simple heuristic approaches. this is important for retail companies, since large deviations are much worse for logistical reasons than small ones, which may well be within natural production tolerances for bottle weights. our method makes it possible to check manually generated as well as data-pool-imported product data for implausible gross weight entries and proposes suggested values in case of missing entries.

the method we presented can easily be adapted to non-alcoholic beverages using the same engineered features. in this segment, plastic bottles are much more common than glass ones, and hence the impact of the bottle weight compared to the liquid weight is significantly smaller. we assume that this will reduce the importance of the bottle type feature in the prediction. a more problematic kind of beverage is liquor: although there are only a few different standard capacities, the bottle types vary so greatly that identifying a common type is almost impossible. one of the main challenges of our approach is determining the correct bottle types. using expert knowledge is a solid approach but cannot capture all exceptions, especially if a wine growing region has no predominant bottle type and uses mixed bottle types instead. additionally, many wine growers use bottle types that are not typical for their wine types because they want to stand out from other suppliers in order to get the customer's attention. assuming that all rieslings are sold in schlegel bottles, for example, is therefore not exactly true. one option could be to model hybrid bottles using a weighted average of the coefficients for each bottle type in use: if a region uses both burgunder and bordeaux bottles with about equal frequency, all products from this region could be assigned a hybrid bottle whose coefficients are computed as the mean of the respective coefficients. if a data set initially labeled with bottle types is available, preliminary simulations have shown that most bottle types can be predicted robustly using classification algorithms.
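a minimal sketch of the hybrid-bottle idea: the coefficients of the types in use are averaged with (optionally weighted) frequencies. the coefficient values below are placeholders, the real ones are those of table 3.

```python
def hybrid_coefficients(bottle_types, weights=None):
    """weighted average of the piecewise-model coefficients of several bottle
    types, e.g. for a region that uses burgunder and bordeaux bottles equally."""
    coeffs = {   # (c_base, c_shoulder, c_neck, r_neck) -- illustrative placeholder values
        "burgunder": (0.55, 0.30, 0.15, 0.40),
        "bordeaux":  (0.65, 0.20, 0.15, 0.35),
    }
    if weights is None:
        weights = [1.0 / len(bottle_types)] * len(bottle_types)
    return tuple(
        sum(w * coeffs[t][i] for t, w in zip(bottle_types, weights))
        for i in range(4)
    )

# a 50/50 hybrid bottle for a region without a predominant type
print(hybrid_coefficients(["burgunder", "bordeaux"]))
```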
the most promising strategy, in our opinion, is to learn the bottle types directly from product images, for example using deep neural nets. with regard to the ever-increasing online retail sector, web stores need to have pictures of their products on display, so the data is there to be used.

quality assurance is one of the key issues for modern production technologies. especially new production methods like additive manufacturing and composite materials require high-resolution 3d quality assurance methods. computed tomography (ct) is one of the most promising technologies to acquire material and geometry data non-destructively at the same time. with ct it is possible to digitalize subjects in 3d, also allowing their inner structure to be visualized. a 3d-ct scanner produces voxel data, comprising volumetric pixels that correlate with material properties; the voxel value (grey value) is approximately proportional to the material density. nowadays it is still common to analyse the data by manually inspecting the voxel data set, searching for and manually annotating defects. the drawback is that for high-resolution ct data this process is very time consuming and the result is operator-dependent. therefore, there is a high motivation to establish automatic defect detection methods. there are established methods for automatic defect detection using algorithmic approaches; however, these methods show a low reliability in several practical applications. at this point artificial neural networks come into play, which have already been implemented successfully in medical applications [1]. the most common networks developed for medical data segmentation are the u-net by ronneberger et al. [2] and the v-net by milletari et al. [3] and their derivates. these networks are widely used for segmentation tasks. fuchs et al. describe three different ways of analysing industrial ct data [4]. one of these contains a 3d-cnn; this cnn is based on the u-net architecture and is shown in their previous paper [5]. the authors enhance and combine the u-net and v-net architectures to build a new network for the examination of 3d volumes. in contrast, we investigate in our work how the networks introduced by ronneberger et al. and milletari et al. perform in industrial environments. furthermore, we investigate whether derivates of these architectures are able to identify small features in industrial ct data. industrial ct systems differ from medical ct systems not only in hardware design but also in the resulting 3d imaging data. voxel data from industrial parts differ from medical data in contrast level and resolution. state-of-the-art industrial ct scanners produce data sets one to two orders of magnitude larger than those of medical ct systems; the corresponding resolution is necessary to resolve small defects. medical ct scanners are optimised for a low x-ray dose for the patient, with photon energies typically up to 150 kev, whereas industrial scanners typically use energies up to 450 kev. in combination with the difference in the scan "object", the datasets differ significantly in size and image content. there are many different file formats to store volume data. some of them are mainly used in medical applications, like dicom [6], nifti or raw; in industrial applications vgl, raw and tiff are commonly used. also, depending on the format, it is possible to store the data slice-wise or as a complete volume stack.
industrial ct data, as mentioned in the previous section, has some differences to medical ct data. one aspect is the size of the features to be detected or learned by the neural network. our target is to find defects in industrial parts; as an example, we analyse pores in casting parts. these features may be very small, down to 1 to 7 voxels in each dimension. compared to the size of the complete data volume (typically larger than 512 x 512 x 512 voxels), the feature size is very small. the density difference between material and pores may be as low as 2% of the maximum grey value. thus, it is difficult to annotate the data even for human experts. the availability of real industrial data of good quality, annotated by experts, is very low, as most companies do not reveal their quality analysis data. training a neural network with such a small quantity of data is not possible. for medical applications, especially ai applications, several public datasets are available. yet these datasets are not always sufficient, and researchers are creating synthetic medical data [7]. therefore, we decided to create synthetic industrial ct data. another important reason for synthetic data is the quality of annotations done by human experts: consistency of results is not given across different experts. fuchs et al. have shown that training on synthetic data and predicting on real data leads to good results [4]. however, synthetic data may not reflect all properties of real data; some of the properties are not obvious, which may lead to ignoring some varieties in the data. in order to achieve high generalizability, we use a large number of synthetic samples mixed with a small number of real samples. to achieve this, we developed an algorithm which generates large amounts of data containing a large variation of the aspects needed to generalize a neural network. the variation includes material density, pore density, pore size, pore amount, pore shape and the size of the part. some samples can be learned easily because the pores are clearly visible inside the material; other samples are more difficult to learn because the pores are nearly invisible. this allows us to generate data with a wide variety so that the network can predict on different data. to train the neural networks, we can mix the real and synthetic data or use them separately. the real data was annotated manually by two operators. to create a dataset from this volume we sliced it into 64x64x64 blocks. only the blocks with a mean density greater than 50% of the grayscale range are used, to avoid too many empty volumes in the training data. another advantage of synthetic data is the class balance. we have two classes, where 0 corresponds to material and surrounding air and 1 to the defects. because of the size of the defects there is a high imbalance between the classes. by generating data with more features than in the real data, we could reduce the imbalance. reducing the size of the volumes to 64x64x64 also leads to a better balance between the size of the defects and the full volume. details of our dataset for training, evaluation and testing are shown in table 1. the synthetic data will not be recombined into a larger volume, as the blocks represent separate small components or full material units. the following two slices of real data (figure 1) and synthetic data (figure 2) with annotated defects show the conformity between the data.
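the following sketch illustrates how such synthetic 64x64x64 blocks with spherical pores and matching labels could be generated; the pore sizes, density contrast and noise level are illustrative assumptions and do not reproduce the exact parameters of the generator described above.

```python
import numpy as np

def synthetic_block(size=64, n_pores=10, material_density=0.8,
                    pore_contrast=0.02, noise=0.01, rng=None):
    """generate one 64x64x64 synthetic ct block: homogeneous material with
    randomly placed spherical pores whose grey value lies only a few percent
    below the material grey value. returns (volume, label) with label 1 for pores."""
    rng = np.random.default_rng(rng)
    volume = np.full((size, size, size), material_density, dtype=np.float32)
    label = np.zeros_like(volume, dtype=np.uint8)
    zz, yy, xx = np.mgrid[:size, :size, :size]
    for _ in range(n_pores):
        center = rng.integers(4, size - 4, size=3)
        radius = rng.uniform(0.5, 3.5)          # pore diameters of roughly 1-7 voxels
        mask = ((zz - center[0]) ** 2 + (yy - center[1]) ** 2
                + (xx - center[2]) ** 2) <= radius ** 2
        volume[mask] = material_density * (1.0 - pore_contrast)
        label[mask] = 1
    volume += rng.normal(0.0, noise, volume.shape).astype(np.float32)  # acquisition noise
    return volume, label

vol, lab = synthetic_block(rng=0)
print(vol.shape, int(lab.sum()), "pore voxels")
```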
hardware and software setup: deep learning (dl) consists of two phases, the training and the application. while dl models can be executed very fast, the training of the neural network can be very time-consuming, depending on several factors. one major factor is the hardware: the time consumed can be reduced by a factor of around ten when graphics cards (gpus) are used [8]. to cache the training data before it is passed to the model and computed on the gpu, a lot of random-access memory (ram) is used [9] [10] [11]. our system is built on dual-cpu hardware with 10 cores each running at 2.1 ghz, an nvidia titan rtx gpu with 24 gb of vram and 64 gb of regular ram. all measurements in this work concerning training and execution time are related to this hardware setup. the operating system is ubuntu 18.04 lts. anaconda is used for python package management and deployment. the dl framework is tensorflow 2.1 with keras as a submodule in python.

based on the 3d u-net [12] and 3d v-net [3] architectures compared by paichao et al. [13], we created modified versions which differ in the number of layers and their hyperparameters. due to the small size of our data, no patch division is necessary; instead the training is performed on the full volumes. we do not use the z-net enhancement proposed in their paper. the input size, determined by our data, is defined as 64x64x64x1, with the last dimension being the channel. the incoming data is normalized. as we have a binary segmentation task, our output activation is the sigmoid [14] function. following paichao et al. [13], the convolutional layers of our 3d u-nets have a kernel size of (3, 3, 3) and those of the 3d v-nets a kernel size of (5, 5, 5). as convolution activation function we use elu [14] [15], with he_normal [16] as kernel initialization [17]. the adam optimisation method [18] [19] is used with a starting learning rate of 0.0001 and a decay factor of 0.1, and the loss function is the binary cross-entropy [20]. figure 3 shows a sample 3d u-net architecture in which max pooling is used on the way down and transposed convolution on the way up. in the 3d v-net shown in figure 4, which is a fully convolutional neural network, the descent is done with a (2, 2, 2) convolution with a stride of 2 and the ascent with transposed convolution. it also has a layer-level addition of the input of each level to the output of the last convolution of the same level, as marked by the blue arrows. to adapt the shapes of the tensors for this addition, the down-convolution and the last convolution of the same level have to have the same number of kernel filters. our modified neural networks differ in the number of de-/ascending levels, the convolution filter kernel size and their hyperparameters, as shown in table 2. the convolutions on one level have the same number of filter kernels; after every down-convolution the number of filters is multiplied by 2 and on the way up divided by 2.

for training and evaluation of the neural networks, careful parameter selection is important. in table 3 the training conditions fitted to our system and networks are shown. we also take into account that different network architectures and numbers of layers perform better with different learning rates, batch sizes, etc. to evaluate our trained models, we mainly focus on the iou metric, also called the jaccard index, which is the intersection over union.
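a reduced sketch of such a 3d u-net in tensorflow/keras, together with an iou metric, is given below; the number of levels and filters is illustrative and does not reproduce the exact models of table 2.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(x, filters):
    """two 3x3x3 convolutions with elu activation and he_normal initialization."""
    for _ in range(2):
        x = layers.Conv3D(filters, (3, 3, 3), padding="same",
                          activation="elu", kernel_initializer="he_normal")(x)
    return x

def unet3d(input_shape=(64, 64, 64, 1), base_filters=16, levels=3):
    """encoder with max pooling, decoder with transposed convolutions and skip
    connections, sigmoid output for the binary pore segmentation."""
    inputs = layers.Input(input_shape)
    x, skips, filters = inputs, [], base_filters
    for _ in range(levels):                        # descending path
        x = conv_block(x, filters)
        skips.append(x)
        x = layers.MaxPooling3D((2, 2, 2))(x)
        filters *= 2
    x = conv_block(x, filters)                     # bottleneck
    for skip in reversed(skips):                   # ascending path
        filters //= 2
        x = layers.Conv3DTranspose(filters, (2, 2, 2), strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skip])
        x = conv_block(x, filters)
    outputs = layers.Conv3D(1, (1, 1, 1), activation="sigmoid")(x)
    return models.Model(inputs, outputs)

def iou(y_true, y_pred, threshold=0.5):
    """intersection over union (jaccard index) on the binarized prediction."""
    y_pred = tf.cast(y_pred > threshold, tf.float32)
    inter = tf.reduce_sum(y_true * y_pred)
    union = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) - inter
    return inter / (union + 1e-7)

model = unet3d()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=[iou])
```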
this metric is widely used for segmentation tasks and compares the intersection over union between the prediction and the ground truth for each voxel. the value of the iou ranges between 0 and 1, whereas the loss values range between 0 and infinity; the iou is therefore a much clearer indicator. an iou close to 1 indicates a high overlap between the prediction and the ground truth. our networks were trained for between 30 and 90 epochs until no more improvement could be achieved. both datasets consist of a similar number of samples, which means the epoch time is equivalent; one epoch took around 4 minutes. figure 5 shows the loss determined on the evaluation data. as described above, all models are trained on and evaluated against the synthetic dataset gdata and the mixed dataset mdata. in general, the loss achieved by all models is higher on mdata because the real data is harder to learn. a direct comparison is only possible between models with the same architecture. the iou metric is shown in figure 6, where the evaluation is sorted by the iou metric. if we compare the loss of unet-mdata with unet-gdata, which are nearly the same for mdata, with their corresponding iou (unet-mdata ~0.8 and unet-gdata ~0.93), we can see that a lower loss does not necessarily lead to a higher iou score. if only the loss and iou are considered, the unets tend to be better than the vnets. in conclusion, considering the iou metric for model selection, unet-gdata is the best performing model and vnet-gdata the least performing. (figure 5: the evaluation loss determined on the evaluation data, sorted from lowest to highest. figure 6: the evaluation iou determined on the evaluation data, sorted from lowest to highest.)

after the automatic evaluation, we show prediction samples of different models on real and synthetic data (table 4). rows 1 and 2 show the comparison between unet-gdata and vnet-gdata predicting on a synthetic test sample. the result of unet-gdata exactly hits the ground truth, whereas the vnet-gdata prediction has a 100% overlap with the ground truth but with surrounding false positive segmentations. in rows 3 and 4 both models predict the ground truth plus some false positive segmentations in the close neighbourhood. in rows 5 and 6 the prediction results of the same two models on real data are shown, taking into account that both models were not trained on real data. unet-gdata delivers a good precision with some false positive segmentations in the ground truth area and one additionally segmented defect; this shows that the model was able to find a defect which was missed by the expert. vnet-gdata shows a very high number of false positive segmentations.

in this paper, we have proposed a neural network approach to find defects in real and synthetic industrial ct volumes. we have shown that neural networks developed for medical applications can be adapted to industrial applications. to achieve high accuracy, we used a large variety of features in our data. based on the evaluation and a manual review of random samples, we have chosen the unet architecture for further research. this model achieved great performance on our real and synthetic datasets. in summary, this paper shows that artificial intelligence and neural networks can provide an important enrichment for industrial applications.
stress can affect all aspects of our lives, including our emotions, behaviors, thinking ability and physical health, making our society sick, both mentally and physically. among the effects that stress and anxiety can cause are heart diseases such as coronary heart disease and heart failure [5]. based on this, this research presents a proposal to help people handle stress using the benefits of technological development and to define patterns of stress states as a basis for proposing interventions, since the first step to controlling stress is to know its symptoms. stress symptoms are very broad and can be confused with other diseases; according to the american institute of stress [15] they include, for example, frequent headache, irritability, insomnia, nightmares, disturbing dreams, dry mouth, problems swallowing, increased or decreased appetite, and they can even cause other diseases such as frequent colds and infections. in view of the wide variety of symptoms caused by stress, this research intends to define, through physiological signals, the patterns generated by the body and obtained by wearable sensors, and to develop a standardized database to which machine learning can be applied. on the other hand, advances in sensor technology, wearable devices and mobile growth can help with online stress identification based on physiological signals and the delivery of psychological interventions. the current advancement of technology and the improvements in the wearable sensor area have made it possible to use these devices as a source of data to monitor the user's physiological state. the majority of wearable devices consist of low-cost boards that can be used for the acquisition of physiological signals [1, 10]. after the data are obtained, it is necessary to apply filters to clean the signal of noise and distortions, aiming to use machine learning approaches to model and predict stress states [2, 11]. the widespread use of mobile devices and microcomputers, such as the raspberry pi, and their capabilities present a great opportunity to collect and process those signals with an elaborate application. these devices can collect the physiological signals and detect specific stress states to generate interventions following a predetermined diagnosis based on the patterns already evaluated in the system [9, 6]. during the literature review, it became evident that few works are dedicated to comprehensively evaluating the complete biofeedback cycle, which comprises using the wearable devices, applying machine learning pattern detection algorithms, generating the psychological intervention, monitoring its effects and recording the history of events [9, 3]. stress is identified by professionals using human physiology, so wearable sensors could help with data acquisition and processing, through machine learning algorithms on biosignal data, suggesting psychological interventions. some works [6, 14] are dedicated to defining patterns in experiments for data acquisition simulating real situations. jebelli, khalili and lee [6] presented a deep learning approach that was compared with a baseline feedforward artificial neural network. schmidt et al. [12] describe wearable stress and affect detection (wesad), a public dataset used to build classifiers and identify stress patterns, integrating several sensor signals with the emotion aspect, with a precision of 93% in the experiments. the work of gaglioli et al.
[4] describes the main features and a preliminary evaluation of a free mobile platform for the self-management of psychological stress. in terms of wearables, some studies [13, 14] evaluate the usability of devices to monitor signals and the patient's well-being. pavic et al. [13] presented research on monitoring cancer patients remotely, since the majority of the patients have many symptoms but cannot stay at the hospital during the whole treatment. the authors emphasize that good results were obtained and that the system is viable, as long as the patient is not a critical case, as it does not replace medical equipment or the emergency care present in the hospital. henriques et al. [5] focused on evaluating the effects of biofeedback on a group of students to reduce anxiety; in that paper heart rate variability was monitored in two experiments with a duration of four weeks each. the work of wijman [8] describes the use of emg signals to identify stress in an experiment conducted with 22 participants, evaluating both the wearable signals and questionnaires.

in this section, the uniqueness of this research and the devices that were used are described. this solution was motivated by several literature studies about stress patterns and physiological aspects which reported few results. for this reason, our project addresses an experimental study protocol for signal acquisition from participants with wearables, data acquisition and processing, and subsequently machine learning modeling and prediction on biosignal data regarding stress (fig. 1). the protocol followed for the acquisition of signals across the different states is the trier social stress test (tsst) [7], recognized as the gold-standard protocol for stress experiments. the estimated total protocol time, involving pre-tests and post-tests, is 116 minutes with a total of thirteen steps, but the applied experiment was adapted and established with ten stages: initial evaluation: the participant arrives at the scheduled time and answers the questionnaires; habituation: a rest time of twenty minutes is taken before the pre-test to avoid the influence of prior events and to establish a safe baseline for that organism; pre-test: the sensors are attached (fig. 2), a saliva sample is collected and the psychological instruments are applied. the next step is the explanation of the procedure and preparation: the participant reads the instructions and the researcher ensures that he understands the job specifications; he is then sent to the room with the jurors (fig. 3), composed of two collaborators of the research who were trained to remain neutral during the experiment, not giving positive verbal or non-verbal feedback; free speech: after three minutes of preparation, the participant is requested to start his speech, being informed that he cannot use the notes.
this is followed by the arithmetic task: the jurors request an arithmetic task in which the participant must subtract mentally; at times the jurors interrupt and warn that the participant has made a mistake; post-test evaluation: the experimenter receives the subject outside the room for the post-test evaluations; feedback and clarification: the investigator and jurors talk to the subject and clarify what the task was about; relaxation technique: a recording is used with guidelines on how to perform a relaxation technique using only the breathing; final post-test: some of the psychological instruments are reapplied, saliva samples are collected, and the sensors continue to pick up the physiological signals. based on the literature [14] and the available wearable devices, the signals selected for analysis in an initial experiment are ecg, eda and emg. this experimental study protocol on data acquisition started with 71 participants; the data annotation of each step was done manually from the protocol experiment, followed by preprocessing of the data based on feature selection. in the machine learning step, the metrics of different algorithms such as decision tree, random forest, adaboost, knn, k-means and svm are evaluated. the experiment was conducted using the bitalino kit (plux wireless biosignals s.a., fig. 4), composed of an ecg sensor, which provides data on heart rate and heart rate variability; an eda sensor, which allows measuring the electrodermal activity of the sweat glands; and an emg sensor, which allows collecting the activity of the muscle signals.

this section describes the results of the pre-processing step and how it was carried out, covering the categorization and filtering of the data, the evaluation of the signal plausibility and the creation of a standardized database. the developed code is written in python due to the wide variety of libraries available; in this step the libraries numpy and pandas were used, both for data manipulation and analysis. in the first step it is necessary to read the files with the raw data and the timestamp; during this process the used channels are renamed to the name of the signal, because bitalino stores the data with the channel number as the name of each signal. next, the data timestamp is converted to a useful format in order to compare it with the annotations; after the time is converted, all unused channels are discarded to avoid unnecessary processing. the next step is to read the annotations taken manually during the experiment, as said before, to compare the time and classify each part of the experiment with its respective signal. after all signals are classified with their respective part of the tsst, the parts of the experiment are grouped into six categories, which are analyzed later. the first category is the "baseline", with just two parts of the experiment, representing the beginning of the experiment when the participants had just arrived. the second, called "tsst", comprises the period in which the participant spoke; the third category is the "arithmetic", with the data acquired in the arithmetic test. two further relevant categories are "post_test_sensors_1" and "post_test_sensors_2", with their respective signals from the parts of the same name. every other part of the experiment was categorized as "no_category"; this category is subsequently discarded since it is not needed in the machine learning stage.
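a condensed sketch of this preprocessing, including the category balancing described next, could look as follows; the channel mapping, file names and annotation format are assumptions for illustration and not the project's actual files.

```python
import pandas as pd

CHANNELS = {"A1": "ecg", "A2": "eda", "A3": "emg"}   # hypothetical channel mapping

def preprocess(raw_file, annotation_file):
    """label the raw bitalino signals with the tsst stages, group them into the
    six categories and balance the categories to the size of the smallest one."""
    df = pd.read_csv(raw_file).rename(columns=CHANNELS)
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df = df[["timestamp", "ecg", "eda", "emg"]]              # drop unused channels

    # annotations: one row per protocol stage with start/end time and category
    ann = pd.read_csv(annotation_file, parse_dates=["start", "end"])
    df["category"] = "no_category"
    for _, row in ann.iterrows():
        mask = (df["timestamp"] >= row["start"]) & (df["timestamp"] < row["end"])
        df.loc[mask, "category"] = row["category"]
    df = df[df["category"] != "no_category"]                 # discard unneeded segments

    # balance: truncate every category to the size of the smallest one
    smallest = df["category"].value_counts().min()
    df = df.groupby("category", group_keys=False).head(smallest)
    return df.drop(columns=["timestamp"])

# preprocess("participant_01_raw.csv", "participant_01_annotations.csv").to_csv(
#     "participant_01.csv", index=False)
```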
once the dataframe contains all signals properly classified, the columns with the participant number and the timestamp are removed from the dataframe. the next step is to evaluate the signal, to verify whether it is really useful for the machine learning process. for this, the signals are analyzed using the biosppy library, which performs the data filtering process and makes it possible to view the data. finally, the script checks the volume of data present in each category and returns the size of the smallest one. this is done because the categories were found to have different volumes of data, which would become a problem in the machine learning stage by offering more data from one category than from the others. for this reason, the code reduces the size of the other categories until all categories have the same number of rows; afterwards the dataframe is exported as a csv file to be read in the machine learning stage. the purpose of this article is to describe some stages of the development of a system for the acquisition and analysis of physiological signals to determine patterns in these signals that allow stress states to be detected. during the development of the project, it was verified that there are data gaps in the dataframe in the middle of the experiment for some participants; a hypothesis for why this happened is the sampling of the bitalino acquisition, which shows communication issues at some specific sampling rates. it is worth evaluating the results obtained when reducing this acquisition rate; however, it is necessary to carefully evaluate the extent to which the reduction in the sampling rate interferes with the results. during the evaluation of the plausibility of the signals, it was verified that there are evident differences between the signal patterns in the different stages of the process, thus validating the protocol followed in the acquisition of the data. the next step in this project is to implement the machine learning stage, applying different algorithms such as svm, decision tree, random forest, adaboost, knn and k-means, and to evaluate the results using metrics like accuracy, precision, recall and f1. the next steps of this research will support the confirmation of the hypothesis that patterns of physiological signals can be defined to detect stress states. from the definition of the patterns, a system can be built that acquires the signals and, in real time, analyzes these data based on the machine learning results. in this way the state of the person can be detected, and the psychologist can propose an intervention and monitor whether the stress is decreasing.

technological developments have been influencing all kinds of disciplines by transferring more competences from human beings to technical devices. the steps include [1]: 1. tools: transfer of mechanics (material) from the human being to the device; 2. machines: transfer of energy from the human being to the device; 3. automatic machines: transfer of information from the human being to the device; 4. assistants: transfer of decisions from the human being to the device. with the introduction of artificial intelligence (ai), in particular its latest developments in deep learning, we let the system (in step 4) take over our decisions and creation processes.
thus, tasks and disciplines that were exclusively reserved for humans in the past can now co-exist with the machine or even take the human out of the loop. it is no wonder that this transformation does not stop at disciplines such as engineering, business and agriculture but also affects the humanities, art and design. every new technology has been adopted for artistic expression, as the many wonderful examples in media art show. therefore, it is not surprising that ai is being established as a novel tool to produce creative content of any form. however, in contrast to other disruptive technologies, ai seems particularly challenging to accept in the area of art because it offers capabilities we once thought only humans were able to perform: the art is no longer done by artists using new technology to perform their art, but by the machine itself, without the need for a human to intervene. the question of "what is art" has always been an emotionally debated topic in which everyone has a slightly different definition depending on his or her own experiences, knowledge base and personal aesthetics. however, there seems to be a broad consensus that art requires human creativity and imagination, as, for instance, stated by the oxford dictionary: "the expression or application of human creative skill and imagination, typically in a visual form such as painting or sculpture, producing works to be appreciated primarily for their beauty or emotional power." every art movement challenges old ways and uses artistic creative abilities to spark new ideas and styles. with each art movement, diverse intentions and reasons for creating the artwork came along with critics who did not want to accept the new style as an art form. with the introduction of ai into the creation process, another art movement is trying to establish itself which fundamentally changes the way we see art. for the first time, ai has the potential to take the artist out of the loop, leaving humans only in the positions of curators, observers and judges who decide whether the artwork is beautiful and emotionally powerful. while there is a strong debate going on in the arts about whether creativity is profoundly human, we investigate how ai can foster inspiration and creativity and produce unexpected results. many publications have shown that ai can generate images, music and the like which resemble different styles and produce artistic content. for instance, elgammal et al. [2] have used generative adversarial networks (gans) to generate images by learning about styles and deviating from style norms. the promise of ai-assisted creation is "a world where creativity is highly accessible, through systems that empower us to create from new perspectives and raise the collective human potential", as roelof pieters and samim winiger pointed out [3]. to get a better understanding of how ai is capable of proposing images, music, etc., we have to open the black box and investigate where and how the magic is happening. random variations in the image space (sometimes also referred to as pixel space) usually do not lead to any interesting result, because semantic knowledge cannot be applied. therefore, methods need to be applied which constrain the possible variations of the given dataset in a meaningful way. this can be realized by generative design or procedural generation, which is applied to generate geometric patterns, textures, shapes, meshes, terrain or plants.
the generation processes may include, but are not limited to, self-organization, swarm systems, ant colonies, evolutionary systems, fractal geometry and generative grammars. mccormack et al. [4] review some generative design approaches and discuss how art and design can benefit from those applications. these generative algorithms, which are usually realized by writing program code, are very limited; ai can turn this process into data-driven procedures. ai, or more specifically artificial neural networks, can learn patterns from (labeled) examples or by reinforcement. before an artificial neural network can be applied to a task (classification, regression, image reconstruction), the general architecture is to extract features through many hidden layers. these layers represent different levels of abstraction: data that have a similar structure or meaning should be represented as data points that are close together, while divergent structures or meanings should be further apart from each other. to convert the image back (with some conversion/compression loss) from the low-dimensional vector, which is the result of the first component, to the original input, an additional component is needed. together they form the autoencoder, which consists of the encoder and the decoder. the encoder compresses the data from a high-dimensional input space to a low-dimensional space, often called the bottleneck layer; the decoder then takes this encoded input and converts it back to the original input as closely as possible. the latent space is the space in which the data lies in the bottleneck layer. looking at figure 1, you might wonder why a model is needed that converts the input data into a "close as possible" output; it seems rather useless if all it outputs is itself. as discussed, however, the latent space contains a highly compressed representation of the input data, which is the only information the decoder can use to reconstruct the input as faithfully as possible. the magic happens by interpolating between points and performing vector arithmetic between points in latent space; these transformations result in meaningful effects on the generated images. as dimensionality is reduced, information which is distinct to each image is discarded from the latent space representation, since only the most important information of each image can be stored in this low-dimensional space. the latent space captures the structure in the data and usually offers some semantically meaningful interpretation. this semantic meaning is, however, not given a priori but has to be discovered. as already discussed, autoencoders, after learning a particular non-linear mapping, are capable of producing photo-realistic images from randomly sampled points in the latent space. the latent space concept is definitely intriguing but at the same time non-trivial to comprehend. although latent means hidden, understanding what is happening in latent space is not only helpful but necessary for various applications. exploring the structure of the latent space is both interesting for the problem domain and helps to develop an intuition for what has been learned and can be regenerated. it is obvious that the latent space has to contain some structure that can be queried and navigated. however, it is non-obvious how semantics are represented within this space and how different semantic attributes are entangled with each other.
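a minimal dense autoencoder sketch in keras illustrates the encoder / bottleneck / decoder structure described above; the layer sizes and the flattened input dimension are arbitrary assumptions and do not correspond to any specific model discussed here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_autoencoder(input_dim=4096, latent_dim=32):
    """encoder compresses flattened images to a low-dimensional latent code,
    decoder reconstructs the input from that code."""
    encoder = models.Sequential([
        layers.Dense(512, activation="relu", input_shape=(input_dim,)),
        layers.Dense(latent_dim, name="bottleneck"),        # the latent space
    ], name="encoder")
    decoder = models.Sequential([
        layers.Dense(512, activation="relu", input_shape=(latent_dim,)),
        layers.Dense(input_dim, activation="sigmoid"),      # reconstruction
    ], name="decoder")
    autoencoder = models.Sequential([encoder, decoder], name="autoencoder")
    autoencoder.compile(optimizer="adam", loss="mse")       # reconstruct the input as closely as possible
    return encoder, decoder, autoencoder

encoder, decoder, autoencoder = build_autoencoder()
autoencoder.summary()
```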
to investigate the latent space one should favor a dataset that offers a limited and distinctive feature set. faces are a good example in this regard because they share features common to most faces but offer enough variance. if aligned correctly, other meaningful representations of faces are also possible; see for instance the widely used approach of eigenfaces [5] to describe the specific characteristics of faces in a low-dimensional space. in the latent space we can do vector arithmetic, and this can correspond to particular features. for example, the vector representing the face of a smiling woman minus the vector representing a neutral looking woman plus the vector representing a neutral looking man results in a vector representing a smiling man. this can also be done with all kinds of images; see e.g. the publication by radford et al. [6], who first observed the vector arithmetic property in latent space. a visual example is given in figure 2. please note that all images shown in this publication are produced using biggan [7]; the photo of the author on which most of the variations are based was taken by tobias schwerdt. semantic editing requires moving within the latent space along a certain 'direction'. identifying the 'direction' of only one particular characteristic is non-trivial, since editing one attribute may affect others because they are correlated. this correlation can be attributed to some extent to pre-existing correlations in 'the real world' (e.g. old persons are more likely to wear eyeglasses) or to bias in the training dataset (e.g. more women are smiling on photos than men). to identify the semantics encoded in the latent space, shen et al. proposed a framework for interpreting faces in latent space [8]. beyond the vector arithmetic property, their framework allows decoupling some entangled attributes (remember the aforementioned correlation between old people and eyeglasses) through linear subspace projection. shen et al. found that in their dataset pose and smile are almost orthogonal to other attributes, while gender, age and eyeglasses are highly correlated with each other. disentangled semantics enable precise control of facial attributes without retraining the given model. in our examples, in figures 3 and 4, faces are varied according to gender or age. it has been widely observed that when linearly interpolating between two points in latent space, the appearance of the corresponding synthesized images 'morphs' continuously from one face to another; see figure 5. this implies that the semantic meaning contained in the two images also changes gradually, in stark contrast to a simple fade between two images in image space. it can be observed that the shape and style slowly transform from one image into the other, which demonstrates how well the latent space captures the structure and semantics of the images. other examples are given in section 3. even though our analysis has focused on face editing for the reasons discussed earlier, the same holds true for other domains. for instance, bau et al. [9] generated living rooms using similar approaches and showed that some units from intermediate layers of the generator are specialized to synthesize certain visual concepts such as sofas or tvs.
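both operations are plain vector algebra on the latent codes. a small numpy sketch is given below; the latent dimensionality is an assumption and the decoder/generator that turns each latent point into an image is omitted.

```python
import numpy as np

def latent_arithmetic(z_smiling_woman, z_neutral_woman, z_neutral_man):
    """vector arithmetic in latent space: removing the 'neutral woman' component
    from a smiling woman and adding a neutral man should yield a smiling man."""
    return z_smiling_woman - z_neutral_woman + z_neutral_man

def interpolate(z_a, z_b, steps=8):
    """linear interpolation between two latent points; decoding each step
    morphs one face into the other instead of simply cross-fading pixels."""
    alphas = np.linspace(0.0, 1.0, steps)
    return [(1.0 - a) * z_a + a * z_b for a in alphas]

d = 128                                        # assumed latent dimensionality
rng = np.random.default_rng(0)
z1, z2 = rng.standard_normal(d), rng.standard_normal(d)
path = interpolate(z1, z2)                     # feed each point to the decoder/generator
```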
so far we have discussed how autoencoders can connect the latent space and the image semantic space, as well as how the latent code can be used for image editing without influencing the image style. next, we want to discuss how this can be used for artistic expression. while in the former section we have seen how to use manipulation in the latent space to generate mathematically sound operations, not much artistic content has been generated, just variations of photography-like faces. imprecision in ai systems can lead to unacceptable errors and even result in deadly decisions, e.g. in autonomous driving or cancer treatment. in the case of artistic applications, errors or glitches might lead to interesting, non-intended artifacts; whether they are treated as a bug or a feature lies in the eye of the artist. to create higher variations in the generated output, some artists randomly introduce glitches within the autoencoder. due to the complex structure of the autoencoder, these glitches (assuming that they are introduced at an early layer of the network) occur on a semantic level, as already discussed, and might cause the models to misinterpret the input data in interesting ways. some could even be interpreted as glimpses of autonomous creativity; see for instance the artistic work 'mistaken identity' by mario klingemann [10]. so far the latent space is explored by humans, either by random walk or by intuitively steering in a particular direction. it is up to human judgment whether the synthesized image of a particular location in latent space produces a visually appealing or otherwise interesting result. the question arises where to find those places and whether they can be spotted by an automated process. the latent space is usually defined as a space of d dimensions in which the data is assumed to follow a multivariate gaussian distribution n(0, i_d) [11]; the mean representation of all images therefore lies in the center of the latent space. but what does that mean for the generated results? it is said that "beauty lies in the eyes of the beholder", yet research shows that there is a common understanding of beauty; for instance, averaged faces are perceived as more beautiful [12]. adapting these findings to the latent space, let us assume that the most beautiful images (in our case faces) can be found in the center of the space. particular deviations from the center stand for local sweet spots (e.g. female and male, ethnic groups). these types of sweet spots can be found by common means of data analysis (e.g. clustering), as sketched below. but where are the interesting local sweet spots when it comes to artistic expression? figure 6 demonstrates some variation in style within the latent space. of course, one can search for locations in the latent space where particular artworks from a given artist or art styles are located; see e.g. figure 7, where the styles of different artists, as well as white noise, have been used for adoption. but isn't lingering around these sweet spots only producing "more of the same"? how can we find the local sweet spots that define a new art style and can be deemed truly creative? or do those discoveries of new art styles lie outside the latent space, because the latent space is trained on a particular set of defined art styles and can therefore produce only interpolations of those styles but nothing conceptually new?
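a small sketch of such a sweet-spot search by clustering: in practice one would cluster the latent codes of encoded training images; the random samples drawn from the assumed n(0, i_d) distribution serve here only as a stand-in, and the cluster count is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

d, n = 128, 10_000                               # assumed latent dimensionality and sample count
rng = np.random.default_rng(0)
z = rng.standard_normal((n, d))                  # stand-in for encoded training images

# cluster centers mark candidate 'sweet spots' to decode and inspect visually
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(z)
sweet_spots = kmeans.cluster_centers_            # decode these points to images
center = z.mean(axis=0)                          # the global mean, close to the origin
print(np.linalg.norm(sweet_spots - center, axis=1))  # distance of each sweet spot from the center
```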
so far we have discussed how ai can help to generate different variations of faces and where to find visually interesting sweet spots. in this section, we want to show how ai supports the creation process by applying the discussed techniques to other areas of image and object processing. probably the most popular approach, at least judging by the mass media, is image-to-image translation in its different variations. the most prominent example is style transfer, the capability to transfer the style of one image to draw the content of another (examples are shown in figure 7). but mapping an input image to an output image is also possible for a variety of other applications such as object transfiguration (e.g. horse-to-zebra, apple-to-orange), season transfer (e.g. summer-to-winter) or photo enhancement [13]. while some of the just mentioned systems are not yet in a state to be widely applicable, ai tools are taking over and gradually automating design processes which used to be time-consuming manual processes. indeed, the greatest potential for ai in art and design is seen in its application to tedious, uncreative tasks such as coloring black-and-white images [14]. marco kempf and simon zimmerman used ai in their work dubbed 'deepworld' to generate a compilation of 'artificial countries', using data of all existing countries (around 195) to generate new anthems, flags and other descriptors [15]. roman lipski uses an ai muse (developed by florian dohmann et al.) to foster his inspiration [16]. because the ai muse is trained only on the artist's previous drawings and fed with the current work in progress, it suggests image variations in line with roman's taste. cluzel et al. have proposed an interactive genetic algorithm to progressively sketch the desired side view of a car profile [17]; here the user takes on the role of the fitness function through interaction with the system. the chair project [18] is a series of four chairs co-designed by ai and human designers. the project explores a collaborative creative process between humans and computers: it used a gan to propose new chairs, which were then 'interpreted' by trained designers to resemble a chair. deep-wear [19] is a method using deep convolutional gans for clothes design. the gan is trained on features of brand clothes and can generate images that are similar to actual clothes; a human interprets the generated images and tries to manually draw the corresponding pattern which is needed to make the finished product. li et al. [20] introduced an artificial neural network for encoding and synthesizing the structure of 3d shapes which, according to their findings, are effectively characterized by their hierarchical organization. german et al. [21] have applied different ai techniques, trained on a small sample set of bottle shapes, to propose novel bottle-like shapes. the evaluation of their proposed methods revealed that they can be used by trained designers as well as non-designers to support the design process in different phases and that they could lead to novel designs not intended or foreseen by the designers. for decades, ai has fostered (often false) future visions ranging from transhumanist utopia to a "world run by machines" dystopia. artists and designers explore solutions concerning the semiotic, the aesthetic and the dynamic realm, as well as confronting corporate, industrial, cultural and political aspects.
the relationship between the artist and the artwork is directly connected through their intentions, although currently mediated by third parties and media tools. understanding the ethical and social implications of ai-assisted creation is becoming a pressing need. these implications, each of which has to be investigated in more detail in the future, include:

- bias: ai systems are sensitive to bias. as a consequence, the ai is not a neutral tool but has pre-coded preferences. biases relevant in creative ai systems are: • algorithmic bias, which occurs when a computer system reflects the implicit values of the humans who created it, e.g. when the system is optimized on dataset a and later retrained on dataset b without reconfiguring the neural network (this is not uncommon, as many people do not fully understand what is going on in the network but are able to use the given code to run training on other data); • data bias, which occurs when the samples are not representative of the population of interest; • prejudice bias, which results from cultural influences or stereotypes reflected in the data.

- art crisis: until 200 years ago, painting served as the primary method for visual communication and was a widely and highly respected art form. with the invention of photography, painting began to suffer an identity crisis, because painting, in its then current form, was not able to reproduce the world as accurately and with as little effort as photography. as a consequence, visual artists had to turn to forms of representation not possible with photography, inventing art styles such as impressionism, expressionism, cubism, pointillism, constructivism, surrealism, up to abstract expressionism. now that ai can perfectly simulate those styles, what will happen to the artists? will artists still be needed, will they be replaced by ai, or will they have to turn to other artistic work which cannot yet be simulated by ai?

- inflation: similar to the flood of images which has already reached us, the same can happen with ai art; because of the glut, nobody values or looks at the images anymore.

- wrong expectations: only aesthetically appealing or otherwise interesting or surprising results are published, which can be attributed to effects similar to the well-known publication bias [22] in other areas. eventually, this leads to wrong expectations of what is already possible with ai. in addition, this misunderstanding is fueled by content claimed to be created by ai but which has indeed been produced, or at least reworked, either by human labor or by methods not containing ai.

- unequal judgment: even though the emotions raised by viewing artworks emerge from the underlying structure of the works, people also include the creation process in their judgment (in the cases where they know about it). frequently, upon learning that a computer or an ai has created the artwork, people find it boring, without guts, emotion or soul, while before it was inspiring, creative and beautiful.

- authorship: the authorship of ai-generated content has not been clarified. for instance, does the authorship of a novel song composed by an ai trained exclusively on songs by johann sebastian bach belong to the ai, the developer/artist, or bach? see e.g. [23] for a more detailed discussion.

- trustworthiness: new ai-driven tools make it easy for non-experts to manipulate audio and/or visual media. thus, image, audio as well as video evidence is not trustworthy anymore.
manipulated images, audio and video lead to fake information, truth skepticism, and claims that real audio/video footage is fake (known as the liar's dividend) [24]. the potential of ai in creativity has only started to be explored. we have investigated the creative power of ai, which is represented, though not exclusively, in the semantically meaningful representation of data in a dimensionally reduced space, dubbed latent space, from which images, but also audio, video and 3d models, can be synthesized. ai is able to imagine visualizations that lie between everything the ai has learned from us and far beyond, and might even develop its own art styles (see e.g. deep dream [25]). however, ai still lacks intention and is just processing data. these novel ai tools are shifting the creative process from crafting to generating and selecting, a process which cannot yet be transferred to machine judgment only. however, ai can already be employed to find possible sweet spots or make suggestions based on the learned taste of the artist [21]. ai is without any doubt changing the way we experience art and the way we make art. doing art is shifting from handcrafting to exploring and discovering. this leaves humans more in the role of a curator instead of an artist, but it can also foster creativity (as discussed before in the case of roman lipski) and reduce the time between intention and realization. it has the potential, just as many other technical developments, to democratize creativity, because handcrafting skills are no longer as necessary to express one's own ideas. widespread misuse (e.g. image manipulation to produce fake pornography) can limit the social acceptance and requires ai literacy. as human beings, we have to ask ourselves whether feelings are wrong just because the ai never felt during its creation process the way we do. or should we not worry too much and simply enjoy the new artworks, no matter whether they are created by humans, by ai or as a co-creation between the two?

the project described in [1] aims to design and implement a machine learning system for generating prediction models with respect to quality checks and for reducing faulty products in manufacturing processes. it is based on an industrial case study in cooperation with sick ag. we present first results of the project concerning a new process model for cooperating data scientists and quality engineers, a product testing model as a knowledge base for machine learning computing, and visual support for quality engineers in order to explain prediction results. a typical production line consists of various test stations that conduct several measurements. those measurements are processed by the system on the fly to point out problematic products. among the many challenges, one focus of the project is support for quality engineers. the preparation of prediction models is usually done by data scientists, but the demand for data scientists increases too fast when a large number of products, production lines and changing circumstances have to be considered. hence, software is needed which quality engineers can operate directly to leverage the results from prediction models. based on quality management and data science standard processes [2] [3], we created a reference process model for production error detection and correction which includes the needed actors and their associated tasks. with the ml system and data scientist assistance we support the quality engineer in his work.
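a hedged sketch of such a prediction model on test-station measurements is shown below; the file name, column names and decision threshold are placeholders for illustration and do not reflect the project's actual data or setup.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# hypothetical table of test-station measurements with a pass/fail label per product
df = pd.read_csv("test_station_measurements.csv")           # assumed file and columns
X = df.drop(columns=["product_id", "faulty"])
y = df["faulty"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# flag products whose predicted fault probability exceeds a threshold
prob = model.predict_proba(X_test)[:, 1]
flagged = X_test[prob > 0.5]
print(f"{len(flagged)} of {len(X_test)} products flagged for inspection")
```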
to support the ml system, we developed a product testing model which includes crucial information about a specific product. in this model we describe the relations to product-specific features, test systems, production line sequences, etc. the idea behind this is to provide metadata information which in turn is used by the ml system instead of individual script solutions for each product. an ml model with good predictions often lacks information about its internal decisions. therefore, it is beneficial to support the quality engineer with useful feature visualizations. by default, we provide the quality engineer with 2d and 3d feature plots and histograms in which the error distribution is visualized. on top of this, we developed further feature importance measures based on shap values [4]. these can be used to gain deeper insight into particular ml decisions and into significant features which are ranked lower by standard feature importance measures.

medicine is a highly empirical discipline, where important aspects have to be demonstrated using adequate data and sound evaluations. this is one of the core requirements which were emphasized during the development of the medical device regulation (mdr) of the european union (eu) [1]. this applies to all medical devices, including mechanical and electrical devices as well as software systems. also, the us food & drug administration (fda) recently set a focus on the discussions about using data for demonstrating the safety and efficacy of medical devices [2]. besides pure approval steps, they foster the use of data for the optimization of the products, as nowadays more and more data can be acquired using modern it technology. in particular, they pursue the use of real-world evidence, i.e. data that is collected throughout the lifetime of a device, for demonstrating improved outcomes [2]. such approaches require the use of sophisticated data analysis techniques. besides classical statistics, artificial intelligence (ai) and machine learning (ml) are considered to be powerful techniques for this purpose and currently gain more and more attention. these techniques allow dependencies to be detected in complex situations where inputs and/or outputs of a problem have high-dimensional parameter spaces. this can, e.g., be the case when extensive data is collected from diverse clinical studies or from treatment protocols at local sites. furthermore, ai/ml based techniques may be used in the devices themselves. for example, devices may be developed which are considered to improve complex diagnostic tasks or to find individualized treatment options for specific medical conditions (see e.g. [3, 4] for an overview). for some applications, it has already been demonstrated that ml algorithms are able to outperform human experts with respect to specific success rates (e.g. [5, 6]). in this paper, it will be discussed how ml based techniques can be brought onto the market, including an analysis of the appropriate regulatory requirements. for this purpose, the main focus lies on ml based devices applied in the intensive care unit (icu), as e.g. proposed in [7, 8]. the need for specific regulatory requirements comes from the observation that ai/ml based techniques pose specific risks which need to be considered and handled appropriately. for example, ai/ml based methods are more challenging w.r.t. bias effects, reduced transparency, vulnerability to cybersecurity attacks, or general ethical issues (see e.g. [9, 10]).
in particular cases, ml based techniques may lead to noticeably critical results, as has been shown for the ibm watson for oncology device. in [11], it was reported that the direct use of the system in particular clinical environments resulted in critical treatment suggestions. the characteristics of ml based systems led to various discussions about their reliability in the clinical context. appropriate ways have to be found to guarantee their safety and performance (cf. [12]). this applies to the field of medicine / medical devices as well as to ai/ml based techniques in general. the latter was e.g. approached by the eu in their ethics guidelines for trustworthy ai [9]. driven by this overall development, the fda started a discussion regarding an extended use of ml algorithms in samd (software as a medical device) with a focus on quicker release cycles. in [13], it pursued the development of a specific process which makes it easier to bring ml based devices onto the market and also to update them during their lifecycle. current regulations for medical devices, e.g. in the us or eu, do not provide specific guidelines for ml based devices. in particular, this applies to systems which continuously collect data in order to improve the performance of the device. current regulations focus on a fixed status of the device, which may only be adapted to a minor extent after the release. usually, a new release or clearance by the authority is required when the clinical performance of a device is modified. but continuously learning systems want to perform exactly such improvement steps, using additional real-world data from daily applications without extra approvals (see fig. 1, which contrasts the two basic approaches for ai/ml based medical devices: on the left, the classical approach, where the status of the software has to be fixed after the release / approval stage; on the right, a continuously learning system where data is collected during the lifetime of the device without a separate release / approval step, so that an automatic validation step has to guarantee proper safety and efficacy). in [13], the fda made suggestions for how this could be addressed. it proposed the definition of so-called samd pre-specifications (sps) and an algorithm change protocol (acp), which are considered to represent major tools for dealing with modifications of the ml based system during its lifetime. within the sps, the manufacturer has to define the anticipated changes which are considered to be allowed during the automatic update process. in addition, the acp defines the particular steps which have to be implemented to realize the sps specifications. see [13] for more information about sps and acp. the details, however, are not yet well elaborated by the fda, which has requested suggestions in this respect. in particular, these tools serve as a basis for performing an automated validation of the updates. the applicability of this approach depends on the risk of the samd. in [13], the fda uses the risk categories from the international medical device regulators forum (imdrf) [14]. this includes the categories state of healthcare situation or condition (critical vs. serious vs. non-critical) and significance of information provided by the samd to the healthcare decision (treat or diagnose vs. drive clinical management vs. inform clinical management) as the basic attributes.
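as an illustration only, this two-attribute imdrf categorization might be represented in software as follows; the type names and the example device description are assumptions and not part of [13] or [14].

```python
# minimal sketch of the two imdrf risk attributes used in [13]
# (names and the example are illustrative assumptions)
from dataclasses import dataclass
from enum import Enum

class HealthcareCondition(Enum):
    CRITICAL = "critical"
    SERIOUS = "serious"
    NON_CRITICAL = "non-critical"

class InformationSignificance(Enum):
    TREAT_OR_DIAGNOSE = "treat or diagnose"
    DRIVE_CLINICAL_MANAGEMENT = "drive clinical management"
    INFORM_CLINICAL_MANAGEMENT = "inform clinical management"

@dataclass
class SamdRiskCategory:
    condition: HealthcareCondition
    significance: InformationSignificance

# the icu monitoring scenario discussed below: critical condition,
# alarms that drive clinical management
icu_monitoring = SamdRiskCategory(
    HealthcareCondition.CRITICAL,
    InformationSignificance.DRIVE_CLINICAL_MANAGEMENT,
)
```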
according to [13], the regulatory requirements for the management of ml based systems are considered to depend on this classification as well as on the particular changes which may take place during the lifetime of the device. the fda categorizes them as changes in performance, inputs, and intended use. such anticipated changes have to be defined in the sps in advance. the main purpose of the present paper is to discuss the validity of the described fda approach for enabling continuously learning systems. to this end, it uses a scenario based technique to analyze whether validation in terms of sps and acp can be considered an adequate tool. the scenarios represent applications of ml based devices in the icu. the paper checks the consistency of the approach with other important regulatory requirements and analyzes pitfalls which may jeopardize the safety of the devices. additionally, it discusses whether more general requirements can be sufficiently addressed in the scenarios, as e.g. proposed in ethical guidelines for ai based systems like [9, 10]. this is not considered a comprehensive analysis of the topics, but an addition to current discussions about risks and ethical issues, as they are e.g. discussed in [10, 12]. finally, the paper proposes its own suggestions to address the regulation of continuously learning ml based systems. again, this is not considered to be a full regulatory strategy, but a proposal of particular requirements which may overcome some of the current limitations of the approach discussed in [13]. the overall aim of this paper is to contribute to a better understanding of the options and challenges of ai/ml based devices on the one hand and to enable the development of best practices and appropriate regulatory strategies in the future on the other hand. within this paper, the analysis of the fda approach proposed in [13] is performed using specific reference scenarios from icu applications, which are particularly taken from [13] itself. the focus lies on ml based devices which allow continuous updates of the model according to data collected during the lifetime of the device. in this context, sps and acp are considered as crucial steps which allow an automated validation of the device based on specified measures. in particular, the requirements and limitations of such an automated validation are analyzed and discussed, including the following topics and questions: is automated validation reasonable for these cases? what are limitations and potential pitfalls of such an approach when applied in the particular clinical context? which additional risks could apply to ai/ml based samd in general, which go beyond the existing discussions in the literature as e.g. presented in [9, 10, 12]? how should such issues be taken into account in the future? what could be appropriate measures and best practices to achieve reliability? the following exemplary scenarios are used for this purpose. base scenario icu: an ml based intensive care unit (icu) monitoring system where the detection of critical situations (e.g. regarding physiological instability, potential myocardial infarcts, or sepsis) is addressed by using ml. using auditory alarms, the icu staff is informed to initiate appropriate measures to treat the patients in these situations. this scenario addresses a 'critical healthcare situation or condition' and is considered to 'drive clinical management' (according to the risk classification used in [13]).
modification "locked": icu scenario as presented above, where the release of the monitoring system is done according to a locked state of the algorithm. modification "cont-learn": icu scenario as presented above, where the detection of alarm situations is continuously improved according to data acquired during daily routine, including adaptation of performance to sub-populations and/or characteristics of the local environment. in this case, scs and acp have to define standard measures like success rates of alarms/detection and requirements for the management of data, update of the algorithm, and labeling. more details of such requirements are discussed later. this scenario was presented as scenario 1a in [13] with minor modifications. this section provides the basic analysis of the scenarios according to the particular aspects addressed in this paper. it addresses the topics automated validation, man-machine interaction, explainability, bias effects, and confounding, fairness and non-discrimination as well as corrective actions to systematic deficiencies. according to standard regulatory requirements [1, 15, 16] , validation is a core step in the development and for the release of medical devices. according to [17] , a change in performance of a device (including an algorithm in a samd) as well as a change in particular risks (e.g. new risks, but also new risk assessment or new measures) usually triggers a new premarket notification (510(k)) for most of the devices which get onto the market in the us. thus, such situations require an fda review for clearance of the device. for samd, this requires to include an analytical evaluation, i.e. correct processing of input data to generate accurate, reliable, and precise output data. additionally, a clinical validation as well as the demonstration of a valid clinical association need to be provided. [18] this is intended to show that the outputs of the device appropriately work in the clinical environment, i.e. have a valid association regarding the targeted clinical condition and achieve the intended purpose in the context of clinical care. [18] thus, based on the current standards, a device with continuously changing performance usually requires a thorough analysis regarding its validity. this is one of the main points, where [13] proposes to establish a new approach for the "cont-learn" cases. as already mentioned, sps and acp basically have to be considered as tools for automated validation in this context. within this new approach, the manual validation step is replaced by an automated process with only reduced or even no additional control by a human observer. thus, it may work as an automated of fully automatic, closed loop validation approach. the question is whether this change can be considered as an appropriate alternative. in the following, this question is addressed using the icu scenario with a main focus on the "cont-learn" case. some of the aspects also apply to the "locked" cases. but the impact is considered to be higher in the "cont-learn" situation, since the validation step has to be performed in an automated fashion. human oversight, which is usually considered important, is not included here during the particular updates. within the icu scenario, the validation step has to ensure that the alarm rates stay on a sufficiently high level, regarding standard factors like specificity, sensitivity, area under curve (auc), etc. 
basically, these are technical parameters which can be analyzed according to an analytical evaluation as discussed above (see also [18]). this could also be applied to situations where continuous updates are made during the lifecycle of the device, i.e. in the "cont-learn" case. however, there are some limitations of the approach. on the one hand, it has to be ensured that this analysis is sound and reliable, i.e. that it is not compromised by statistical effects like bias or other deficiencies in the data. on the other hand, it has to be ensured that the success rates really have a valid clinical association and can be used as a sole criterion for measuring the clinical impact. thus, the relationship between pure success rates and clinical effects has to be evaluated thoroughly, and there may be some major limitations. one major question in the icu scenario is whether better success rates really guarantee a higher or at least sufficient level of clinical benefit. this is not innately given. for example, a higher success rate of the alarms may still have a negative effect when the icu staff relies more and more on the alarms and subsequently reduces attention. thus, it may be the case that the initiation of appropriate treatment steps is compromised even though the actually occurring alarms seem to be more reliable. in particular, this may apply in situations where the algorithms are adapted to local settings, as in the "cont-learn" scenario. here, the ml based system is intended to be optimized for sub-populations in the local environment or for specific treatment preferences at the local site. due to habituation effects, the staff's expectations become aligned with the algorithm's behavior to a certain degree after a period of time. but when the algorithm changes, or when an employee from another hospital or department takes over duties in the local unit, the reliability of the alarms may be affected. in these cases, it is not clear whether the expectations are well aligned with the current status of the algorithm, either in the positive or in the negative direction. since the data updates of the device are intended to improve its performance w.r.t. detection rates, it is clear that significant effects on user interaction may happen. under some circumstances, the overall outcome in terms of the clinical effect may be impaired. the evaluation of such risks has to be addressed during validation. it is questionable whether this can be performed by an automatic validation approach which focuses on alarm rates but does not include an assessment of the associated risks. at least a clear relationship between these two aspects has to be demonstrated in advance. it is also unclear whether this could be achieved by the assessment of purely technical parameters which are defined in advance as required by the sps and acp. usually, ml based systems are trained for a specific scenario. they provide a specific solution for this particular problem. but they do not have a more general intelligence and cannot reason about potential risks which were not under consideration at that point in time. such a more general intelligence can only be provided by human oversight. in general, it is not clear whether technical aspects like alarms lead to valid reactions by the users. in technical terms, alarm rates are basically related to the probability of occurrence of specific hazardous situations. but they do not address a full assessment of the occurrence of harm.
however, this is pivotal for risk assessment in medical devices, in particular for risks related to potential use errors. this is considered to be one of the main reasons why a change in risk parameters triggers a new premarket approval in the us according to [17]. also, the mdr [1] sets high requirements to address the final clinical impact and not only technical parameters. basically, the example emphasizes the importance of considering the interaction between man and machine, or in this case, between the algorithm and its clinical environment. this is addressed in the usability standards for medical devices, e.g. iso 62366 [19]. for this reason, iso 62366 requires that the final (summative) usability evaluation is performed using the final version of the device (in this case, the algorithm) or an equivalent version. this is in conflict with the fda proposal, which allows this assessment to be performed based on previous versions. at most, a predetermined relationship between technical parameters (alarm rates) and clinical effects (in particular, use related risks) can be obtained. for the usage of ml based devices, it remains crucial to consider the interaction between the device and the clinical environment, as there usually are important interrelationships. the outcome of an ml based algorithm always depends on the data it is provided with. whenever an input parameter is omitted which is clinically relevant, the resulting outcome of the ml based system is limited. in the presented scenarios, the pure alarm rates may not be the only clinically relevant outcomes, even though such parameters are usually the main focus regarding the quality of algorithms, e.g. in publications about ml based techniques. this is due to the fact that such quality measures are commonly considered the best available objective parameters, which allow a comparison of different techniques. this applies even more to other ml based techniques which are also very popular in the scientific community, like segmentation tasks in medical image analysis. here the standard quality measures are general distance metrics, i.e. differences between segmented areas [20]. they usually do not include specific clinical aspects like the accuracy in specific risk areas, e.g. important blood vessels or nerves. but such aspects are key factors for ensuring the safety of a clinical procedure in many applications. again, only technical parameters are typically in focus. the association with the clinical effects is not assessed accordingly. this situation is depicted in fig. 2 for the icu as well as the image segmentation cases. additionally, the validity of an outcome in medical treatments depends on many factors. regarding input data, multiple parameters from a patient's individual history may be important for deciding about a particular diagnosis or treatment. a surgeon usually has access to a multitude of data and also side conditions (like socio-economic aspects) which should be included in an individual diagnosis or treatment decision. the surgeon's general intelligence and background knowledge allow the inclusion of a variety of individual aspects which have to be considered for a specific case-based decision. in contrast, ml based algorithms rely on a more standardized structure of input data and are only trained for a specific purpose. they lack a more general intelligence which would allow them to react in very specific situations. even more, ml based algorithms need to generalize and thus to mask out very specific conditions, which could be fatal in some cases.
in [13], the fda presents some examples where changes of the inputs in an ml based samd are included. it is surprising that the fda considers some of them as candidates for a continuously learning system which does not need an additional review when a tailored sps/acp is available. such discrepancies between technical outcomes and clinical effects also apply to situations like the icu scenario, which only informs or drives clinical management. often users rely on automatically provided decisions, even when they are informed that this is only a proposal. again, this is a matter of man-machine interaction. this gets even worse due to the lack of explainability which ml based algorithms typically have [9, 21]. when surgeons or, more generally, users (e.g. icu staff) detect situations which require a diverging treatment because of very specific individual conditions, they should overrule the algorithm. but users will often be confused by the outcome of the algorithm and do not have a clear idea how they should treat conflicting results between the algorithm's suggestions and their own belief. as long as the ml based decision is not transparent to the user, they will not be able to merge these two directions. the ibm watson example referenced in the introduction shows that this actually is an issue [11]. this may be even more serious when the users (i.e. healthcare professionals) fear litigation because they did not trust the algorithm. in a situation where the algorithm's outcome finally turns out to be true, they may be sued because of this documented deviation. because of such issues, the eu general data protection regulation (gdpr) [22] requires that the users get autonomy regarding their decisions and transparency about the mechanisms underlying the algorithm's outcome [23]. this may be less relevant for the patients, who usually have only limited medical knowledge. they will probably also not understand the medical decisions in conventional cases. but it is highly relevant for responsible healthcare professionals. they require basic insights into how the decision emerged, as they finally are in charge of the treatment. this demonstrates that methods regarding the explainability of ml based techniques are important. fortunately, this is currently a very active field [21, 24]. this need for explainability applies to locked algorithms as well as to situations where continuous learning is applied. due to their data-driven nature, ml based techniques highly depend on a very high quality of the data which are provided for learning and validation. in particular, this is important for the analytical evaluation of the ml algorithms. one of the major aspects is bias due to unbalanced input data. for example, in [25] a substantially different detection rate between white people and people of color was recognized due to unbalanced data. beside ethical considerations, this demonstrates dependencies of the outcome quality on sub-populations, which may be critical in some cases. nevertheless, the fda proposal [13] currently does not consistently include specific requirements for assessing bias factors or imbalance of data. however, high quality requirements for data management are crucial for ml based devices. in particular, this applies to the icu "cont-learn" cases. there have to be very specific protocols that guarantee that new data and updates of the algorithms are highly reliable w.r.t. bias effects.
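one way such a protocol could look in practice is a per-subgroup performance check before an update is accepted; the following minimal sketch, with assumed column names, subgroups, and tolerances, is an illustration rather than a requirement stated in [13].

```python
# minimal sketch: reject a model update if performance degrades for any subgroup
# (column names, subgroup definition and tolerances are illustrative assumptions)
import pandas as pd
from sklearn.metrics import recall_score

def subgroup_bias_check(df: pd.DataFrame, y_pred_col: str = "alarm_pred",
                        y_true_col: str = "event", group_col: str = "subgroup",
                        min_sensitivity: float = 0.9,
                        max_gap: float = 0.05) -> bool:
    """True if every subgroup reaches min_sensitivity and gaps stay within max_gap."""
    sens = {}
    for group, part in df.groupby(group_col):
        sens[group] = recall_score(part[y_true_col], part[y_pred_col])
    worst, best = min(sens.values()), max(sens.values())
    return worst >= min_sensitivity and (best - worst) <= max_gap
```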
most of the currently used ml based algorithms fall under the category of supervised learning. thus, they require accurate and clinically sound labeling of the data. during data collection, it has to be ensured how this labeling is performed and how the data can be fed back into the system in a "cont-learn" scenario. additionally, the data needs to stay balanced, whatever this means in a situation where adaptations to sub-populations and/or local environments are intended for optimization. it is unclear whether and how this could be achieved by staff who only operate the system but possibly do not know potential algorithmic pitfalls. in the icu scenario, many data points probably need to be recorded by the system itself. thus, a precise and reliable recording scheme has to be established which, on the one hand, automatically avoids imbalance of data and, on the other hand, deals with the fusion with manual labelings. basically, the sps and acp (proposed in [13]) are tools to achieve this. the question is whether this is possible in a reliable fashion using automated processes. a complete closed loop validation approach seems questionable, especially when the assessment of the clinical impact has to be included. thus, the integration of humans, including adequate healthcare professionals as well as ml/ai experts with sufficient statistical knowledge, seems reasonable. at least, bias assessment steps should be included. as already mentioned, this is not addressed in [13] in a dedicated way. further on, the outcomes may be compromised by side effects in the data. it may be the case that the main reason for a specific outcome of the algorithm is not a relevant clinical parameter but a specific data artifact, i.e. some confounding factor. in the icu case, it could happen that the icu staff reacts early to a potentially critical situation and e.g. gives specific medication in advance to prevent upcoming problems. the physiological reaction of the patient can then be visible in the data as some kind of artifact. during its learning phase, the algorithm may recognize the critical situation not based on a deeper clinical reason, but by detecting the physiological reaction pattern. this may cause serious problems, as shown subsequently. in the presented scenario, the definition of the clinical situation and the pattern can be deeply coupled by design, since the labeling of the data by the icu staff and the administration of the medication will probably be done in combination at the particular site. this may increase the probability of such effects. usually, confounding factors are hard to determine. even when they can be detected, they are hard to communicate and manage in an appropriate way. how should healthcare professionals react when they get such potentially misleading information (see the discussion about liability)? this further limits the explanatory power of ml based systems. when confounders are not detected, they may have unpredictable outcomes w.r.t. the clinical effects. for example, consider the following case. in the icu scenario, an ml based algorithm is trained in a way that it basically detects the medication artifact described above during the learning phase. in the next step, this algorithm is used in clinical practice and the icu staff relies on the outcome of the algorithm. then, on the one hand, the medication artifact is not visible unless the icu staff administers the medication.
on the other hand, the algorithm does not recognize the pattern and thus does not provide an alarm. subsequently, the icu staff does not act appropriately to manage the critical situation. in particular, such confounders may be more likely in situations where a strong dependence between the outcome of the algorithm and the clinical treatment exists. further examples of such effects were discussed in [7] for icu scenarios. the occurrence of confounders may be a bit less probable in pure diagnostic cases without influence of the diagnostic task on the generation of data. but even here, such confounding factors may occur. the discussion in [10] provided examples where confounders may occur in diagnostic cases, e.g. because of rulers placed for measurements on radiographs. in most of the publications about ml based techniques, such side effects are not discussed (or only in a limited fashion). in many papers, the main focus is the technical evaluation and not the clinical environment and the interrelation between technical parameters and clinical effects. additional important aspects which are amply discussed in the context of ai/ml based systems are discrimination and fairness (see e.g. [10]). in particular, the eu places a high priority on fairness requirements in its future ai/ml strategy [9]. fairness is often closely related to bias effects, but it goes beyond them to more general ethical questions, e.g. regarding the natural tendency of ml based systems to favor specific subgroups. for example, the icu scenario "cont-learn" is intended to optimize w.r.t. specifics of sub-populations and local characteristics, i.e. it tries to make the outcome better for specific groups. based on such optimization, other groups (e.g. minorities, underrepresented groups) which are not well represented may be discriminated against in some sense. this is not a statistical but a systematic effect. superiority of a medical device for a specific subgroup (e.g. defined by gender, social environment, etc.) is not uncommon. for example, some diagnosis steps, implants, or treatments achieve deviating success rates when applied to women in comparison to men. this also applies to differences between adults and children. when assessing bias in the clinical outcome of ml based devices, it will probably often be unclear whether this is due to imbalance of data or a true clinical difference between the groups. does an ml based algorithm have to adjust the treatment of a subgroup to a higher level, e.g. a better medication, to achieve comparable results, when the analysis recognized worse results for this subgroup? another example could be a situation where the particular group does not have the financial capabilities to afford the high-level treatment. this could e.g. be the case in a developing country or in subgroups with a lower insurance level. in these cases, the inclusion of socio-economic parameters into the analysis seems to be unavoidable. subsequently, this compromises the notion of fairness as a basic principle in some way. this is nothing unique to ml based devices. but in the case of ml based systems with a high degree of automation, the responsibility for the individual treatment decision shifts more and more from the healthcare professional to the device. it is implicitly defined in the ml algorithm. in comparison to human reasoning, which allows some leeway in terms of individual adjustments of general rules, ml based algorithms are rather deterministic and unique in their outcome.
for a fixed input, they have one dedicated outcome (if we neglect statistical algorithms which may allow minor deviations). differences of opinion and room for individual decisions are main aspects of ethics. thus, it remains unclear how fairness can be defined and implemented at all when considering ml based systems. this is even more challenging as socio-economic aspects (even more than clinical aspects) are usually not included in the data and analysis of ml based techniques in medicine. additionally, they are hard to assess and implement in a fair way, especially when using automated validation processes. another disadvantage of ml based devices is the limited opportunities to fix systematic deficiencies in the outcome of the algorithm. let us assume that during the lifetime of the icu monitoring system a systematic deviation from the intended outcome was detected, e.g. in the context of post-market surveillance or due to an increased number of serious adverse events. according to standard rules, a proper preventive or corrective action has to be taken by the manufacturer. in conventional software devices, the error simply has to be eliminated, i.e. some sort of bug fixing has to be performed. for ml based devices it is less clear how bug fixing should work, especially when the systematic deficiency is deeply hidden in the data and/or the ml model. in these cases, there usually is no clear reason for the deficiency. subsequently, the deficiency cannot be resolved in a straightforward way using standard bug fixing. there is no dedicated route to find the deeper reasons and to perform changes which could cure the deficiencies, e.g. by providing additional data or changing the ml model. even more, other side effects may easily occur when data and model are changed manually with the intent to fix the issue. turning to the discussion and outlook: in summary, there are many open questions which are not yet clarified. there still is little experience of how ml based systems work in clinical practice and which concrete risks may occur. thus, the fda's commitment to foster the discussion about ml based samd is necessary and appreciated by many stakeholders, as the feedback docket [26] for [13] shows. however, it is a bit surprising that the fda proposes to substantially reduce its very high standards in [13] at this point of time. in particular, it is questionable whether an adequate validation can be achieved by using a fully automatic approach as proposed in [13]. ml based devices are usually optimized according to very specific goals. they can only account for the specific conditions that are reflected in the data and the used optimization / quality criteria. they do not include side conditions and a more general reasoning about potential risks in a complex environment. but this is important for medical devices. for this reason, a more deliberate path would be better suited, from the author's perspective. in a first step, more experience should be gained w.r.t. the use of ml based devices in clinical practice. thus, continuous learning should not be a first-hand option. first, it should be demonstrated that a device works in clinical practice before a continuous learning approach becomes possible. this can also be justified from a regulatory point of view. the automated validation process itself should be considered a feature of the device. it should be considered part of the design transfer which enables safe use of the device during its lifecycle.
as part of the design transfer, it should be validated itself. thus, it has to be demonstrated that this automated validation process, e.g. in terms of the sps and acp, works in a real clinical environment. ideally, this would be demonstrated during the application of the device in clinical practice. thus, one reasonable approach for a regulatory strategy could be to reduce or prohibit the options for enabling automatic validation in a first release / clearance of the device. during the lifetime, direct clinical data could be acquired to provide a better insight into the reliability and limitations of the automatic validation / continuous learning approach. in particular, the relation between technical parameters and clinical effects could be assessed on a broader and more stable basis. based on this evidence in real clinical environments, the automated validation feature could then be cleared in a second round. otherwise, the validity of the automated validation approach would have to be demonstrated in a comprehensive setting during the development phase. in principle, this is possible when enough data is available which truly reflects a comprehensive set of situations. as discussed in this paper, there are many aspects which render this approach not impossible but very challenging. in particular, this applies to the clinical effects and the interdependency between the users and the clinical environment on the one hand and the device, including the ml algorithm, data management, etc., on the other hand. this also includes not only variation in the status and needs of the individual patient but also the local clinical environment and potentially also the socio-economic setting. following a consistent process validation approach, it would have to be demonstrated that the algorithm reacts in a valid and predictable way no matter which training data have been provided, which environments have to be addressed, and which local adjustments have been applied. this also needs to include deficient data and inputs in some way. in [20], it has been shown that the variation of outcomes can be substantial, even w.r.t. rather simple technical parameters. in [20], this was analyzed for scientific contests ("challenges") where renowned scientific groups supervised the quality of the submitted ml algorithms. this demonstrates the challenges that validation steps for ml based systems still involve, even w.r.t. technical evaluation. for these reasons, it seems adequate to pursue the regulatory strategy in a more deliberate way. this includes the restriction of the "cont-learn" cases as proposed. it also includes a better classification scheme defining where automated or fully automatic validation is possible. currently, the proposal in [13] does not provide clear rules when continuous learning is allowed. it does not really address a dedicated risk-based approach that defines which options and limitations are applicable. for some options, like the change of the inputs, it should be reviewed whether automatic validation is a natural option. additionally, the dependency between technical parameters and clinical effects as well as risks should get more attention. in particular, the degree of interrelationship between the clinical actions and the learning task should be considered. in general, the discussions about ml based medical devices are very important. these techniques provide valuable opportunities for improvements in fields like medical technologies, where evidence based on high quality data is crucial.
this applies to the overall development of medicine as well as to the development of sophisticated ml based medical devices. it also includes the assessment of treatment options and of the success of particular devices during their lifetime. data-driven strategies will be important for ensuring high-level standards in the future. they may also strengthen regulatory oversight in the long term by amplifying the necessity of post-market activities. this seems to be one of the promises the fda envisions according to its concepts of "total product lifecycle quality (tplc)" and "organizational excellence" [13]. also, the mdr strengthens the requirements for data-driven strategies in the pre- as well as post-market phase. but it should not shift the priorities from a basically proven-quality-in-advance (ex-ante) approach to a primarily ex-post regulation, which in the extreme boils down to a trial-and-error oriented approach. thus, we should aim at a good compromise between pushing these valuable and innovative options on the one hand and addressing potential challenges and deficiencies on the other hand. computer-assisted technologies in medical interventions are intended to support the surgeon during treatment and improve the outcome for the patient. one possibility is to augment reality with additional information that would otherwise not be perceptible to the surgeon. in medical applications, it is particularly important that demanding spatial and temporal conditions are adhered to. challenges in augmenting the operating room are the correct placement of holograms in the real world, and thus the precise registration of multiple coordinate frames to each other, the exact scaling of holograms, and the performance capacity of processing and rendering systems. in general, two different scenarios can be distinguished. first, there are applications in which a placement of holograms with an accuracy of 1 cm and above is sufficient. these are mainly applications where a person needs a three-dimensional view of data. an example in the medical field may be the visualization of patient data, e.g. to understand and analyse the anatomy of a patient, for diagnosis or surgical planning. the correct visualization of these data can be of great benefit to the surgeon. often only 2d patient data is available, such as ct or mri scans. the availability of 3d representations depends strongly on the field of application. in neurosurgery 3d views are available but often not extensively utilized due to their limited informative value. additionally, computer monitors are a big limitation, because the data cannot be visualized at real-world scale. further application areas are the translation of known user interfaces into augmented reality (ar) space. the benefit here is that the surgeon refrains from touching anything, but can interact with the interface in space using hand or voice gestures. applications visualizing patient data, such as ct scans, only require a rough positioning of the image or holograms in the operating room (or). thus, the surgeon can conveniently place the application freely in space. the main requirement is then to keep the holograms in a constant position. therefore, the internal tracking of the ar device is sufficient to hold the holograms at a fixed position in space. the second scenario covers all applications in which an exact registration of holograms to the real world is required, in particular with a precision below 1 cm.
these scenarios are more demanding, especially when holograms must be placed precisely over real patient anatomy. to achieve this, patient tracking is essential to determine the position and to follow patient movements. the system therefore needs to track the patient and adjust the visualization to the current situation. furthermore, it is necessary to track and augment surgical instruments and other objects in the operating room. the augmentation needs to be visualized at the correct spatial position, and time constraints need to be fulfilled. therefore, the ar system needs to be embedded into the surgical workflow and react to it. to achieve these goals, modern state-of-the-art machine learning algorithms are required. however, the computing power on available ar devices is often not yet sufficient for sophisticated machine learning algorithms. one way to overcome this shortcoming is the integration of the ar system into a distributed system with higher capabilities, such as the digital operating theatre op:sense (see fig. 2). in this work the augmented reality system holomed [4] (see fig. 1) is integrated into the surgical research platform for robot-assisted surgery op:sense [5]. the objective is to enable high-quality and patient-safe neurosurgical procedures and to improve the surgical outcome by providing surgeons with an assistance system that supports them in cognitively demanding operations. the physician's perception limits are extended by the ar system, which is based on supporting intelligent machine learning algorithms. ar glasses allow the neurosurgeon to perceive the internal structures of the patient's brain. the complete system is demonstrated by applying this methodology to the ventricular puncture of the human brain, one of the most frequently performed procedures in neurosurgery. the ventricle system has an elongated shape with a width of 1-2 cm and is located at a depth of 4 cm inside the human head. patient models are generated quickly (< 2 s) from ct data [3], superimposed over the patient during the operation, and serve as a navigation aid for the surgeon. in this work the expanded system architecture is presented to overcome some limitations of the original system, where all information was processed on the microsoft hololens, which led to performance deficits. to overcome these shortcomings, the holomed project was integrated into op:sense for additional sensing and computing power. to achieve the integration of ar into the operating room and the surgical workflows, the patient, the instruments, and the medical staff need to be tracked. to track the patient, a marker system is fixated on the patient's head and the registration from the marker system to the patient is determined. a two-stage process was implemented for this purpose. first, the rough position of the patient's head on the or table is determined by applying a yolo v3 net to reduce the search space. then a robot with a mounted rgb-d sensor is used to scan the acquired area and build a point cloud of it. to determine the position of the patient's head in space as precisely as possible, a two-step surface matching approach is utilized. during recording, the markers are also tracked. with the known positions of the patient and the markers, the registration matrix can be calculated. for the ventricular puncture, a solution is proposed to track the puncture catheter in order to determine the depth of insertion into the human brain. by tracking the medical staff the system is able to react to the current situation, e.g.
if an instrument is passed. in the following, the solutions are described in detail. our digital operating room op:sense is illustrated in fig. 2a. to detect the patient's head, the coarse position is first determined with the yolo v3 cnn [6], performed on the kinect rgb image streams. the position in 3d is determined through the depth streams of the sensors. the or table and the robots are tracked with retroreflective markers by the arttrack system. this step reduces the spatial search area for the fine adjustment. the franka panda has an attached intel realsense rgb-d camera, as depicted in fig. 3. the precise determination of the position is performed on the depth data with surface matching. the robot scans the area of the coarsely determined position of the patient's head. a combined surface matching approach with feature-based and icp matching was implemented (a minimal registration sketch is given below). the process to perform the surface matching is depicted in fig. 4. in clinical reality, a ct scan of the patient's head is always performed prior to a ventricular puncture for diagnosis, such that we can safely assume the availability of ct data. a process to segment the patient models from ct data was proposed by kunz et al. in [3]. the algorithm processes the ct data extremely fast, in under two seconds. the data format is '.nrrd', a volume model format, which can easily be converted into surface models or point clouds. the point cloud of the patient's head ct scan is the reference model that needs to be found in or space. the second point cloud is recorded from the realsense depth stream mounted on the panda robot by scanning the previously determined rough position of the patient's head. all points are recorded in world coordinate space. the search space is further restricted with a segmentation step by filtering out points that are located on the or table. additionally, manual changes can be made by the surgeon. as a performance optimization, the resolution of the point clouds is reduced to decrease processing time without losing too much accuracy. the normals of both point clouds, generated from the ct data and from the recorded realsense depth stream, are subsequently calculated and harmonised. during this step, the harmonisation is especially important as the normals are often misaligned. this misalignment occurs because the ct data is a combination of several individual scans. for the alignment of all normals, a point inside the patient's head is chosen manually as a reference point, followed by orienting all normals in the direction of this point and subsequently inverting all normals to the outside of the head (see fig. 5). after the preprocessing steps, the first surface fitting step is executed. it is based on the initial alignment algorithm proposed by rusu et al. [8]. an implementation within the point cloud library (pcl) is used. therefore, fast point feature histograms need to be calculated as a preprocessing step. in the last step, an iterative closest point (icp) algorithm is used to refine the surface matching result. after the two point clouds have been aligned to each other, the inverse transformation matrix can be calculated to get the correct transformation from the marker system to the patient model coordinate space. as outlined in fig. 6, catheter tracking was implemented based on semantic segmentation using a full-resolution residual network (frrn) [7].
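for illustration only, the two-step registration described above could look roughly as follows; this is a sketch using open3d as a stand-in for the pcl-based implementation actually used, it assumes a recent open3d api, and the voxel sizes and thresholds are assumptions rather than the paper's parameters.

```python
# minimal sketch of the two-step surface matching (feature-based initial alignment + icp),
# using open3d as a stand-in for the pcl implementation described in the text
import numpy as np
import open3d as o3d

def preprocess(pcd, voxel, inner_point):
    pcd = pcd.voxel_down_sample(voxel)                       # reduce resolution
    pcd.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=2 * voxel, max_nn=30))
    # harmonise normals: orient towards an interior reference point, then flip outwards
    pcd.orient_normals_towards_camera_location(inner_point)
    pcd.normals = o3d.utility.Vector3dVector(-np.asarray(pcd.normals))
    fpfh = o3d.pipelines.registration.compute_fpfh_feature(
        pcd, o3d.geometry.KDTreeSearchParamHybrid(radius=5 * voxel, max_nn=100))
    return pcd, fpfh

def register_head(ct_cloud, scan_cloud, inner_point, voxel=0.004):
    src, src_fpfh = preprocess(ct_cloud, voxel, inner_point)
    tgt, tgt_fpfh = preprocess(scan_cloud, voxel, inner_point)
    # step 1: feature-based initial alignment (ransac over fpfh correspondences)
    coarse = o3d.pipelines.registration.registration_ransac_based_on_feature_matching(
        src, tgt, src_fpfh, tgt_fpfh, True, 3 * voxel,
        o3d.pipelines.registration.TransformationEstimationPointToPoint(False), 3, [],
        o3d.pipelines.registration.RANSACConvergenceCriteria(100000, 0.999))
    # step 2: icp refinement starting from the coarse alignment
    fine = o3d.pipelines.registration.registration_icp(
        src, tgt, voxel, coarse.transformation,
        o3d.pipelines.registration.TransformationEstimationPointToPlane())
    # forward transformation and its inverse (marker system to patient model space)
    return fine.transformation, np.linalg.inv(fine.transformation)
```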
after the semantic segmentation of the rgb stream of the kinect cameras, the image is fused with the depth stream to determine the voxels in the point cloud belonging to the catheter. as a further step, a density based clustering approach [2] is performed on the chosen voxels. this is due to noise, especially at the edges of the instrument voxels in the point cloud. based on the found clusters, an estimation of the three-dimensional structure of the catheter is performed. for this purpose, a narrow cylinder with variable length is constructed. the length is changed according to the semantic segmentation and the clustered voxels of the point cloud. the approach is applicable to identify a variety of instruments. the openpose [1] library is used to track key points on the bodies of the medical staff. available ros nodes have been modified to integrate openpose into the op:sense ros environment. the architecture is outlined in fig. 7. in this chapter, the results of the patient, catheter, and medical staff tracking are described. the approach to find the coarse position of a patient's head was performed on a phantom head placed on the or table within op:sense. multiple scenarios with changing illumination and occlusion conditions were recorded. the results are depicted in fig. 8 and the evaluation results are given in table 1. precise detection of the patient was performed with a two-stage surface matching approach. different point cloud resolutions were tested with regard to runtime behaviour. voxel grid edge sizes of 6, 4, and 3 mm were tested, with a higher edge size corresponding to a smaller point cloud. the matching results of the two point clouds were analyzed manually. an average accuracy of 4.7 mm was found, with an accuracy range between 3.0 and 7.0 mm. in the first stage of the surface matching, the two point clouds are coarsely aligned as depicted in fig. 9. in the second step, icp is used for fine adjustment. a two-stage process was implemented as icp requires a good initial alignment of the two point clouds. for catheter tracking, a precision of the semantic segmentation between 47% and 84% is reached (see table 3). tracking of instruments, especially neurosurgical catheters, is challenging due to their thin structure and non-rigid shape. detailed results on catheter tracking have been presented in [7]. the 3d estimation of the catheter is shown in fig. 10. the catheter was moved in front of the camera and the 3d reconstruction was recorded simultaneously. over a long period of the recording, over 90% of the catheter is tracked correctly; in some situations this drops to 50% or lower. the tracking of the medical staff is shown in fig. 11 (results of the medical staff tracking). the different body parts and joint positions are determined, e.g. the head, eyes, shoulders, elbows, etc. the library yielded very good results, as described in [1]. we reached a performance of 21 frames per second on a workstation (intel i7-9700k, geforce 1080 ti) processing one stream. as shown in the evaluation and discussed in the following, our approach succeeds in detecting the patient in an automated two-stage process with an accuracy between 3 and 7 mm. the coarse position is determined by using a yolo v3 net. the results under normal or conditions are very satisfying. the solution performance drops strongly under bright illumination conditions. this is due to large flares that occur on the phantom, as it is made of plastic or silicone.
however, these effects do not occur on human skin. the advantage of our system is that the detection is performed on all four kinect rgb streams, enabling different views of the operation area. unfavourable illumination conditions normally do not occur in all of these streams, therefore a robust detection is still possible. in the future, the datasets will be expanded with samples under strong illumination conditions. the subsequent surface matching of the head yields good results and a robust and precise detection of the patient. most important is a good preprocessing of the ct data and of the recorded point cloud of the search area, as described in the methods. the algorithm does not manage to find a result if there are larger holes in the point clouds or if the normals are not calculated correctly. additional challenges that have to be considered include skin deformities and noisy ct data. the silicone skin is not fixed to the skull (as human skin is), which leads to changes in position, some of which are greater than 1 cm. also, the processing time of 7 minutes is quite long and must be optimized in the future. the processing time may be shortened by reducing the size of the point clouds; however, in this case the matching results may also become worse. catheter tracking [7] yielded good results, despite the challenging task of segmenting a very thin (2.5 mm) and deformable object. additionally, a 3d estimation of the catheter was implemented. the results showed that in many cases over 90% of the catheter can be estimated correctly. however, these results strongly depend on the orientation and the quality of the depth stream. using higher quality sensors could improve the detection results. for tracking of the medical staff, openpose was used as a ready-to-use people detection algorithm and integrated into ros. the library produces very good results, despite the medical staff wearing surgical clothing. in this work the integration of augmented reality into the digital operating room op:sense is demonstrated. this makes it possible to expand the capabilities of current ar glasses. the system can determine the precise position of the patient by implementing a two-stage process. first, a yolo v3 net is used to coarsely detect the patient to reduce the search area. in a second, subsequent step, a two-stage surface matching process is implemented for refined detection. this approach allows for precise location of the patient's head for later tracking. further, an frrn-based solution to track surgical instruments in the or was implemented and demonstrated on a thin neurosurgical catheter for ventricular punctures. additionally, openpose was integrated into the digital or to track the surgical personnel. the presented solution will enable the system to react to the current situation in the operating room and is the basis for an integration into the surgical workflow. due to the emergence of commodity depth sensors, many classical computer vision tasks are employed on networks of multiple depth sensors, e.g. people detection [1] or full-body motion tracking [2]. existing methods approach these applications using a sequential processing pipeline where the depth estimation and inference are performed on each sensor separately and the information is fused in a post-processing step. in previous work [3] we introduce a scene-adaptive optimization schema which aims to leverage the accumulated scene context to improve perception as well as post-processing vision algorithms (see fig. 1).
in this work we present a proof-of-concept implementation of the scene-adaptive optimization methods proposed in [3] for the specific task of stereomatching in a depth sensor network. we propose to improve the 3d data acquisition step with the help of an articulated shape model, which is fitted to the acquired depth data. in particular, we use the known camera calibration and the estimated 3d shape model to resolve disparity ambiguities that arise from repeating patterns in a stereo image pair. the applicability of our approach can be shown by preliminary qualitative results. in previous work [3] we introduce a general framework for scene-adaptive optimization of depth sensor networks. it is suggested to exploit inferred scene context by the sensor network to improve the perception and post-processing algorithms themselves. in this work we apply the proposed ideas in [3] to the process of stereo disparity estimation, also referred to as stereo matching. while stereo matching has been studied for decades in the computer vision literature [4, 5] it is still a challenging problem and an active area of research. stereo matching approaches can be categorized into two main categories, local and global methods. while local methods, such as block matching [6] , obtain a disparity estimation by finding the best matching point on the corresponding scan line by comparing local image regions, global methods formulate the problem of disparity estimation as a global energy minimization problem [7] . local methods lead to highly efficient real-time capable algorithms, however, they suffer from local disparity ambiguities. in contrast, global approaches are able to resolve local ambiguities and therefore provide high-quality disparity estimations. but they are in general very time consuming and without further simplifications not suitable for real-time applications. the semi-global matching (sgm) introduced by hirschmuller [8] aggregates many feasible local 1d smoothness constraints to approximate global disparity smoothness regularization. sgm and its modifications are still offering a remarkable trade-off between the quality of the disparity estimation and the run-time performance. more recent work from poggi et al. [9] focuses on improving the stereo matching by taking additional high-quality sources (e.g. lidar) into account. they propose to leverage sparse reliable depth measurements to improve dense stereo matching. the sparse reliable depth measurements act as a prior to the dense disparity estimation. the proposed approach can be used to improve more recent end-to-end deep learning architectures [10, 11] , as well as classical stereo approaches like sgm. this work is inspired by [9] , however, our approach does not rely on an additional lidar sensor but leverages a priori scene knowledge in terms of an articulated shape model instead to improve the stereo matching process. we set up four stereo depth sensors with overlapping fields of view. the sensors are extrinsically calibrated in advance, thus their pose with respect to a world coordinates system is known. the stereo sensors are pointed at a mannequin and capture eight greyscale images (one image pair for each stereo sensor, the left image of each pair is depicted in fig. 3a) . for our experiments we use a high-quality laser scan of the mannequin as ground truth. we assume that the proposed algorithm has access to an existing shape model that can express the observed geometry of the scene in some capacity. 
in our experimental setup, we assume a shape model of a mannequin with two articulated shoulders and a slightly different shape in the belly area of the mannequin (see fig. 2). in the remaining section we use the provided shape model to improve the depth data generation of the sensor network. first, we estimate the disparity values of each of the four stereo sensors with sgm, without using the shape model. let $p$ denote a pixel and $q$ an adjacent pixel, let $D$ denote a disparity map and $d_p, d_q$ the disparities at pixel locations $p$ and $q$, and let $P$ denote the set of all pixels and $N$ the set of all pairs of adjacent pixels. then the sgm cost function can be defined as $E(D) = \sum_{p \in P} D(p, d_p) + \sum_{(p,q) \in N} R(p, d_p, q, d_q)$ (1), where $D(p, d_p)$ denotes the matching term (here the sum of absolute differences in a 7 × 7 neighborhood) which assigns a matching cost to the assignment of disparity $d_p$ to pixel $p$, and $R(p, d_p, q, d_q)$ penalizes disparity discontinuities between adjacent pixels $p$ and $q$. in sgm the objective given in (1) is minimized with dynamic programming, leading to the resulting disparity map $\hat{D} = \arg\min_D E(D)$. as input for the shape model fitting we apply sgm to all four stereo pairs, leading to four disparity maps as depicted in fig. 4a. to be able to exploit the articulated shape model for stereo matching, we initially need to fit the model to the 3d data obtained by classical sgm as described in 3.2. to be more robust to outliers, we only use disparity values from pixels with high contrast and transform them into 3d point clouds. since we assume that the relative camera poses are known, it is straightforward to merge the resulting point clouds in one world coordinate system. finally, the shape model is fitted to the merged point cloud by optimizing over the shape model parameters, namely the pose of the model and the rotation of the shoulder joints. we use an articulated mannequin shape model in this work as a proxy for an articulated human shape model (e.g. [2]) as a proof of concept and plan to transfer the proposed approach to real humans in future work. once the model parameters of the shape model are obtained, we can reproject the model fit to each sensor view by making use of the known projection matrices. fig. 3b shows the rendered wireframe mesh of the fitted model as an overlay on the camera images. for our guided stereo matching approach we then need the synthetic disparity map, which can be computed from the synthetic depth maps (a byproduct of 3d rendering). we denote the synthetic disparity image by $d_{synth}$. one synthetic disparity image is created for each stereo sensor, see fig. 4b. in the final step we exploit the existing shape model fit, in particular the synthetic disparity image $d_{synth}$ of each stereo sensor, and combine it with sgm (inspired by guided stereo matching [9]). our augmented objective extends (1) with a guiding term that favors disparities close to the synthetic disparity $d_{synth}(p)$; it is very similar to sgm and can be minimized in a similar fashion, leading to the final disparity estimation in our scene-adaptive depth sensor network. to summarize our approach, we exploit an articulated shape model fit to enhance sgm with minor adjustments. to show the applicability of our approach we present preliminary qualitative results. the results are depicted in fig. 4. using sgm without exploiting the provided articulated shape model leads to reasonable results, but the disparity map is very noisy and no clean silhouette of the mannequin is extracted (see fig. 4a).
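as an illustration of how such a guiding term might be realized, the following sketch modulates a per-pixel matching cost volume with a gaussian centered on the synthetic disparity, loosely following the cost-modulation idea of guided stereo matching [9]; the weighting factors k and c are assumptions and not the authors' exact formulation.

```python
# minimal sketch: modulate an sgm-style cost volume with a synthetic disparity prior
# (gaussian weighting; k and c are illustrative assumptions, not the paper's exact choice)
import numpy as np

def guide_cost_volume(cost_volume: np.ndarray, d_synth: np.ndarray,
                      valid: np.ndarray, k: float = 10.0, c: float = 1.0) -> np.ndarray:
    """cost_volume: (H, W, D) matching costs, lower is better.
    d_synth: (H, W) synthetic disparities rendered from the fitted shape model.
    valid: (H, W) boolean mask where the model fit provides a disparity hint."""
    h, w, d_max = cost_volume.shape
    disparities = np.arange(d_max)[None, None, :]               # (1, 1, D)
    gauss = np.exp(-(disparities - d_synth[..., None]) ** 2 / (2.0 * c ** 2))
    # increase the cost of disparities far away from the hint, leave others untouched
    weight = 1.0 + k * (1.0 - gauss)
    guided = cost_volume.copy()
    guided[valid] = cost_volume[valid] * weight[valid]
    return guided
```

the guided cost volume can then be aggregated and minimized exactly as in standard sgm.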
fitting our articulated shape model to the data leads to clean synthetic disparity maps as shown in fig. 4c , with a clean silhouette. in the belly area the synthetic model disparity map (fig. 4b) does not agree with the ground truth (fig. 4d) . the articulated shape model is not general enough to explain the recorded scene faithfully. using the guided stereo matching approach, we construct a much cleaner disparity map than sgm. in addition, the approach takes the current sensor data into account and exploits an existing articulated shape model. in this work we have proposed a method for scene-adaptive disparity estimation in depth sensor networks. our main contribution is the exploitation of a fitted human shape model to make the estimation of disparities more robust to local ambiguities. our early results indicate that our method can lead to more robust and accurate results compared to classical sgm. future work will focus on a quantitative evaluation as well as incorporating sophisticated statistical human shape models into our approach. inverse process-structure-property mapping abstract. workpieces for dedicated purposes must be composed of materials which have certain properties. the latter are determined by the compositional structure of the material. in this paper, we present the scientific approach of our current dfg funded project tailored material properties through microstructural optimization: machine learning methods for the modeling and inversion of structure-property relationships and their application to sheet metals. the project proposes a methodology to automatically find an optimal sequence of processing steps which produce a material structure that bears the desired properties. the overall task is split in two steps: first find a mapping which delivers a set of structures with given properties and second, find an optimal process path to reach one of these structures with least effort. the first step is achieved by machine learning the generalized mapping of structures to properties in a supervised fashion, and then inverting this relation with methods delivering a set of goal structure solutions. the second step is performed via reinforcement learning of optimal paths by finding the processing sequence which leads to the best reachable goal structure. the paper considers steel processing as an example, where the microstructure is represented by orientation density functions and elastic and plastic material target properties are considered. the paper shows the inversion of the learned structure-property mapping by means of genetic algorithms. the search for structures is thereby regularized by a loss term representing the deviation from process-feasible structures. it is shown how reinforcement learning is used to find deformation action sequences in order to reach the given goal structures, which finally lead to the required properties. keywords: computational materials science, property-structure-mapping, texture evolution optimization, machine learning, reinforcement learning the derivation of processing control actions to produce materials with certain, desired properties is the "inverse problem" of the causal chain "process control" -"microstructure instantiation" -"material properties". the main goal of our current project is the creation of a new basis for the solution of this problem by using modern approaches from machine learning and optimization. 
the inversion will be composed of two explicitly separated parts: "inverse structure-property-mapping" (spm) and "microstructure evolution optimization". the focus of the project lies on the investigation and development of methods which allow an inversion of the structure-property-relations of materials relevant in the industry. this inversion is the basis for the design of microstructures and for the optimal control of the related production processes. another goal is the development of optimal control methods yielding exactly those structures which have the desired properties. the developed methods will be applied to sheet metals within the frame of the project as a proof of concept. the goals include the development of methods for inverting technologically relevant "structure-property-mappings" and methods for efficient microstructure representation by supervised and unsupervised machine learning. adaptive processing path-optimization methods, based on reinforcement learning, will be developed for adaptive optimal control of manufacturing processes. we expect that the results of the project will lead to an increasing insight into technologically relevant process-structure-property-relationships of materials. the instruments resulting from the project will also promote the economically efficient development of new materials and process controls. in general, approaches to microstructure design make high demands on the mathematical description of microstructures, on the selection and presentation of suitable features, and on the determination of structure-property relationships. for example, the increasingly advanced methods in these areas enable microstructure sensitive design (msd), which is introduced in [1] and [2] and described in detail in [3] . the relationship between structures and properties descriptors can be abstracted from the concrete data by regression in the form of a structure-property-mapping. the idea of modeling a structure-property-mapping by means of regression and in particular using artificial neural networks was intensively pursued in the 1990s [4] and is still used today. the approach and related methods presented in [5] always consist of a structure-property-mapping and an optimizer (in [5] genetic algorithms) whose objective function represents the desired properties. the inversion of the spm can be alternatively reached via generative models. in contrast to discriminative models (e.g. spm), which are used to map conditional dependencies between data (e.g. classification or regression), generative models map the composite probabilities of the variables and can thus be used to generate new data from the assumed population. established, generative methods are for example mixture models [6] , hidden markov models [7] and in the field of artificial neural networks restricted boltzmann machines [8] . in the field of deep learning, generative models, in particular generative adversarial networks [9] , are currently being researched and successfully applied in the context of image processing. conditional generative models can generalize the probability of occurrence of structural features under given material properties. in this way, if desired, any number of microstructures could be generated. based on the work on the spm, the process path optimization in the context of the msd is treated depending on the material properties. 
for this purpose, the process is regarded as a sequence of structure-changing process operations which correspond to elementary processing steps. shaffer et al. [10] construct a so-called texture evolution network based on process simulation samples to represent the process. the texture evolution network can be considered as a graph with structures as vertices, connected by elementary processing steps as edges. the structure vertices are points in the structure-space and are mapped to the property-space by using the spm for property-driven process path optimization. in [11] one-step deformation processes are optimized to reach the most reachable element of a texture-set from the inverse spm. processes are represented by so-called process planes, principal component analysis (pca) projections of microstructures reachable by the process. the optimization is then conducted by searching for the process plane which best represents one of the texture-set elements. in [12], a generic ontology-based semantic system for processing path hypothesis generation (matcalo) is proposed and showcased. the required mapping of the structures to the properties is modeled based on data from simulations. the simulations are based on taylor models. the structures are represented using textures in the form of orientation density functions (odf), from which the properties are calculated. in the investigations, elastic and plastic properties are considered in particular. structural features are extracted from the odf for a more compact description. the project uses spectral methods such as generalized spherical harmonics (gsh) to approximate the odf. as an alternative representation we investigate the discretization in the orientation-space, where the orientation density is represented by a histogram. the solution of the inverse problem consists of a structure-property-mapping and an optimizer: as described in [4], the spm is modeled by regression using artificial neural networks; in this investigation, we use a multilayer perceptron. differential evolution (de) is used for the optimization problem. de is an evolutionary algorithm developed by rainer storn and kenneth price [13]. it is an optimization method which repeatedly improves a candidate solution set under consideration of a given quality measure over a continuous domain. the de algorithm optimizes a problem by taking a population of candidate solutions and generating new candidate solutions (structures) by mutating and recombining existing ones. the candidate solution with the best fitness is considered for further processing. for the generated structures, the reached properties are determined using the spm. the fitness f is composed of two terms, f = l_p + l_s: the property loss l_p, which expresses how close the property of a candidate is to the target property, and the structure loss l_s, which represents the degree of feasibility of the candidate structure in the process. the property loss is the mean squared error (mse) between the reached properties p_r ∈ P_r and the desired properties p_d ∈ P_d, l_p = (1/|p_d|) Σ_j (p_{r,j} − p_{d,j})^2. to ensure that the genetic algorithm generates reachable structures, a neural network is formed which functions as an anomaly detector. the data basis of this neural network are structures that can be reached by a process; the goal of the anomaly detection is to exclude unreachable structures. the anomaly detection is implemented using an autoencoder [14], a neural network (see fig. 1)
which consists of the following two parts: the encoder and the decoder. the encoder converts the input data to an embedding space; the decoder maps the embedding back to a reconstruction that is as close as possible to the original data. due to the reduction to an embedding space, the autoencoder performs data compression and extracts relevant features. the cost function for the structures is a distance function in the odf-space, which penalizes the network if it produces outputs that differ from the input. this cost function is also known as the reconstruction loss; it is computed from the original structures s_i ∈ S and the reconstructed structures ŝ_i ∈ Ŝ, with a small constant λ = 0.001 in the denominator to avoid division by zero. when used for anomaly detection, the autoencoder yields a high reconstruction loss if the input structures are very different from the reachable structures. the overall approach is shown in fig. 2 and consists of the following steps: 1. the genetic algorithm generates structures. 2. the spm determines the reached properties of the generated structures. 3. the structure loss l_s is determined by the reconstruction loss of the anomaly detector for the generated structures with respect to the reachable structures. 4. the property loss l_p is determined by the mse of the reached properties and the desired properties. 5. the fitness is calculated as the sum of the structure loss l_s and the property loss l_p. the structures resulting from the described approach form the basis for optimal process control. due to the forward mapping, the process evolution optimization based on texture evolution networks ([10]) is restricted to a-priori sampled process paths. [11] relies on linearization assumptions and is applicable to short process sequences only. [12] relies on a-priori learned process models in the form of regression trees and is also applicable to relatively short process sequences only. as an adaptive alternative for texture evolution optimization that can be trained to find process paths of arbitrary length, we propose methods from reinforcement learning. for desired material properties p_d, the inverted spm determines a set of goal microstructures s_d ∈ g which are very likely reachable by the considered deformation process. the texture evolution optimization objective is then to find the shortest process path p* starting from a given structure s_0 and leading close to one of the structures from g, i.e. p* = arg min_p K subject to e(s_0, p) ∈ g_τ, where p = (a_k)_{k=0,...,K}, K ≤ T, is a path of process actions a and T is the maximum allowed process length. the mapping e(s, p) = s_K delivers the resulting structure when applying p to the structure s. here, for the sake of simplicity, we assume the process to be deterministic, although the reinforcement learning methods we use are not restricted to deterministic processes. g_τ is a neighbourhood of g, the union of all open balls with radius τ and center points from g. to solve the optimization problem by reinforcement learning approaches, it must be reformulated as a markov decision process (mdp), which is defined by the tuple (s, a, p, r). in our case s is the space of structures s, a is the parameter-space of the deformation process, containing process actions a, and p : s × a → s is the transition function of the deformation process, which we assume to be deterministic. r_g : s × a → r is a goal-specific reward function.
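as a small illustration of this mdp, the sketch below writes down the deterministic transition and the goal-specific reward r_g for a toy structure space; the vector representation, the stand-in process actions, and the distance function are purely illustrative and are not the project's actual process model.

```python
import numpy as np

def make_reward(G, tau, transition, dist):
    """Binary goal-specific reward r_g(s, a): 1 if the next structure lands in the
    tau-neighbourhood g_tau of the goal set G, else 0 (cf. the binary reward below)."""
    def r_g(s, a):
        s_next = transition(s, a)                  # deterministic process step p(s, a)
        return 1.0 if min(dist(s_next, g) for g in G) < tau else 0.0
    return r_g

# stand-in example: structures as vectors, process actions as fixed linear operators
A_ops = [np.eye(3) * 0.9, np.eye(3) * 1.1]         # hypothetical deformation actions
transition = lambda s, a: A_ops[a] @ s
dist = lambda s, g: np.linalg.norm(s - g)
r_g = make_reward(G=[np.ones(3)], tau=0.1, transition=transition, dist=dist)
print(r_g(np.ones(3) * 1.05, 0))                   # 1.0: one shrink step reaches g_tau
```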
the objective of the reinforcement learning agent is then to find the optimal goal-specific policy π*_g(s_t) = a_t that maximizes the discounted future goal-specific reward v_g(s_t) = Σ_{k≥t} γ^{k−t} r_g(s_k, a_k), where γ ∈ [0, 1] discounts rewards attained later and thereby favours early rewards, the policy π_g(s_k) determines a_k, and the transition function p(s_k, a_k) determines s_{k+1}. for a distance function d in the structure space, the binary reward function r_g(s, a) = 1 if d(p(s, a), g) < τ and 0 otherwise (6), if maximized, leads to an optimal policy π*_g that yields the shortest path to g from every s for γ < 1. moreover, if v_g is given for every microstructure from g, p from eq. 4 is identical with the application of the policy π*_ζ, where ζ = arg max_g [v_g]. π*_g can be approached by methods from reinforcement learning. value-based reinforcement learning does so by learning expected discounted future reward functions [15]. one of these functions is the so-called value function v. in the case of a deterministic mdp and for a given g, this expectation value function reduces to v_g from eq. 4, and ζ can be extracted if v is learned for every g from g. to do so, a generalized form of expectation value functions can be learned, as is done e.g. in [16]. this exemplary mdp formulation shows how reinforcement learning can be used for texture evolution optimization tasks. the optimization thereby operates in the space of microstructures and does not rely on a-priori microstructure samples. when using off-policy reinforcement learning algorithms, and due to the generalization over goal-microstructures, the functions learned while solving a specific optimization task can easily be transferred to new optimization tasks (i.e. different desired properties or even a different property space). industrial robots are mainly deployed in large-scale production, especially in the automotive industry. today, there are already 26.1 industrial robots deployed per 1,000 employees on average in these industry branches. in contrast, small and medium-sized enterprises (smes) only use 0.6 robots per 1,000 employees [1]. reasons for this low usage of industrial robots in smes include the lack of flexibility in the face of great product variance and the high investment expenses due to additional peripherals required, such as gripping or sensor technology. the robot, as an incomplete machine, accounts for a fourth of the total investment costs [2]. due to the constantly growing demand for individualized products, robot systems have to be adapted to new production processes and flows [3]. this development requires the flexibilization of robot systems and the associated frequent programming of new processes and applications as well as the adaptation of existing ones. robot programming usually requires specialists who can adapt flexibly to different types of programming for the most diverse robots and can follow the latest innovations. in contrast to many large companies, smes often have no in-house expertise and a lack of prior knowledge with regard to robotics. this expertise often has to be obtained externally via system integrators, which, due to high costs, is one of the reasons for the inhibited use of robot systems. during the initial creation or extensive adaptation of process flows with industrial robots, there is a constant risk of injuring persons and damaging the expensive hardware components. therefore, the programs have to be tested under strict safety precautions and usually in a very slow test mode.
this makes the programming of new processes very complex and therefore time-and cost-intensive. the concept presented in this paper combines intuitive, gesture-based programming with simulation of robot movements. using a mixed reality solution, it is possible to create a simulation-based visualization of the robot and project, to program and to test it in the working environment without disturbing the workflow. a virtual control panel enables the user to adjust, save and generate a sequence of specific robot poses and gripper actions and to simulate the developed program. an interface to transfer the developed program to the robot controller and execute it by the real robot is provided. the paper is structured as follows. first, a research on related work is conducted in section 2, followed by a description of the system of the gesture-based control concept in section 3. the function of robot positioning and program creation is described in section 4. last follow the evaluation in section 5 and conclusion in section 6. various interfaces exist to program robots, such as lead-trough, offline or walk-trough programming, programming by demonstration, vision based programming or vocal commanding. in the survey of villani et al. [4] a clear overview on existing interfaces for robot programming and current research is provided. besides the named interfaces, the programming of robots using a virtual or mixed reality solution aims to provide intuitiveness, simplicity and accessibility of robot programming for non-experts. designed for this purpose, guhl et al. [5] developed a generic architecture for human-robot interaction based on virtual and mixed reality. in the marker tracking based approach presented by [6] and [7] , the user defines a collision-free-volume and generates and selects control points while the system creates and visualizes a path through the defined points. others [8] , [9] , [10] and [11] use handheld devices in combination with gesture control and motion tracking. herein, the robot can be controlled through gestures, pointing or via the device, while the path, workpieces or the robot itself are visualized on several displays. other gesture and virtual or mixed reality based concepts are developed by cousins et al. [12] or tran et al. [13] . here, the robots perspective or the robot in the working environment is presented to the user on a display (head-mounted or stationary) and the user controls the robot via gestures. further concepts using a mixed reality method enable an image of the workpiece to be imported into cad and the system automatically generates a path for robot movements [14] or visualizing the intended motion of the robot on the microsoft hololens, that the user knows where the robot will move to next [15] . other methods combine pointing at objects on an screen with speech instructions to control the robot [16] . sha et al. [17] also use a virtual control panel in their programming method, but for adjusting parameters and not for controlling robots. another approach pursues programming based on cognition, spatial augmented reality and multimodal input and output [18] , where the user interacts with a touchable table. krupke et al. [19] developed a concept in which humans can control the robot by head orientation or by pointing, both combined with speech. the user is equipped with a head-mounted display presenting a virtual robot superimposed over the real robot. 
the user can determine pick and place position by specifying objects to be picked by head orientation or by pointing. the virtual robot then executes the potential pick movement and after the user confirms by voice command, the real robot performs the same movement. a similar concept based on gesture and speech is persued by quintero et al. [20] , whose method offers two different types of programming. on the one hand, the user can determine a pick and place position by head orientation and speech commands. the system automatically generates a path which is displayed to the user, can be manipulated by the user and is simulated by a virtual robot. on the other hand, it is possible to create a path on a surface by the user generating waypoints. ostanin and klimchik [21] introduced a concept to generate collision-free paths. the user is provided with virtual goal points that can be placed in the mixed reality environment and between which a path is automatically generated. by means of a virtual menu, the user can set process parameters such as speed, velocity etc.. additionally, it is possible to draw paths with a virtual device and the movement along the path is simulated by a virtual robot. differently to the concept described in this paper, only a pick and place task can be realized with the concepts of [19] and [20] . a differentiation between movements to positions and gripper commands as well as the movement to several positions in succession and the generation of a program structure are not supported by these concepts. another distinction is that the user only has the possibility to show certain objects to the robot, but not to move the robot to specific positions. in [19] a preview of the movement to be executed is provided, but the entire program (pick and place movements) is not simulated. in contrast to [21] , with the concept presented in this paper it is possible to integrate certain gripper commands into the program. with [21] programming method, the user can determine positions but exact axis angles or robot poses cannot be set. overall, the approach presented in this paper offers an intuitive, virtual user interface without the use of handheld devices (cf. [6] , [7] , [8] , [9] , [10] and [11] ) which allows the exact positions of the robot to be specified. compared to other methods, such as [12] , [13] , [14] , [15] or [16] , it is possible to create more complex program structures, which include the specification of robot poses and gripper positions, and to simulate the program in a mixed reality environment with a virtual robot. in this section the components of the mixed reality robot programming system are introduced and described. the system consists of multiple real and virtual interactive elements, whereby the virtual components are projected directly into the field of view using a mixed reality (mr) approach. compared to the real environment, which consists entirely of real objects and virtual reality (vr), which consists entirely of virtual objects and which overlays the real reality, in mr the real scene here is preserved and only supplemented by the virtual representations [22] . in order to interact in the different realities, head-mounted devices similar to glasses, screens or mobile devices are often used. figure 1 provides an overview of the systems components and their interaction. 
the system presented in this paper includes kukas collaborative, lightweight robot lbr iiwa 14 r820 combined with an equally collaborative gripper from zimmer as real components and a virtual robot model and a user interface as virtual components. the virtual components are presented on the microsoft hololens. for calculation and rendering the robot model and visualization of the user interface, the 3d-and physics-engine of the unity3d development framework is used. furthermore, for supplementary functions, components and for building additional mr interactable elements, the microsoft mixed reality toolkit (mrtk) is utilized. for spatial positioning of the virtual robot, marker tracking is used, a technique supported by the vuforia framework. in this use case, the image target is attached to the real robot's base, such that in mr the virtual robot superimposes the real robot. the program code is written in c . the robot is controlled and programmed via an intuitive and virtual user interface that can be manipulated using the so-called airtap gesture, a gesture provided by microsoft hololens. ur-ai 2020 // 95 to ensure that the virtual robot mirrors the motion sequences and poses of the real robot, the most exact representation of the real robot is employed. the virtual robot consists of a total of eight links, matching the base and the seven joints of iiwa 14 r820: the base frame, five joint modules, the central hand and the media flange. the eight links are connected together as a kinematic chain. the model is provided as open source files from [23] and [24] and is integrated into the unity3d project. the individual links are created as gameobjects in a hierarchy, with the base frame defining the top level and are limited similar to those of the real robot. the cad data of the deployed gripping system is also imported into unity3d and linked to the robot model. the canvas of the head-up displayer of the microsoft hololens is divided into two parts and rendered at a fixed distance in front of the user and on top of the scene. at the top left side of the screen the current joint angles (a1 to a7) are displayed and on the left side the current program is shown. this setting simplifies the interaction with the robot as the informations do not behave like other objects in the mr scene, but are attached to the head up display (hud) and move with the user's field of view. the user interface, which consists of multiple interactable components, is placed into the scene and is shown at the right side of the head-up display. at the beginning of the application the user interface is in "clear screen" mode, i.e. only the buttons "drag", "cartesian", "joints", "play" and "clear screen" and the joint angles at the top left of the screen are visible. for interaction with the robot, the user has to switch into a particular control mode by tapping the corresponding button. the user interface provides three different control modes for positioning the virtual robot: -drag mode, for rough positioning, -cartesian mode, for cartesian positioning and -joint mode, for the exact adjustment of each joint angle. figure 2 shows the interactable components that are visible and therefore controllable in the respective control modes. depending on the selected mode, different interactable components become visible in the user interface, with whom the virtual robot can be controlled. in addition to the control modes, the user interface offers further groups of interactable elements: -motion buttons, with which e.g. 
the speed of the robot movement can be adjusted or the robot movement can be started or stopped, -application buttons, to save or delete specific robot poses, for example, -gripper buttons, to adjust the gripper and -interface buttons, that enable communication with the real robot. this section focuses on the description of the usage of the presented approach. in addition to the description of the individual control modes, the procedure for creating a program is also described. as outlined in section 3.2, the user interface consists of three different control modes and four groups of further interactable components. through this concept, the virtual robot can be moved efficiently to certain positions with different movement modes, the gripper can be adjusted, the motion can be controlled and a sequence of positions can be chained. drag by gripping the tool of the virtual robot with the airtap gesture, the user can "drag" the robot to the required position. additionally, it is possible to rotate the position of the robot using both hands. this mode is particularly suitable for moving the robot very quickly to a certain position. cartesian this mode is used for the subsequent positioning of the robot tool with millimeter precision. the tool can be translated to the required positions using the cartesian coordinates x, y, z and the euler angles a, b, c. the user interface provides a separate slider for each of the six translation options.the tool of the robot moves analogously to the respective slider button, which the user can set to the required value. joints this mode is an alternative to the cartesian method for exact positioning. the joints of the virtual robot can be adjusted precisely to the required angle, which is particularly suitable for e.g. bypassing an obstacle. there is a separate slider for each joint of the virtual robot. in order to set the individual joint angles, the respective slider button is dragged to the required value, which is also displayed above the slider button for better orientation. to program the robot, the user interface provides various application buttons, such as saving and removing robot poses from the chain and a display of the poses in the chain. the user directs the virtual robot to the desired position and confirms using the corresponding button. the pose of the robot is then saved as joint angles from a1 to a7 and one gripper position in a list and is displayed on the left side of the screen. when running the programmed application, the robot moves to the saved robot poses and gripper positions according to the defined sequence. for a better orientation, the robots current target position changes its color from white to red. after testing the application, the list of robot poses can be sent to the controller of the real robot via a webservice. the real robot then moves analogously to the virtual robot to the corresponding robot poses and gripper positions. the purpose of the evaluation is how the gesture-based control concept compares to other concepts regarding intuitiveness, comfort and complexity. for the evaluation, a study was conducted with seven test persons, who had to solve a pick and place task with five different operating concepts and subsequently evaluate them. the developed concept based on gestures and mr was evaluated against a lead through procedure, programming with java, programming with a simplified programming concept and approaching and saving points with kuka smartpad. 
the test persons had no experience with microsoft hololens and mr, no to moderate experience with robots and no to moderate programming skills. the questionnaire for the evaluation of physical assistive devices (quead) developed by schmidtler et al [25] was used to evaluate and compare the five control concepts. the questionnaire is classified into five categories (perceived usefulness, perceived ease of use, emotions, attitude and comfort) and contains a total of 26 questions, rated on an ordinal scale from 1 (entirely disagree) to 7 (entirely agree). firstly, each test person received a short introduction to the respective control concept, conducted the pick and place task and immediately afterwards evaluated the respective control concept using quead. all test persons agreed that they would reuse the concept in future tasks (3 mostly agree, 4 entirely agree). in addition, the test persons considered the gesture-based concept to be intuitive (1 mostly agree, 4 entirely agree), easy to use (5 mostly agree, 2 entirely agree) and easy to learn (1 mostly agree, 6 entirely agree). two test persons mostly agree and four entirely agree that the gesture-based concept enabled them to solve the task efficiently and four test persons mostly agree and two entirely agree that the concept enhances their work performance. all seven subjects were comfortable using the gesturebased concept (4 mostly agree, 2 entirely agree). overall, the concept presented in this paper was evaluated as more comfortable, more intuitive and easier to learn than the other control concepts evaluated. in comparison to them, the new operating concept was perceived as the most useful and easiest to use. the test persons felt physically and psychologically most comfortable when using the concept and were most positive in total. in this paper, a new concept for programming robots based on gestures and mr and for simulating the created applications was presented. this concept forms the basis for a new, gesture-based programming method, with which it is possible to project a virtual robot model of the real robot into the real working environment by means of a mr solution, to program it and to simulate the workflow. using an intuitive virtual user interface, the robot can be controlled by three control modes and further groups of interactable elements and via certain functions, several robot positions can be chained as a program. by using this concept, test and simulation times can be reduced, since on the one hand the program can be tested directly in the mr environment without disturbing the workflow. on the other hand, the robot model is rendered into the real working environment via the mr approach, thus eliminating the need for time-consuming and costly modeling of the environment. the results of the user study indicate that the control concept is easy to learn, intuitive and easy to use. this facilitates the introduction of robots and especially in smes, since no expert knowledge is required for programming, programs can be created rapidly and intuitively and processes can be adapted flexibly. in addition, the user study showed that tasks can be solved efficiently and the concept is perceived as performance-enhancing. potential directions of improvement are: implement various movement types, such as point-to-point, linear and circular movements in the concept. this makes the robot motion more flexible and efficient, since positions can be approached in different ways depending on the situation. 
another improvement is to extend the concept with collaborative functions of the robot, such as force sensitivity or the ability to conduct search movements. in this way, the functions that make collaborative robots special can be integrated into the program structure. a further approach for improvement is to engage in a larger scale study. in 2019 the world's commercial fleet consists of 95,402 ships with a total capacity of 1,976,491 thousand dwt. (a plus of 2.6 % in carrying capacity compared to last year) [1] . according to the international chamber of shipping, the shipping industry is responsible for about 90 % of all trade [2] . in order to ensure the safe voyage of all participant in the international travel at sea, the need for monitoring is steadily increasing. while more and more data regarding the sea traffic is collected by using cheaper and more powerful sensors, the data still needs to be processed and understood by human operators. in order to support the operators, reliable anomaly detection and situation recognition systems are needed. one cornerstone for this development is a reliable automatic classification of vessels at sea. for example by classifying the behaviour of non cooperative vessels in ecological protected areas, the identification of illegal, unreported and unregulated (iuu) fishing activities is possible. iuu fishing is in some areas of the world a major problem, e. g., »in the wider-caribbean, western central atlantic region, iuu fishing compares to 20-30 percent of the legitimate landings of fish« [3] resulting in an estimated value between usd 700 and 930 million per year. one approach for gathering information on the sea traffic is based on the automatic identification system (ais) 3 . it was introduced as a collision avoidance system. as each vessel is broadcasting its information on an open channel, the data is often used for other purposes, like training and validating of machine learning models. ais provides dynamic data like position, speed and course over ground, static data like mmsi 4 , shiptype and length, and voyage related data like draught, type of cargo, and destination about a vessel. the system is self-reporting, it has no strong verification of transmission, and many of the fields in each message are set by hand. therefore, the data can not be fully trusted. as harati-mokhtari et al. [4] stated, half of all ais messages contain some erroneous data. as for this work, the dataset is collected by using the ais stream provided by aishub 5 , the dataset is likely to have some amount of false data. while most of the errors will have no further consequences, minor coordinate inaccuracies or wrong vessel dimensions are irrelevant, some false information in vessel information can have an impact on the model performance. classification of maritime trajectories and the detection of anomalies is a challenging problem, e.g., since classifications should be based on short observation periods, only limited information is available for vessel identification. riveiro et al. [5] give a survey on anomaly detection at sea, where shiptype classification is a subtype. jiang et al. [6] present a novel trajectorynet capable of point-based classification. their approach is based on the usage of embedding gps coordinates into a new feature space. the classification itself is accomplished using an long short-term memory (lstm) network. further, jiang et al. 
[7] propose a partition-wise lstm (plstm) for point-based binary classification of ais trajectories into fishing or non-fishing activity. they evaluated their model against other recurrent neural networks and achieve a significantly better result than common recurrent network architectures based on lstm or gated recurrent units. a recurrent neural network is used by nguyen et al. in [8] to reconstruct incomplete trajectories, detect anomalies in the traffic data and identify the real type of a vessel. they are embedding the position data to generate a new representation as input for the neural network. besides these neural network based approaches, other methods are also used for situation recognition tasks in the maritime domain. especially expert-knowledge based systems are used frequently, as illegal or at least suspicious behaviour is not recorded as often as desirable for deep learning approaches. conditional random fields are used by hu et al. [9] for the identification of fishing activities from ais data. the data has been labelled by an expert and contains only longliner fisher boats. saini et al. [10] propose an hidden markov model (hmm) based approach to the classification of trajectories. they combine global-hmm and segmental-hmm using a genetic algorithm. in addition, they tested the robustness of the framework by adding gaussian noise. in [11] fischer et al. introduce a holistic approach for situation analysis based on situation-specific dynamic bayesian networks (ssdbn). this includes the modelling of the ssdbn as well as the presentation to end-users. for a bayesian network, the parametrisation of the conditional probability tables is crucial. fischer introduces an algorithm for choosing these parameters in a more transparent way. important for the functionality is the ability of the network to model the domain knowledge and the handling of noisy input data. for the evaluation, simulated and real data is used to assess the detection quality of the ssdbn. based on dbns, anneken et al. [12] implemented an algorithm for detecting illegal diving activities in the north sea. as explained by de rosa et al. [13] an additional layer for modelling the reliability of different sensor sources is added to the dbn. in order to use the ais data, preprocessing is necessary. this includes cleaning wrong data, filtering data, segmentation, and calculation of additional features. the whole workflow is depicted in figure 1 . the input in form of ais data and different maps is shown as blue boxes. all relevant mmsis are extracted from the ais data. for each mmsi, the position data is used for further processing. segmentation into separate trajectories is the next step (yellow). the resulting trajectories are filtered (orange). based on the remaining trajectories, geographic (green) and trajectory (purple) based features are derived. for each of the resulting sequences, the data is normalized (red), which results in the final dataset. only the 6 major shiptypes in the dataset are used for the evaluation. these are "cargo", "tanker", "fishing", "passenger", "pleasure craft" and "tug". due to their similar behaviour, "cargo" and "tanker" will combined to a single class "cargo-tanker". figure 1 : visualization of all preprocessing steps. input in blue, segmentation in yellow, filtering in orange, geographic features in green, trajectory feature in purple and normalization in red. 
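as a minimal, runnable illustration of the segmentation step in this workflow (using the gap thresholds of 2 hours and a squared coordinate distance of 10^-4 detailed in the next paragraph), the following sketch splits one vessel's position reports into trajectories; the array layout and function name are assumptions, not the authors' implementation.

```python
import numpy as np

def segment_trajectory(points, time_gap=2 * 3600, dist_sq_gap=1e-4):
    """Split one vessel's AIS points into separate trajectories at large gaps.

    points: (N, 3) array of (unix timestamp, lon, lat) rows, sorted by time.
    A new trajectory starts whenever the time gap exceeds 2 hours or the squared
    coordinate distance between successive samples exceeds 1e-4 (cf. the text).
    """
    dt = np.diff(points[:, 0])
    dpos_sq = np.sum(np.diff(points[:, 1:3], axis=0) ** 2, axis=1)
    cut_indices = np.where((dt > time_gap) | (dpos_sq > dist_sq_gap))[0] + 1
    return np.split(points, cut_indices)

# toy example: a 3-hour gap forces a split into two trajectories
pts = np.array([[0.0, 10.000, 53.000],
                [60.0, 10.001, 53.001],
                [60.0 + 3 * 3600, 10.002, 53.002]])
print(len(segment_trajectory(pts)))   # 2
```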
four different trajectory features are used: the time difference, the speed over ground, the course over ground, and a trajectory transformation. as the incoming ais data are not necessarily uniformly distributed in time, a feature representing the time dimension is needed; therefore, the time difference between two samples is introduced. as speed and course over ground are directly accessible through the ais data, the network is fed with these features directly. the vessel's speed is a numeric value in 0.1-knot resolution in the interval [0; 1022] and the course is the clockwise (negative) angle in degrees relative to true north and therefore lies in the interval [0; 359]. the position is transformed in two ways (see the sketch after this paragraph). the first transformation, further called "relative-to-first", shifts the trajectory to start at the origin. the second transformation, henceforth called "rotate-to-zero", rotates the trajectory in such a way that the end point lies on the x-axis. in addition to the trajectory-based features, two geographic features are derived by using coastline maps 6 and a map of large harbours. the coastline map consists of a list of line strips. in order to reduce complexity, only the edge points are used to calculate the "distance-to-coast", and only a lower-resolution version of the shapefile is used. figure 2 shows the "high" and "low" resolutions for some fjords in norway. with the geoindex cell size set to 40 km, a radius of 20 km can be queried. the world's 140 major harbours based on the world port index 7 are used to calculate the "distance-to-closest-harbor". as fishing vessels are expected to stay near a certain harbour, this feature should help the network to identify some shiptypes. for this feature the geoindex cell size is set to 5,000 km, resulting in a maximum radius of 2,500 km. the data is split into separate trajectories by using gaps in either time or space, or the sequence length. as real ais data is used, packet loss during transmission is common. this problem is tackled by splitting the data if the time between two successive samples is larger than 2 hours, or if the distance between two successive samples is large. regarding the distance, even though the great circle distance is more accurate, the euclidean distance is used. for simplicity the squared distance in coordinate space is used, with a threshold of 10^-4; depending on latitude this corresponds to a value of about 1 km at the equator and only about 600 m at 60° n. since the calculation involves approximations, a relatively high threshold is chosen. as the neural network depends on a fixed input size, the data is split into fitting chunks by cutting and padding with these rules: longer sequences are split into chunks according to the desired sequence length; any left-over sequence shorter than 80 % of the desired length is discarded; the others are padded with zeroes. this results in segmented trajectories of similar but not necessarily identical duration. as this work is about vessel behaviour at sea, stationary vessels (anchored and moored vessels) and vessels traversing rivers are removed from the segmented trajectories. the stationary vessels are identified using a measure of movement α_stationary computed over a trajectory, where n is the sequence length and p_i are its data points; a trajectory is removed if α_stationary falls below a certain threshold. a shapefile 8 containing the major and most minor rivers is used in order to remove vessels that are not on the high seas.
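the two positional transformations described above can be sketched as follows; trajectories are assumed to be (n, 2) arrays of positions, which is an illustrative simplification.

```python
import numpy as np

def relative_to_first(traj):
    """Shift a trajectory (N, 2 array of positions) so that it starts at the origin."""
    return traj - traj[0]

def rotate_to_zero(traj):
    """Rotate a shifted trajectory so that its end point lies on the x-axis."""
    shifted = relative_to_first(traj)
    angle = np.arctan2(shifted[-1, 1], shifted[-1, 0])   # heading of the end point
    c, s = np.cos(-angle), np.sin(-angle)
    rot = np.array([[c, -s], [s, c]])
    return shifted @ rot.T

traj = np.array([[10.0, 53.0], [10.1, 53.1], [10.3, 53.1]])   # toy lon/lat sequence
print(rotate_to_zero(traj)[-1])                               # end point ≈ [r, 0]
```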
a sequence with more than 50 % of its points on a river is removed from the dataset. in order to speed up the training process, the data is normalized in the interval [0; 1] by applying here, for the positional features a differentiation between "global normalization" and "local normalization" is taken into account. the "global normalization" will scale the input data for the maximum x max and minimum x min calculated over the entire data set, while "local normalization" will estimate the maximum x max and minimum x min only over the trajectory itself. as the data is processed parallel, the parameters for the "global normalization" will be calculated only for each chunk of data. this will result in slight deviations in the minimum and maximum, but for large batches this should be neglectable. all other additional features are normalized as well. for the geographic features "distance-to-coast" and "distance-to-closest-harbor" the maximum distance, that can be queried depending on grid size, is used as x max and 0 is used as the lower bound x min . the time difference feature is scaled using a minimum x min of 0 and the threshold for the temporal gap since this is the maximum value for this feature. speed and course are normalized using 0 and their respective maximum values. for the dataset, a period between 2018-07-24 and 2018-11-15 is used. altogether 209,536 unique vessels with 2,144,317,101 raw data points are included. using this foundation and the previously described methods, six datasets are derived. all datasets use the same spatial and temporal thresholds. in addition, filter thresholds are identical as well. the datasets differentiate in their sequence length and by applying only the "relativeto-first" transformation or additionally the "rotate-to-zero" transformation. either 360, 1,080, or 1,800 points per sequence are used resulting in approximate 1 h, 3 h, or 5 h long sequences. in figure 3 , the distribution of shiptypes in the datasets after applying the different filters is shown. for the shiptype classification, neural networks are chosen. the different networks are implemented using keras [14] with tensorflow as backend [15] . fawaz et al. [16] have shown, that, despite their initial design for image data, a residual neural network (resnet) can perform quite well on time-series classification. thus, as foundation for the evaluated architectures the resnet is used. the main difference to other neural network architectures is the inclusion of "skip connections". this allows for deeper networks by circumventing the vanishing gradient problem during the training phase. based on the main idea of a resnet, several architectures are designed and evaluated for this work. some information regarding the structure are given in table 1 . further, the single architectures are depicted in figures 4a to 4f . the main idea behind these architectures is to analyse the impact of the depth of the networks. furthermore, as the features itself are not necessarily logically linked with each other, the hope is to be able to capture the behaviour better by splitting up the network path for each feature. to verify the necessity of cnns two multilayer perceptron (mlp) based networks are tested: one with two hidden layers and one with four hidden layers, all with 64 neurons and fully connected with their adjacent layers. the majority of the parameters for the two networks are bound in the first layer. they are necessary to map the large number of input neurons, e. 
g., for the 360 samples dataset 360 * 9 = 3,240 input neurons, to the first hidden layer. each of the datasets is split into three parts: 64 % for the training set, 16 % for the validation set, and 20 % for the test set. for solving or at least mitigating the problem of overfitting, regularization techniques (input noise, batch normalization, and early stopping) are used. small noise on the input in the training phase is used to support the generalization of the network. for each feature a normal distribution with a standard deviation of 0.01 and a mean of 0 is used as noise. furthermore, batch normalization is implemented. this means, before each relulayer a batch normalization layer is added, allowing higher learning rates. therefore, the initial learning rate is doubled. additionally, the learning rate is halved if the validation error does not improve after ten training epochs, improving the training behaviour during oscillation on a plateau. in order to prevent overfitting, an early stopping criteria is introduced. the training will be interrupted if the validation error is not decreasing after 15 training epochs. to counter the dataset imbalance, class weights were considered but ultimately did not lead to better classification results and were discarded. the different neural network architectures are evaluated on a amd ryzen threadripper batch normalization and the input noise is tested. the initial learning rate is set to 0.001 without batch normalization and 0.002 with batch normalization activated. the maximum number of epochs is set to 600. the batch sizes are set to 64, 128, and 256 for 360, 1,080, and 1,800 samples per sequence respectively. in total 144 different setups are evaluated. furthermore, 4 additional networks are trained on the 360 samples dataset with "relative-to-first" transformation. two mlps to verify the need of deep neural networks, and the shallow and deep resnet trained without geographic features to measure the impact of these features. (f) "rtz" with 1,800 samples shown. the first row shows the results for the "relative-to-first" (rtf) transformation, the second for the "rotate-to-zero" (rtz) transformation. the results for the six different architectures are depicted in figure 5 . for 360 samples the shallow resnet and the deep resnet outperformed the other networks. in case of the "relative-to-first" transformation (see figure 5a ), the shallow resnet achieved an f 1 -score of 0.920, while the deep resnet achieved 0.919. for the "rotate-to-zero" transformation (see figure 5d ), the deep resnet achieved 0.918 and the shallow resnet 0.913. in all these cases the regularization methods lead to no improvements. the "relative-to-first" transformation performs slightly better overall. for the datasets with 360 samples per sequence, the standard resnet variants achieve higher f 1 -scores compared to the split resnet versions. but this difference is relatively small. as expected, the tiny resnet is not large and deep enough to classify the data on a similar level. for the "relative-first" transformation and trajectories based on 1080 samples (see figure 5b ), the split resnet and the total split resnet achieve the best results. the first performed well with an f 1 -score of 0.913, while the latter is slightly worse with 0.912. in both cases again the regularization did not improve the result. 
for the "rotateto-zero" transformation (see figure 5e ), the shallow resnet achieved an f 1 -score of 0.907 without any regularization and 0.905 with only the the noise added to the input. for the largest sequence length of 1,800 samples, the split based networks slightly outperform the standard resnets. for the "relative-to-first" transformation (see figure 5c ), the split resnet achieved an f 1 -score of 0.911, while for the "rotate-to-zero" transformation (see figure 5f ) the total split resnet reached an f 1 -score of 0.898. again without noise and batch normalization. to verify, that the implementation of cnns is actually necessary, additional tests with mlps were carried out. two different mlps are trained on the 360 samples dataset with "relative-to-first" transformation since this dataset leads to best results for the resnet architectures. both networks lead to no results as their output always is the "cargo-tanker" class regardless of the actual input. the only thing the models are able to learn is, that the "cargo-tanker" class is the most probable class based on the uneven distribution of classes. an mlp is not the right model for this kind of data and performs badly. the large dimensionality of even the small sequence length makes the use of the fully connected networks impracticable. probably, further hand-crafted feature extraction is needed to achieve better results. to measure the impact the feature "distance to coast" and "distance to closest harbor" have on the overall performance, a shallow resnet and a deep resnet are trained on the 360 sample length data set with the "relative-to-first" transformation excluding these features. the trained networks have f 1 -scores of 0.888 and 0.871 respectively. this means, by including this features, we are able to increase the performance by 3.5 %. the "relative-to-first" transformation compared to the "rotate-to-zero" transformation yields the better results. especially, this is easily visible for the longest sequence length. a possible explanation can be seen in the "stationary" filter. this filter removes more trajectories for the "relative-to-first" transformation than for the additional "rotate-to-zero" transformation. a problem might be, that the end point is used for rotating the trajectory. this adds a certain randomness to the data, especially for round trip sequences. in some cases, the stretched deep resnet is not able to learn the classes. it is possible, that there is a problem with the structure of the network or the large number of parameters. further, there seems to be a problem with the batch normalization, as seen in figures 5c and 5e . the overall worse performance of the "rotate-to-zero" transformation could be because of the difference in the "stationary" filter. in the "rotate-to-zero" dataset, fewer sequences are filtered out. the filter leads to more "fishing" and "pleasure craft" sequences in relation to each other as described in section 3.6. this could also explain the difference in class prediction distribution since the network is punished more for mistakes in these classes because more classes are overall from this type. for the evaluation, the expectation based on previous work by other authors was, that the shorter sequence length should perform worse compared to the longer ones. instead the shorter sequences outperform the longer ones. the main advantages of the shorter sequences are essentially the larger number of sequences in the dataset. 
for example the 360 samples dataset with "relative-to-first" transformation contains about 2.2 million sequences, while the corresponding 1,800 sample dataset contains only approximately 250,000 sequences. in addition, the more frequent segmentation can yield more easily classifiable sequences: the behaviour of a fishing vessel in general contains different characteristics, like travelling from the harbour to the fishing ground, the fishing itself, and the way back. the travelling parts are similar to other vessels and only the fishing part is unique. a more aggressive segmentation will yield more fishing sequences, that will be easier to classify regardless of observation length. the shallow resnet has the overall best results by using the 360 samples dataset and the "relative-to-first" transformation. the results for this setup are shown in the confusion matrix in figure 6 . as expected, the tiny resnet is not able to compete with the others. the other standard resnet architectures performed well, especially on shorter sequences. the split architectures are able to perform better on datasets with longer sequences, with the shallow resnet achieving similar performance. comparing the number of parameters, all three architectures have about 400,000 the shallow resnet about 50,000 more, the total split resnet about 40,000 less. only on the dataset with more sequences, the deep resnet performs well. this correlates with the need of more information due to the larger parameter count. due to the reduced flexibility, the split architecture can be interpreted as a "head start". this means, that the network has already information regarding the structure of the data, which in turn does not need to be extracted from the data. this can result in a better performance for smaller datasets. all in all, the best results are always achieved by omitting the suggested regularization methods. nevertheless, the batch normalization had an effect on the learning rate and needed training epochs: the learning rate is higher and less epochs are needed before convergence. based on the resnet, several architectures are evaluated for the task of shiptype classification. from the initial dataset based on ais data with over 2.2 billion datapoints six datasets with different trajectory length and preprocessing steps are derived. further to the kinematic information included in the dataset, geographical features are generated. each network architecture is evaluated with each of the datasets with and without batch normalization and input noise. overall the best result is an f 1 -score of 0.920 with the shallow resnet on the 360 samples per sequence dataset and a shift of the trajectories to the origin. additionally, we are able to show, that the inclusion of geographic features yield an improvement in classification quality. the achieved results are quite promising, but there is still some room for improvement. first of all, the the sequence length used for this work might still be too long for real world use cases. therefore, shorter sequences should be tried. additionally, interpolation for creating data with the same time delta between two samples or some kind of embedding or alignment layer might yield better results. as there are many sources for additional domain related information, further research in the integration of these sources is necessary. 
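to make the resnet-style building block evaluated above concrete, the following sketch shows a shallow residual classifier for the sequence data described in this paper; the 360 time steps, 9 input features and 5 shiptype classes follow the text, while the filter counts, kernel sizes and the use of two blocks are assumptions rather than the authors' exact architecture.

```python
from tensorflow import keras

def residual_block(x, filters=64, kernel_size=8, batch_norm=True):
    """One 1D residual block with a skip connection; sizes are assumptions."""
    shortcut = keras.layers.Conv1D(filters, 1, padding="same")(x)   # match channel count
    y = keras.layers.Conv1D(filters, kernel_size, padding="same")(x)
    if batch_norm:
        y = keras.layers.BatchNormalization()(y)
    y = keras.layers.ReLU()(y)
    y = keras.layers.Conv1D(filters, kernel_size, padding="same")(y)
    if batch_norm:
        y = keras.layers.BatchNormalization()(y)
    y = keras.layers.Add()([shortcut, y])                           # skip connection
    return keras.layers.ReLU()(y)

def build_shallow_resnet(seq_len=360, n_features=9, n_classes=5):
    """Sketch of a shallow resnet-style shiptype classifier."""
    inp = keras.layers.Input(shape=(seq_len, n_features))
    x = keras.layers.GaussianNoise(0.01)(inp)        # input-noise regularization
    x = residual_block(x)
    x = residual_block(x, filters=128)
    x = keras.layers.GlobalAveragePooling1D()(x)
    out = keras.layers.Dense(n_classes, activation="softmax")(x)
    model = keras.Model(inp, out)
    model.compile(optimizer=keras.optimizers.Adam(0.002),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_shallow_resnet()
model.summary()
```

training such a model would then use the schedule described earlier: halving the learning rate after ten stagnant validation epochs and stopping early after fifteen.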
comparison of cnn for the detection of small objects based on the example of components on an assembly

many tasks which only a few years ago had to be performed by humans can now be performed by robots or will be performed by robots in the near future. nevertheless, there are some tasks in assembly processes which cannot be automated in the next few years. this applies especially to workpieces that are only produced in very small series or to tasks that require a lot of tact and sensitivity, such as inserting small screws into a thread or assembling small components. in conversations with companies we have found that a big problem for workers is learning new production processes. this is currently done with instructions and by supervisors, which requires a lot of time. this effort can be reduced significantly by modern systems that accompany workers in the learning process. such intelligent systems require not only instructions that describe the target state and the individual work steps that lead to it, but also information about the current state at the assembly workstation. one way to obtain this information is to install cameras above the assembly workstation and use image recognition to determine where an object is located at any given moment. the individual parts, often very small compared to the work surface, must be reliably detected. we have trained and tested several deep neural networks for this purpose. we have developed an assembly workstation where work instructions can be projected directly onto the work surface using a projector. at a distance, 21 containers for components are arranged in three rows, slightly offset to the rear, one above the other. these containers can also be illuminated by the projector, so that a very flexible pick-by-light system can be implemented. in order for the underlying system to switch automatically to the next work step and, in the event of errors, to point them out and provide support in correcting them, it is helpful to be able to identify the individual components on the work surface. we use a realsense depth camera for this purpose, of which we currently only use the colour image. the camera is mounted in a central position at a height of about two meters above the work surface, so the camera image includes the complete working surface, the 21 containers, and a small area next to the working surface. the objects to be detected are components of a kit for the construction of various toy cars. the kit contains 25 components in total. some of the components vary considerably from each other, but others are very similar to each other. since the same holds for real components in production, the choice of the kit seemed appropriate for the purposes of this project. object detection, one of the most fundamental and challenging problems in computer vision, seeks to locate object instances from a large number of predefined categories in natural images. until the early 2000s, a similar approach was used in most object detection systems: keypoints in one or more images of a category were searched for automatically, and a feature vector was generated at each of these points. during recognition, keypoints in the image were detected again, the corresponding feature vectors were generated and compared with the stored feature vectors, and once a certain threshold was exceeded an object was assigned to the category.
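the classical keypoint-and-descriptor pipeline described above can be illustrated with a generic opencv sketch. it is not the method used in any of the cited works; the orb detector, the file names, and the thresholds are assumptions.

    import cv2

    # load a reference image of a component and a scene image (hypothetical file names)
    reference = cv2.imread("component_reference.png", cv2.IMREAD_GRAYSCALE)
    scene = cv2.imread("workstation_scene.png", cv2.IMREAD_GRAYSCALE)

    # detect keypoints and compute descriptors in both images
    orb = cv2.ORB_create()
    kp_ref, des_ref = orb.detectAndCompute(reference, None)
    kp_scene, des_scene = orb.detectAndCompute(scene, None)

    # match descriptors and keep only sufficiently good matches
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_ref, des_scene), key=lambda m: m.distance)
    good = [m for m in matches if m.distance < 50]

    # a simple decision rule: assign the category if enough keypoints match
    if len(good) > 20:
        print("component recognized")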
one of the first approaches based on machine learning was published by viola and jones in 2001 [1]. they still selected features by hand, in their case using a haar basis function [2], and then applied a variant of adaboost [3]. starting in 2012 with the publication of alexnet by krizhevsky et al. [4], deep neural networks became more and more the standard in object detection tasks. they used a convolutional neural network with 60 million parameters in five convolutional layers, some of them followed by max-pooling layers, three fully connected layers, and a final softmax layer. they won the imagenet lsvrc-2012 competition with an error rate almost half that of the second best. inception-v2 is mostly identical to inception-v3 by szegedy et al. [5] and is based on inception-v1 [6]. all inception architectures are composed of dense modules: instead of stacking convolutional layers directly, they stack modules or blocks within which the convolutional layers sit. for inception-v2 the architecture of inception-v1 was redesigned to avoid representational bottlenecks and to compute more efficiently by using factorisation methods; they were also the first to use batch normalisation in object detection tasks. in previous architectures the most significant difference had been the increasing number of layers, but as network depth increases, accuracy saturates and then degrades rapidly. he et al. [7] addressed this problem with resnet, using skip connections while building deeper models. in 2017 howard et al. presented the mobilenet architecture [8]. mobilenet was developed for efficient operation on mobile devices with less computational power and is very fast; it uses depthwise convolutional layers for an extremely efficient network architecture. one year later sandler et al. [9] published a second version of mobilenet. besides some minor adjustments, a bottleneck was added in the convolutional layers, which further reduced their dimensions and thus achieved a further increase in speed. in addition to the network architectures presented so far, there are also different methods to determine in which area of the image an object is located. the two most frequently used are described briefly below. to bypass the problem of evaluating a huge number of candidate regions, girshick et al. [10] proposed a method that uses selective search to extract only about 2,000 region proposals from the image, which are then classified using the features of a base cnn. liu et al. [11] introduced the single shot multibox detector (ssd). they added extra feature layers behind the base model for the detection of default boxes at different scales and aspect ratios. at prediction time, the network generates scores for the presence of each object category in each default box and then produces adjustments to the box to better match the object shape. there is just one publication from the past few years that gives a survey of generic object detection methods: liu et al. [12] compared 18 common object detection architectures. there are many other comparisons for specific object detection tasks, for example pedestrian detection [13], face detection [14], and text detection [15].
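the default boxes that ssd places at different scales and aspect ratios, as described above, can be generated with a few lines of code. this is a simplified sketch of the idea rather than the exact box layout of the ssd models compared later; the scales and aspect ratios are assumptions.

    import itertools
    import numpy as np

    def default_boxes(feature_map_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
        """generate (cx, cy, w, h) default boxes, normalized to [0, 1],
        for a square feature map of the given size."""
        boxes = []
        for i, j in itertools.product(range(feature_map_size), repeat=2):
            cx = (j + 0.5) / feature_map_size
            cy = (i + 0.5) / feature_map_size
            for ar in aspect_ratios:
                boxes.append([cx, cy, scale * np.sqrt(ar), scale / np.sqrt(ar)])
        return np.array(boxes)

    # coarser feature maps get larger boxes; e.g. a 10x10 map with scale 0.4
    boxes = default_boxes(feature_map_size=10, scale=0.4)
    print(boxes.shape)   # (300, 4): 10 * 10 positions * 3 aspect ratios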
the project is based on the methodology of supervised learning, whereby the models are trained using a training dataset consisting of many samples. each sample within the training dataset is tagged with a so-called label (also called an annotation), which provides the model with information about the desired output for this sample. during training, the output generated by the model is compared to the desired output (the labels) and the error is determined. this error on the one hand gives information about the current performance of the model and on the other hand is used for further computations to adjust the model's parameters so that the model's performance improves. for the training of neural networks in the field of computer vision the following rule of thumb applies: the larger and more diverse the training dataset, the higher the accuracy that can be achieved by the trained model. too little data and/or too many passes through the model can lead to so-called overfitting. overfitting means that instead of learning an abstract concept that can be applied to a variety of data, the model essentially memorizes the individual samples [16, 17]. training neural networks for the purposes of this project from scratch could easily require more than 100,000 different images, depending on the accuracy that the model should finally achieve. however, the methodology of so-called transfer learning offers the possibility to transfer the results of neural networks that have already been trained for a specific task, completely or partially, to a new task and thus to save time and resources [18]. for this reason, we applied transfer learning methods within the project. the training dataset was created manually: a tripod, a mobile phone camera (10 megapixel, 3104 x 3104) and an apeman action cam (20 megapixel, 5120 x 3840) were used to take 97 images for each of the 25 classes. this corresponds to 2,425 images in total (actually 100 images were taken per class, but only 97 were suitable for use as training data). all images were documented and sorted into close-ups (distance between camera and object less than or equal to 30 cm) and standards (distance between camera and object more than 30 cm). this procedure ensures the traceability and controllability of the dataset. in total, the training dataset contains approx. 25% close-ups and approx. 75% standards, each taken on different backgrounds and under different lighting conditions (see fig. 2). the labelimg tool was used for labelling the data. with the help of this tool, bounding boxes, whose coordinates are stored in either yolo or pascal voc format, can be marked in the images [19]. for the training of the neural networks the created dataset was finally divided into:
- training data (90% of all labelled images): images that are used for the training of the models and that pass through the models multiple times during training.
- test data (10% of all labelled images): images that are used for later testing or validation of the training results. in contrast to the images used as training data, the model is presented with these images for the first time after training. the goal of this approach, which is common in deep learning, is to see how well the neural network recognizes objects in images it has never seen before, which makes it possible to assess the accuracy and to identify any further training needs.
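the annotation and splitting steps described above can be scripted in a few lines. the sketch below assumes pascal voc xml files produced by labelimg in a hypothetical directory layout; it is not the tooling used in the project.

    import random
    import xml.etree.ElementTree as ET
    from pathlib import Path

    def read_voc_boxes(xml_path):
        """read class names and bounding boxes from a pascal voc annotation file."""
        root = ET.parse(xml_path).getroot()
        boxes = []
        for obj in root.findall("object"):
            name = obj.findtext("name")
            bb = obj.find("bndbox")
            box = [int(bb.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
            boxes.append((name, box))
        return boxes

    # split the annotated images into 90% training and 10% test data
    annotations = sorted(Path("dataset/annotations").glob("*.xml"))
    random.seed(42)
    random.shuffle(annotations)
    cut = int(0.9 * len(annotations))
    train_files, test_files = annotations[:cut], annotations[cut:]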
the training of deep neural networks is very demanding on resources due to the large number of computations; it is therefore essential to use hardware with adequate performance. since the computations that run for each node in the graph can be highly parallelized, the use of a powerful graphics processing unit (gpu) is particularly suitable. a gpu with its several hundred computing cores has a clear advantage over a current cpu with four to eight cores when processing parallel computing tasks [20]. these are the key parameters of the project computer in use:
- operating system (os): ubuntu 18.04.2 lts
- gpu: geforce gtx 1080 ti (11 gb gddr5x memory, data transfer speed 11 gbit/s)
for the intended comparison the tensorflow object detection api was used. the tensorflow object detection api is an open-source framework based on tensorflow which, among other things, provides implementations of pre-trained object detection models for transfer learning [21, 22]. the api was chosen because of its good and easy-to-understand documentation and its variety of pre-trained object detection models. for the comparison the following models were selected:
- ssd mobilenet v1 coco [11, 23, 24]
- ssd mobilenet v2 coco [11, 25, 26]
- faster rcnn inception v2 coco [27-29]
- rfcn resnet101 coco [30-32]
to ensure comparability of the networks, all of the selected pre-trained models were trained on the coco dataset [33]. fundamentally, the algorithms based on cnn models can be grouped into two main categories: region-based algorithms and one-stage algorithms [34]. while both ssd models can be categorized as one-stage algorithms, faster r-cnn and r-fcn fall into the category of region-based algorithms. one-stage algorithms predict both the bounding boxes and the class of the contained objects simultaneously. they are generally considered extremely fast, but are known for their trade-off between accuracy and real-time processing speed. region-based algorithms consist of two parts: a region proposal method and a classifier. instead of splitting the image into many small areas and then working with a large number of areas, as a conventional cnn would, a region-based algorithm first proposes a set of regions of interest (roi) in the image and checks whether one of these regions contains an object. if an object is contained, the classifier classifies it [34]. region-based algorithms are generally considered accurate, but also slow. since, according to our requirements, both accuracy and speed are important, it seemed reasonable to compare models of both categories. besides the collection of pre-trained models for object detection, the tensorflow object detection api also offers corresponding configuration files for the training of each model. since these configurations have already proven successful, they were used as a basis for our own configurations. the configuration files contain information about the training parameters, such as the number of steps to be performed during training, the image resizer to be used, the number of samples processed as a batch before the model parameters are updated (the batch size), and the number of classes which can be detected. to make the study of the different networks as comparable as possible, the training of all networks was configured in such a way that the batch size was kept as small as possible. since the configurations of some models did not allow batch sizes larger than one, while other models did not allow batch sizes smaller than two, no common value for all models could be defined for this parameter. during training, each of the training images should pass through the net 200 times (corresponding to 200 epochs); the number of steps was therefore adjusted accordingly, depending on the batch size.
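because the batch size differs between the models while the number of epochs is fixed, the step count in each configuration has to be scaled accordingly. the following helper makes this arithmetic explicit; the derived image count is an approximation based on the 90% training split described earlier.

    import math

    def training_steps(num_images, epochs, batch_size):
        """number of optimizer steps so that each image is seen `epochs` times."""
        steps_per_epoch = math.ceil(num_images / batch_size)
        return epochs * steps_per_epoch

    num_train_images = int(0.9 * 2425)   # 90% training split of the 2,425 labelled images
    for batch_size in (1, 2):            # the configurations used batch sizes of one or two
        print(batch_size, training_steps(num_train_images, epochs=200, batch_size=batch_size))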
if a fixed shape resizer was used in the base configuration, two different resizing dimensions (default: 300x300 pixels and custom: 512x512 pixels) were selected for the training. table 1 gives an overview of the training configurations used for the different models. in this section we first look at the training, before focusing on the quality of the results and the speed of the selected convolutional neural networks. when evaluating the training results, we first considered the time that the neural networks require for 200 epochs (see fig. 3). it becomes clear that the two region-based object detectors (faster r-cnn inception v2 and rfcn resnet101) took significantly longer than the single-shot object detectors (ssd mobilenet v1 and ssd mobilenet v2). in addition, the single-shot object detectors clearly show that the size of the input data also has a decisive effect on the training duration: while ssd mobilenet v2 with an input size of 300x300 pixels took the shortest time for the training, at 9 hours 41 minutes and 47 seconds, the same network with an input size of 512x512 pixels took almost three hours more, yet still remained far below the time required by rfcn resnet101 for 200 epochs of training. the next point on which we compared the different networks was accuracy (see fig. 4). we wanted to see which of the nets were correct in their detections and how often (absolute values), and also what proportion of the total detections was correct (relative values). the latter seemed sensible to us especially because some of the nets produced more than three detections for a single object. with more than one detection per object, the probability that the correct classification is among them is of course higher than if only one detection per object is made. with regard to the later use at the assembly table, however, it does not help us if the neural net provides several possible interpretations for the classification of a component.
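the two accuracy views used in this comparison, the absolute number of correct detections and their share of all detections, can be computed as in the following sketch. the detection format and class names are hypothetical; this is not the evaluation script used in the study.

    def detection_accuracy(detections, ground_truth):
        """detections: list of (image_id, predicted_class);
        ground_truth: dict mapping image_id to the true class.
        returns the absolute number of correct detections and their share of all detections."""
        correct = sum(1 for image_id, cls in detections if ground_truth.get(image_id) == cls)
        total = len(detections)
        return correct, correct / total if total else 0.0

    # hypothetical example: one image received two detections for the same object
    detections = [("img_01", "axle"), ("img_01", "wheel"), ("img_02", "wheel")]
    ground_truth = {"img_01": "axle", "img_02": "wheel"}
    print(detection_accuracy(detections, ground_truth))   # (2, 0.666...)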
figure 4 shows that, in this comparison, the two region-based object detectors generally perform significantly better than the single-shot object detectors, both in terms of the correct detections and their share of the total detections. it is also noticeable that for the single-shot object detectors the size of the input data again appears to affect the result. however, there is a clear difference to the previous comparison of training durations: while the training duration increased uniformly with increasing image size for the single-shot object detectors, no such uniform relationship between input size and accuracy can be observed. while ssd mobilenet v2 achieves good results with an input size of 512x512 pixels, ssd mobilenet v1 delivers the worst result of this comparison for the same input size (regarding both the number of correct detections and their share of the total detections). with an input size of 300x300 pixels, however, the result improves for ssd mobilenet v1, while the change to the smaller input size worsens the result for ssd mobilenet v2. the best result of this comparison, judging by the absolute values, was achieved by faster r-cnn inception v2. however, in terms of the proportion of correct detections among the total detections, this region-based object detector is two percentage points behind rfcn resnet101, also a region-based object detector. we were particularly interested in how the neural networks would react to particularly similar, small objects. therefore, we decided to investigate their behavior using three very similar objects as an example; figure 5 shows the components selected for the experiment. for each of these three components we examined how often it was correctly detected and classified by the compared neural networks, and how often the network confused it with which of the similar components. the first and the second component were detected in nearly all cases by both region-based approaches, but the classification by inception-v2 and resnet-101 failed in about one third of the images. the ssd networks detected the object in just one of twenty cases, although mobilenet classified this one correctly. it was surprising that the results for the third component look very different from the others (see fig. 6). ssd mobilenet v1 correctly identified the component in seven of ten images and did not produce any detections that could be interpreted as confusions with one of the similar components. ssd mobilenet v2 did not detect any of the three components, as in the two previous investigations. the results of the two region-based object detectors are rather moderate: faster r-cnn inception v2 detected the correct component in four of ten images, but still produced five misclassifications with the other two components; rfcn resnet101 detected only two of ten images correctly and produced six misclassifications with the similar components. another important aspect of the study is the speed at which the neural networks can detect objects, especially with regard to the later use at the assembly table. for the comparison of the speeds, the figures from the github repository of the tensorflow object detection api for the individual networks were used on the one hand, and the actual speeds of the networks within this project were measured on the other. it becomes clear that the speeds measured in the project are clearly below the achievable speeds mentioned in the github repository. on the other hand, the differences between the speeds of the region-based object detectors and the single-shot object detectors in the project are far less drastic than expected. we have created a training dataset with small, partly very similar components and trained four common deep neural networks on it. in addition to the training times, we examined the accuracy and the recognition time with general evaluation data, and we examined the results for ten images each of three very similar and small components. none of the networks we trained produced suitable results for our scenario. nevertheless, we were able to gain some important insights from the results. at the moment the runtime is not yet suitable for our scenario, but it is also not far from the minimum requirements, so these can likely be reached with smaller optimizations and better hardware.
it was also important to realize that there are no serious runtime differences between the different network architectures. the two region-based approaches delivered significantly better results than the ssd approaches. however, the results of the detection of the third small component suggest that mobilenet in combination with a faster r-cnn could possibly deliver even better results. longer training and training data better adapted to the intended use could also significantly improve the results of the object detectors. team schluckspecht from offenburg university of applied sciences is a very successful participant in the shell eco marathon [1]. in this contest, student groups design and build their own vehicles with the aim of low energy consumption. since 2018 the event has featured an additional autonomous driving contest, in which the vehicle has to fulfill several tasks autonomously, such as driving a course, stopping within a defined parking space, or circumventing obstacles. for the upcoming season, the schluckspecht v car of the so-called urban concept class has to be augmented with the hardware and software needed to reliably recognize (i.e. detect and classify) possible obstacles and to incorporate them into the software framework for further planning. in this contribution we describe the additional hardware and software components that are necessary to enable optical 3d object detection. the main criteria are accuracy, cost effectiveness, computational complexity allowing near real-time performance, and ease of use with regard to incorporation into the existing software framework and possible extensibility. this paper consists of the following sections. first, the schluckspecht v system is described in terms of the hardware and software components for autonomous driving and the additional parts for visual object recognition. the second part examines the object recognition pipeline: the software frameworks, the neural network architecture, and the final data fusion in a global map are described in detail. the contribution closes with an evaluation of the object recognition results and conclusions. the schluckspecht v is a self-designed and self-built vehicle according to the requirements of the eco marathon rules; the vehicle is depicted in figure 1. its main features are its relatively large size, including a driver cabin, a motor area and a large trunk, a fully equipped lighting system, and two doors that can be opened separately. for the autonomous driving challenges, the vehicle is additionally equipped with several essential parts. the hardware consists of actuators, sensors, computational hardware and communication controllers. the software is based on a middleware, can-open communication layers, and localization, mapping and path planning algorithms that are embedded into a high-level state machine. actuators: the car is equipped with two actuators, one for steering and one for braking. each actuator is paired with sensors for measuring the steering angle and the braking pressure. environmental sensors: several sensors are needed for localization and mapping. the backbone is a multilayer 3d laser scanning system (lidar), which is combined with an inertial navigation system consisting of accelerometers, gyroscopes and magnetic field sensors, all realized as triads. odometry information is provided by a global navigation satellite system (gnss) and two wheel encoders.
the communication is based on two separate can bus systems, one for basic operations and an additional one for the autonomous functions. the hardware can nodes are designed and built by the team and couple usb, i2c, spi and can-open interfaces. messages are sent from the central processing unit or by the driver, depending on the drive mode. the trunk of the car is equipped with an industrial-grade high-performance cpu and an additional graphics processing unit (gpu). can communication is provided by an internal card, and remote access is possible via generic wireless components. software structure: the schluckspecht uses a modular software system consisting of several basic modules that are activated and combined within a high-level state machine as needed. an overview of the main modules and possible sensors and actuators is given in figure 2. localization and mapping: the schluckspecht v runs a simultaneous localization and mapping (slam) framework for navigation, mission planning and environment representation. in its current version we use a graph-based slam approach built upon the cartographer framework developed by google [2]. we calculate a dynamic occupancy grid map that can be used for further planning. sensor data is provided by the lidar, inertial navigation and odometry systems. an example of a drivable map is shown in figure 3. this kind of map is also used as the base for the localization and placement of the detected obstacles. the maps are accurate to roughly 20 centimeters, providing relative localization towards obstacles or homing regions. path planning: to make use of the slam-created maps, an additional module calculates the motion commands from the start to the target pose of the car. the schluckspecht is a classical car-like mobile system, which means that the path planning must take into account the non-holonomic kind of permitted movement. parking maneuvers, driving close to obstacles, or planning a trajectory between given points are realized as a combination of local control commands based upon modeled vehicle dynamics, the so-called local planner, and optimization algorithms that find the globally most cost-efficient path given a cost function, the so-called global planner. we employ a kinodynamic strategy, the elastic band method presented in [3], for the local planning. global planning is realized with a variant of the a* algorithm as described in [4]. middleware and communication: all submodules, namely localization, mapping, path planning and the high-level state machines for each competition, are implemented within the robot operating system (ros) middleware [5]. ros provides a messaging system based upon the subscriber/publisher principle. each module is encapsulated in a process, called a node, capable of asynchronously exchanging messages as needed. due to its open-source character and an abundance of drivers and helper functions, ros provides additional features like hardware abstraction, device drivers, visualization and data storage. data structures for mobile robotic systems, e.g. static and dynamic maps or velocity control messages, allow for rapid development.
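the subscriber/publisher structure described above can be illustrated with a minimal rospy node. this is a generic sketch rather than code from the schluckspecht stack; the topic names and message types are assumptions.

    #!/usr/bin/env python
    import rospy
    from sensor_msgs.msg import Image
    from geometry_msgs.msg import PointStamped

    class DetectionNode:
        """subscribe to camera images, publish detected obstacle positions."""
        def __init__(self):
            self.pub = rospy.Publisher("/obstacles/position", PointStamped, queue_size=10)
            rospy.Subscriber("/zed/left/image_rect_color", Image, self.on_image)

        def on_image(self, msg):
            # a real node would run the detector here; publish one point per detected object
            point = PointStamped()
            point.header = msg.header
            self.pub.publish(point)

    if __name__ == "__main__":
        rospy.init_node("object_detection")
        DetectionNode()
        rospy.spin()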
the lidar sensor system has only four layers, which allows only walls and track delimiters to be incorporated into a map. therefore, a stereo camera system is additionally implemented to allow the detection of persons, other cars, traffic signs or visual parking space delimiters and to simultaneously measure the distance of environmental objects. camera hardware: a zed stereo camera system is installed on the car and incorporated into the ros framework. the system provides a color image stream for each camera and a depth map from stereo vision. the camera images are calibrated to each other and to the depth information. the disparity estimation runs at around 50 frames per second, making use of the provided gpu. the object recognition relies on deep neural networks. to work seamlessly with the other software parts and for easy integration, the networks are evaluated with the tensorflow [6] and pytorch [7] frameworks. both are connected to ros via the opencv image formats, providing ros nodes and topics for visualization and further processing. the object recognition pipeline relies on a combination of mono camera images and calibrated depth information to determine the object and its position. the core algorithm is a deep learning approach with convolutional neural networks. the main contribution of this paper is the incorporation of a deep neural network object detector into our framework. object detection with deep neural networks can be subdivided into two approaches: two-step approaches, where regions of interest are identified in a first step and classified in a second one, and so-called single-shot detectors (like [8]) that extract and classify the objects in one network run. therefore, two network architectures are evaluated, namely yolov3 [9] as a single-shot approach and faster r-cnn [10] as a two-step model. both are trained on public datasets and fine-tuned to our setting by incorporating training images from the schluckspecht v in the zed image format. the models were pre-selected for their real-time capability in combination with the expected classification performance. this excludes the currently best instance segmentation network, mask r-cnn [11], due to its computational burden, as well as fast but less accurate networks based on the mobilenet backbone [12]. the class count is adapted for the contest, in the given case eight classes, including the relevant classes pedestrian, car, van, tram and cyclist. for this paper, the two chosen network architectures were trained in their respective frameworks, i.e. darknet for the yolov3 detector and tensorflow for the faster r-cnn detector. yolov3 is used in its standard form with the darknet-53 backbone; faster r-cnn is designed with the resnet-101 [13] backbone. the models were trained on local hardware with the kitti [14] dataset. alternatively, an open-source dataset from the teaching company udacity, with only three classes (truck, car, pedestrian), was tested. to deal with the problem of domain adaptation, the training images for yolov3 were pre-processed to fit the aspect ratio of the zed camera; the faster r-cnn net can cope with ratio variations as it uses a two-stage approach for detection based on region-of-interest pooling. both networks were trained and stored; afterwards they are incorporated into the system via a ros node making use of standard python libraries. the detector output is represented by several labeled bounding boxes within the 2d image. three-dimensional information is extracted from the associated depth map by calculating the center of each box to get the x and y coordinates within the image; interpolating the depth map pixels accordingly yields the distance coordinate z, which determines the object position p(x, y, z) in the stereo camera coordinate system.
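the step from a labeled 2d bounding box and the depth map to a 3d position, and its projection into the map frame via the transform tree discussed next, can be sketched as follows. the camera intrinsics, frame names, and the median over a small depth patch are assumptions; the text only states that the depth map is interpolated at the box center.

    import numpy as np
    import rospy
    import tf2_ros
    import tf2_geometry_msgs                     # registers PointStamped support for tf2
    from geometry_msgs.msg import PointStamped

    def box_to_camera_point(box, depth_map, fx, fy, cx, cy, patch=5):
        """convert a bounding box (x_min, y_min, x_max, y_max) and a metric depth map
        into a 3d point in the camera coordinate system (pinhole model)."""
        x_min, y_min, x_max, y_max = box
        u = int((x_min + x_max) / 2)             # box center in pixel coordinates
        v = int((y_min + y_max) / 2)
        half = patch // 2                        # median over a small patch is more robust
        region = depth_map[v - half:v + half + 1, u - half:u + half + 1]
        z = float(np.nanmedian(region))
        return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

    def camera_point_to_map(tf_buffer, xyz, camera_frame="zed_camera_frame"):
        """transform a camera-frame point into the map frame using the tf-tree."""
        p = PointStamped()
        p.header.frame_id = camera_frame         # assumed frame name
        p.header.stamp = rospy.Time(0)           # use the latest available transform
        p.point.x, p.point.y, p.point.z = xyz
        return tf_buffer.transform(p, "map", timeout=rospy.Duration(1.0))

    if __name__ == "__main__":
        rospy.init_node("obstacle_projection")
        tf_buffer = tf2_ros.Buffer()
        tf2_ros.TransformListener(tf_buffer)
        depth = np.full((720, 1280), 8.0)        # placeholder depth map, 8 m everywhere
        point = box_to_camera_point((600, 300, 700, 420), depth,
                                    fx=700.0, fy=700.0, cx=640.0, cy=360.0)
        print(camera_point_to_map(tf_buffer, point))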
the ease of projection between different coordinate systems is one reason to use the ros middleware. the complete vehicle is modeled in a so-called transform tree (tf-tree) that allows direct transformation between different coordinate systems in all six spatial degrees of freedom. the dynamic map, created in the slam subsystem, is then augmented with the current obstacles in the car coordinate system. the local path planner can take these into account and plan a trajectory under kinodynamic constraints to prevent a collision or to initiate a braking maneuver. both newly trained networks were first evaluated on the training data; exemplary results for the kitti dataset are shown in figure 4. the results clearly indicate an advantage for the yolov3 system, both in speed and accuracy. the figure shows good results for occlusions (e.g. the car on the upper right) and for high object counts (see the black car on the lower left as an example). the evaluation on a desktop system showed 50 fps for yolov3 and approximately 10 fps for faster r-cnn. after validating the performance on the training data, both networks were started as ros nodes and tested on real data from the schluckspecht vehicle. as the training data differs from the zed camera images in format and resolution, several adaptations were necessary for the yolov3 detector: the images are cropped in real time before being presented to the neural net, in order to emulate the format of the training images. the r-cnn-like two-stage networks are directly connected to the zed node. the test data is not labeled with ground truth, so it is not possible to give quantitative results for the recognition task. table 1 gives a qualitative overview of the object detection and classification, and the subsequent figures give an impression of exemplary results. the evaluation on the schluckspecht videos showed an advantage for the yolov3 network. the main reason is the faster computation, which results in a frame rate nearly twice as high as that of the two-stage detectors. in addition, the recognition of objects in the distance, i.e. smaller objects, is a strong point of yolo. the closer the camera gets, the more the balance shifts towards faster r-cnn, which outperforms yolo in all categories for larger objects. what becomes apparent is a maximum detection distance of approximately 30 meters, beyond which cars become too small. figure 6 shows an additional result demonstrating the detection of partially obstructed objects. another interesting finding was the capability of the networks to generalize. faster r-cnn copes much better with new object instances than yolov3. persons with previously unseen clothing colors or darker areas with vehicles remain a problem for yolo, but commonly not for the r-cnn. the domain transfer from the berkeley and kitti training data to real zed vehicle images proved problematic. this contribution describes an optical object recognition system in hardware and software for autonomous driving under restricted conditions, within the shell eco marathon competition. an overall overview of the system and of the incorporation of the detector within the framework is given. the main focus was the evaluation and implementation of several neural network detectors, namely yolov3 as a one-shot detector and faster r-cnn as a two-step detector, and their combination with distance information to obtain three-dimensional positions for the detected objects. for the given application, the advantage clearly lies with yolov3.
especially the achievable frame rate of at least 10 hz allows seamless integration into the localization and mapping framework. given the vehicle velocities and the map update rate, the object recognition and its integration via sensor fusion for path planning and navigation work in quasi real-time. for future applications we plan to further increase the detection quality by incorporating new classes and modern object detector frameworks like m2det [15]. this will additionally increase the frame rate and the bounding box quality. for more complex tasks, the data of the 3d lidar system shall be directly incorporated into the fusion framework to enhance the perception of object boundaries and object velocities.

references
- a few useful things to know about machine learning
- feature engineering for machine learning
- an empirical analysis of feature engineering for predictive modeling
- input selection for fast feature engineering
- random forests
- support vector regression machines
- strong consistency of least squares estimates in multiple regression ii
- business data science: combining machine learning and economics to optimize, automate, and accelerate business decisions
- global product classification (gpc)
- a study of cross-validation and bootstrap for accuracy estimation and model selection
- automatic liver and tumor segmentation of ct and mri volumes using cascaded fully convolutional neural networks
- convolutional networks for biomedical image segmentation
- v-net: fully convolutional neural networks for volumetric medical image segmentation
- self-supervised learning for pore detection in ct-scans of cast aluminum parts
- generating meaningful synthetic ground truth for pore detection in cast aluminum parts
- nema ps3 / iso 12052, digital imaging and communications in medicine (dicom) standard, national electrical manufacturers association
- ct-realistic lung nodule simulation from 3d conditional generative adversarial networks for robust lung segmentation
- deep learning hardware: past, present, and future
- a survey on specialised hardware for machine learning
- a survey on distributed machine learning
- hardware for machine learning: challenges and opportunities
- 3d u-net: learning dense volumetric segmentation from sparse annotation
- z-net: an anisotropic 3d dcnn for medical ct volume segmentation
- activation functions: comparison of trends in practice and research for deep learning
- fast and accurate deep network learning by exponential linear units (elus)
- delving deep into rectifiers: surpassing human-level performance on imagenet classification
- toward deeper understanding of neural networks: the power of initialization and a dual view on expressivity
- adam: a method for stochastic optimization
- diffgrad: an optimization method for convolutional neural networks
- tversky loss function for image segmentation using 3d fully convolutional deep networks
- a low-power multi physiological monitoring processor for stress detection. ieee sensors
- using heart rate monitors to detect mental stress
- positive technology: a free mobile platform for the self-management of psychological stress
- exploring the effectiveness of a computer-based heart rate variability biofeedback program in reducing anxiety in college students
- psychological stress and incidence of atrial fibrillation
- continuously updated, computationally efficient stress recognition framework using electroencephalogram (eeg) by applying online multitask learning algorithms (omtl)
- ten years of research with the trier social stress test
- trapezius muscle emg as predictor of mental stress
- poptherapy: coping with stress through pop-culture
- du-md: an open-source human action dataset for ubiquitous wearable sensors
- stress recognition using wearable sensors and mobile phones
- introducing wesad, a multimodal dataset for wearable stress and affect detection
- feasibility and usability aspects of continuous remote monitoring of health status in palliative cancer patients using wearables
- detection of diseases based on electrocardiography and electroencephalography signals embedded in different devices: an exploratory study
- stress effects. the american institute of stress
- der smarte assistent
- can: creative adversarial networks, generating "art" by learning about styles and deviating from style norms
- creative ai: on the democratisation & escalation of creativity
- generative design: a paradigm for design research
- eigenfaces for recognition
- unsupervised representation learning with deep convolutional generative adversarial networks
- large scale gan training for high fidelity natural image synthesis
- interpreting the latent space of gans for semantic face editing
- visualizing and understanding generative adversarial networks
- mistaken identity
- spectral normalization for generative adversarial networks
- beauty is in the ease of the beholding: a neurophysiological test of the averageness theory of facial attractiveness
- unpaired image-to-image translation using cycle-consistent adversarial networks
- colorization for anime sketches with cycle-consistent adversarial network
- artificial muse
- using evolutionary design to interactively sketch car silhouettes and stimulate designer's creativity
- the chair project - four classics
- deepwear: a case study of collaborative design between human and artificial intelligence
- grass: generative recursive autoencoders for shape structures
- co-designing object shapes with artificial intelligence
- systematic review of the empirical evidence of study publication bias and outcome reporting bias
- ki-kunst und urheberrecht - die maschine als schöpferin? public law research paper no. 692; u of maryland legal studies research paper no.
- inceptionism: going deeper into neural networks
- proactive error prevention in manufacturing based on an adaptable machine learning environment. artificial intelligence: from research to application: the upper-rhine artificial intelligence symposium ur-ai
- the benefits of pdca
- crisp-dm 1.0: step-by-step data mining guide
- interpretable machine learning for quality engineering in manufacturing - importance measures that reveal insights on errors
- regulation (eu) 2017/745 of the european parliament and of the council of 5 april 2017 on medical devices - medical device regulation (mdr)
- use of real-world evidence to support regulatory decision-making for medical devices. guidance for industry and food and drug administration staff
- high-performance medicine: the convergence of human and artificial intelligence
- artificial intelligence powers digital medicine
- dermatologist-level classification of skin cancer with deep neural networks
- chestx-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
- an attention based deep learning model of clinical events in the intensive care unit
- the artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care
- the european commission's high-level expert group on artificial intelligence: ethics guidelines for trustworthy ai
- key challenges for delivering clinical impact with artificial intelligence
- ibm's watson supercomputer recommended 'unsafe and incorrect' cancer treatments, internal documents show
- towards international standards for the evaluation of artificial intelligence for health
- proposed regulatory framework for modifications to artificial intelligence/machine learning (ai/ml)-based software as a medical device (samd). artificial-intelligence-and-machine-learning-discussion-paper.pdf
- international medical device regulators forum (imdrf) - samd working group
- medical device software - software life-cycle processes
- general principles of software validation. final guidance for industry and fda staff
- deciding when to submit a 510(k) for a change to an existing device. guidance for industry and food and drug administration staff
- software as a medical device (samd): clinical evaluation. guidance for industry and food and drug administration staff
- international electrotechnical commission. iec 62366-1:2015 - part 1: application of usability engineering to medical devices
- why rankings of biomedical image analysis competitions should be interpreted with care
- what do we need to build explainable ai systems for the medical domain
- regulation (eu) 2016/679 of the european parliament and of the council of 27 april 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data (general data protection regulation - gdpr)
- artificial intelligence in healthcare: a critical analysis of the legal and ethical implications
- explainable artificial intelligence: understanding, visualizing and interpreting deep learning models
- association between race/ethnicity and survival of melanoma patients in the united states over 3 decades
- docket for feedback - proposed regulatory framework for modifications to artificial intelligence/machine learning (ai/ml)-based software as a medical device (samd)
- openpose: realtime multi-person 2d pose estimation using part affinity fields
- a density-based algorithm for discovering clusters in large spatial databases with noise
- fast volumetric auto-segmentation of head ct images in emergency situations for ventricular punctures
- a system for augmented reality guided ventricular puncture using a hololens: design, implementation and initial evaluation
- op sense - a robotic research platform for telemanipulated and automatic computer assisted surgery
- yolov3: an incremental improvement. arxiv
- deep learning based 3d pose estimation of surgical tools using a rgb-d camera at the example of a catheter for ventricular puncture
- fast point feature histograms (fpfh) for 3d registration
- joint probabilistic people detection in overlapping depth images
- towards end-to-end 3d human avatar shape reconstruction from 4d data
- scene-adaptive optimization scheme for depth sensor networks
- a taxonomy and evaluation of dense two-frame stereo correspondence algorithms
- advances in computational stereo
- a comparative analysis of cross-correlation matching algorithms using a pyramidal resolution approach
- fast approximate energy minimization via graph cuts
- stereo processing by semiglobal matching and mutual information
- guided stereo matching
- a large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation
- pyramid stereo matching network
- microstructure-sensitive design of a compliant beam
- microstructure sensitive design of an orthotropic plate subjected to tensile load
- microstructure sensitive design for performance optimization
- on the design, analysis, and characterization of materials using computational neural networks
- texture optimization of rolled aluminum alloy sheets using a genetic algorithm
- finite mixture models
- a tutorial on hidden markov models and selected applications in speech recognition
- information processing in dynamical systems: foundations of harmony theory
- generative adversarial nets
- building texture evolution networks for deformation processing of polycrystalline fcc metals using spectral approaches: applications to process design for targeted performance
- linear solution scheme for microstructure design with process constraints
- matcalo: knowledge-enabled machine learning in materials science
- differential evolution - a simple and efficient adaptive scheme for global optimization over continuous spaces
- reinforcement learning: an introduction
- hindsight experience replay
- industrieroboter für kmu. flexible und intuitive prozessbeschreibung
- toward efficient robot teach-in and semantic process descriptions for small lot sizes
- survey on human-robot collaboration in industrial settings: safety, intuitive interfaces and applications
- concept and architecture for programming industrial robots using augmented reality with mobile devices like microsoft hololens
- robot programming using augmented reality
- robot path and end-effector orientation planning using augmented reality
- spatial programming for industrial robots based on gestures and augmented reality
- spatial programming for industrial robots through task demonstration
- augmented reality based teaching pendant for industrial robot
- intuitive robot tasks with augmented reality and virtual obstacles
- development of a mixed reality based interface for human roboter interaction
- a hands-free virtual-reality teleoperation interface for wizard-of-oz control
- mixed reality as a tool supporting programming of the robot
- communicating robot arm motion intent through mixed reality head-mounted displays
- intuitive industrial robot programming through incremental multimodal language and augmented reality
- development of mixed reality robot control system based on hololens
- interactive spatial augmented reality in collaborative robot programming: user experience evaluation
- comparison of multimodal heading and pointing gestures for co-located mixed reality human-robot interaction
- robot programming through augmented trajectories in augmented reality
- interactive robot programming using mixed reality
- a taxonomy of mixed reality visual displays
- experimental packages for kuka manipulators within ros-industrial
- siemens: ros-sharp
- a questionnaire for the evaluation of physical assistive devices (quead)
- unctad: review of maritime transport 2019 (2019), last accessed 2019-11-19
- international chamber of shipping
- report of the second meeting of the regional working group on illegal, unreported and unregulated (iuu) fishing
- automatic identification system (ais): data reliability and human error implications
- maritime anomaly detection: a review
- trajectorynet: an embedded gps trajectory representation for point-based classification using recurrent neural networks
- partition-wise recurrent neural networks for point-based ais trajectory classification
- a multi-task deep learning architecture for maritime surveillance using ais data streams
- identifying fishing activities from ais data with conditional random fields
- a segmental hmm based trajectory classification using genetic algorithm
- wissensbasierte probabilistische modellierung für die situationsanalyse am beispiel der maritimen überwachung
- detecting illegal diving and other suspicious activities in the north sea: tale of a successful trial
- source quality handling in fusion systems: a bayesian perspective
- tensorflow: large-scale machine learning on heterogeneous systems
- deep learning for time series classification: a review
- rapid object detection using a boosted cascade of simple features
- general framework for object detection
- a decision-theoretic generalization of on-line learning and an application to boosting
- imagenet classification with deep convolutional neural networks
- rethinking the inception architecture for computer vision
- going deeper with convolutions
- deep residual learning for image recognition
- mobilenets: efficient convolutional neural networks for mobile vision applications
- mobilenetv2: inverted residuals and linear bottlenecks
- rich feature hierarchies for accurate object detection and semantic segmentation
- ssd: single shot multibox detector
- deep learning for generic object detection: a survey
- pedestrian detection: an evaluation of the state of the art
- a survey on face detection in the wild: past, present and future
- text detection and recognition in imagery: a survey
- information visualizations used to avoid the problem of overfitting in supervised machine learning
- data science for business: what you need to know about data mining and data-analytic thinking
- automatic object detection from digital images by deep learning with transfer learning
- gpu asynchronous stochastic gradient descent to speed up neural network training
- tensorflow: tensorflow object detection api: ssd mobilenet v2 coco
- faster r-cnn: towards real-time object detection with region proposal networks
- tensorflow object detection api: faster rcnn inception v2 coco. online
- tensorflow: tensorflow object detection api: faster rcnn inception v2 coco
- r-fcn: object detection via region-based fully convolutional networks
- tensorflow: tensorflow object detection api: rfcn resnet101 coco
- multi-scale feature fusion single shot object detector based on densenet
- shell: the shell eco marathon
- real-time loop closure in 2d lidar slam
- kinodynamic trajectory optimization and control for car-like robots
- experiments with the graph traverser program
- robot operating system
- automatic differentiation in pytorch
- ssd: single shot multibox detector
- yolov3: an incremental improvement
- rich feature hierarchies for accurate object detection and semantic segmentation
- mobilenets: efficient convolutional neural networks for mobile vision applications
- deep residual learning for image recognition
- are we ready for autonomous driving? the kitti vision benchmark suite
- m2det: a single-shot object detector based on multi-level feature pyramid network

we thank our sponsor! main sponsor: esentri ag, ettlingen. this research and development project is funded by the german federal ministry of education and research (bmbf) and the european social fund (esf) within the program "future of work" (02l17c550) and implemented by the project management agency karlsruhe (ptka). the author is responsible for the content of this publication. underlying projects to this article are funded by the wtd 81 of the german federal ministry of defense. the authors are responsible for the content of this article. this work was developed in the fraunhofer cluster of excellence "cognitive internet technologies".